
Q Training details

Table 28 reports our training configurations. Unless Table 28 notes otherwise, all RL runs use LoRA rank 32 with $\alpha = 64$, applied to all linear modules; Dr. GRPO group size 64; 8 prompts per batch; 1024-token rollouts; 15-turn mazes with rewards $+20/-10/-0.1$ (Gold/Mold/step); 10% wind; 10 rollouts per prompt; and sampling temperature 0.7.

| Run | Algorithm | Base | LR | LoRA $r/\alpha$ | Ent. | Step | Notes |
|---|---|---|---|---|---|---|---|
| 4B Instruct Dr. GRPO (main) | Dr. GRPO | 4B-Ins | 3e-6 | 32/64 | 0.01 | 95 | primary reference |
| 4B Instruct Dr. GRPO (swap) | Dr. GRPO | 4B-Ins | 3e-6 | 32/64 | 0.01 | 50 | Mold ↔ Gold |
| 4B Base | Dr. GRPO | 4B-Base | 3e-5 | 32/64 | 0 | 101 | no eq-ent bonus |
| 4B Base (shuffle) | Dr. GRPO | 4B-Base | 3e-5 | 32/64 | 0 | 101 | 3-way permuted tiles |
| 8B Cardinal Dr. GRPO | Dr. GRPO | 8B | 3e-6 | 32/64 | 0.05 | 65 | higher ent. coef. |
| 4B Instruct SFT | SFT | 4B-Ins | 2e-5 | 32/64 | n/a | 9375 | 50k traj, 3 epochs |
| GPT-OSS-20B (Tinker) | Tinker GRPO | 20B | 3e-5 | 32/32 | 0.01 | 160 | 11-turn, 512 tok |
| 4B Token REINFORCE | REINFORCE | 4B-Ins | 3e-6 | 32/64 | 0.2 | 115 | high ent. for variance |
| 4B Instruct FFT SFT | SFT | 4B-Ins | 1e-5 | n/a | n/a | 3125 | 50k traj, 1 epoch |
| 4B Instruct FFT RL | Dr. GRPO | 4B-Ins | 3e-6 | n/a | 0.01 | 95 | group size 32; 32 prompts/batch |
Table 28. Per-run hyperparameters for each checkpoint the body reports on. “Ent.” is the equalized entropy bonus regularization coefficient (Appendix J.3). “Step” is the checkpoint index used for concept vector extraction.

Q.1 Token-level REINFORCE

A possible worry is that only Dr. GRPO, rather than RL in general, recruits the functional welfare axis. We therefore also train a model via token-level REINFORCE. While Dr. GRPO applies a single trajectory-level advantage to all tokens, token-level REINFORCE assigns each action its own per-position advantage, built from a returns-to-go signal and a per-position group baseline.

Per-turn reward signal.

Each rollout produces a sequence of turns $t = 0, 1, \dots, T_i - 1$ for trajectory $i$. At every turn the agent emits a single direction token and the environment returns a scalar reward

$$
r_{i,t} \;=\; r_{\text{step}} \;+\; r_{\text{Gold}} \cdot \Delta g_{i,t} \;+\; r_{\text{Mold}} \cdot \Delta \ell_{i,t},
$$

where $\Delta g_{i,t}$ and $\Delta \ell_{i,t}$ are the number of Gold tiles consumed and Mold tiles entered on turn $t$, respectively, and the constants take the same values used by Dr. GRPO: $r_{\text{step}} = -0.1$, $r_{\text{Gold}} = +20$, $r_{\text{Mold}} = -10$.
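
As a minimal sketch of the formula above (the function and constant names are illustrative and not taken from the training code), the per-turn reward can be computed as:

```python
# Reward constants as stated in the text.
R_STEP = -0.1   # cost applied on every turn
R_GOLD = 20.0   # reward per Gold tile consumed
R_MOLD = -10.0  # penalty per Mold tile entered


def turn_reward(gold_consumed: int, mold_entered: int) -> float:
    """Scalar reward r_{i,t} for a single turn (hypothetical helper)."""
    return R_STEP + R_GOLD * gold_consumed + R_MOLD * mold_entered


# Hypothetical three-turn episode: a plain step, a Gold tile, then a Mold tile.
rewards = [turn_reward(0, 0), turn_reward(1, 0), turn_reward(0, 1)]
# rewards is approximately [-0.1, 19.9, -10.1]
```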

Returns-to-go.

We use undiscounted returns-to-go ($\gamma = 1$):

$$
G_{i,t} \;=\; \sum_{k=t}^{T_i - 1} r_{i,k}.
$$

$G_{i,t}$ is the trajectory's reward summed from the current turn through termination, so the action emitted at turn $t$ is credited with all downstream consequences of the policy's choices from $t$ onward.
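
The sum above can be computed in one backward pass over the turn rewards. The following is a sketch only; `returns_to_go` is a hypothetical helper, and the `gamma` argument is included for generality even though the text fixes $\gamma = 1$.

```python
def returns_to_go(rewards: list[float], gamma: float = 1.0) -> list[float]:
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for every turn t."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Toy rewards for illustration only.
print(returns_to_go([1.0, 2.0, 3.0]))  # [6.0, 5.0, 3.0]
```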

Per-position group baseline.

For every batch we sample $|\mathcal{G}|$ rollouts per maze prompt (the GRPO group). Within a group $g$, we write $\mathcal{G}$ for its index set and let $m_{i,t} \in \{0,1\}$ be the response mask (1 at valid turn positions, 0 at padding for trajectories that terminated before the maximum turn limit). The baseline at turn position $t$ is the mask-weighted mean return-to-go across that group's still-active members,

$$
b^{(g)}_{t} \;=\; \frac{\sum_{i \in \mathcal{G}} m_{i,t}\, G_{i,t}}{\max\!\Bigl(\sum_{i \in \mathcal{G}} m_{i,t},\, 1\Bigr)},
$$

and the per-turn advantage is

$$
A_{i,t} \;=\; \bigl(G_{i,t} - b^{(g(i))}_{t}\bigr)\, m_{i,t},
$$

where $g(i)$ denotes the group containing trajectory $i$.
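
The baseline and advantage above can be sketched for a single group as follows. This is an illustration under assumptions: `per_position_advantages` is a hypothetical name, returns-to-go are assumed to be padded to a common length, and the numbers are toy values.

```python
def per_position_advantages(returns, mask):
    """Compute b_t and A_{i,t} for one group.

    returns[i][t] is the return-to-go G_{i,t}; mask[i][t] is 1 at valid
    turns and 0 at padding. Both are padded to the same number of turns.
    """
    n_traj, n_turns = len(returns), len(returns[0])
    baselines = []
    for t in range(n_turns):
        active = sum(mask[i][t] for i in range(n_traj))
        total = sum(mask[i][t] * returns[i][t] for i in range(n_traj))
        # max(..., 1) guards against positions where every member has ended.
        baselines.append(total / max(active, 1))
    advantages = [
        [(returns[i][t] - baselines[t]) * mask[i][t] for t in range(n_turns)]
        for i in range(n_traj)
    ]
    return baselines, advantages


# Toy group: trajectories of length 3, 1, and 2.
returns = [[3.0, 2.0, 1.0], [1.0, 0.0, 0.0], [5.0, 4.0, 0.0]]
mask = [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
baselines, advantages = per_position_advantages(returns, mask)
# baselines == [3.0, 3.0, 1.0]
```

In this toy group the baseline at turn 1 averages only the two trajectories still active, and the advantage at every padded position is zero because of the mask.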

Token-level PPO surrogate.

Following Ye et al. [31], we use a dual-clipping PPO variant. Let $\pi_{\theta}$ denote the current policy and $\pi_{\theta_{\text{old}}}$ the rollout-time policy. The importance ratio at turn $t$ of trajectory $i$ is

$$
\rho_{i,t} \;=\; \exp\!\bigl(\log \pi_{\theta}(a_{i,t} \mid s_{i,t}) - \log \pi_{\theta_{\text{old}}}(a_{i,t} \mid s_{i,t})\bigr),
$$

where the log-ratio is clamped to $[-20, 20]$ for numerical stability. We use the same dual-clip surrogate as Dr. GRPO, applied with per-turn advantages instead of trajectory advantages:

$$
L^{\text{clip}}_{i,t} \;=\;
\begin{cases}
\max\!\bigl(-\rho_{i,t} A_{i,t},\, -\bar\rho_{i,t} A_{i,t}\bigr) & \text{if } A_{i,t} \ge 0, \\[2pt]
\min\!\bigl(\max(-\rho_{i,t} A_{i,t},\, -\bar\rho_{i,t} A_{i,t}),\, -c\, A_{i,t}\bigr) & \text{if } A_{i,t} < 0,
\end{cases}
$$

where $\bar\rho_{i,t} = \mathrm{clip}(\rho_{i,t}, 1 - \epsilon, 1 + \epsilon)$, $\epsilon = 0.2$, and $c = 3.0$.
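
A scalar sketch of this per-turn loss may help make the two branches concrete. It assumes plain Python floats rather than batched tensors, and `dual_clip_loss` and its argument names are illustrative, not the trainer's actual interface.

```python
import math


def dual_clip_loss(logp_new: float, logp_old: float, advantage: float,
                   eps: float = 0.2, c: float = 3.0,
                   log_ratio_clamp: float = 20.0) -> float:
    """Per-turn dual-clip PPO loss (to be minimized) for one action."""
    # Clamp the log-ratio before exponentiating, for numerical stability.
    log_ratio = max(-log_ratio_clamp, min(log_ratio_clamp, logp_new - logp_old))
    rho = math.exp(log_ratio)
    rho_bar = max(1.0 - eps, min(1.0 + eps, rho))
    # Standard PPO clipped loss: the pessimistic one of the two surrogates.
    clipped = max(-rho * advantage, -rho_bar * advantage)
    if advantage >= 0:
        return clipped
    # Dual clip: bound the loss for negative advantages by -c * A.
    return min(clipped, -c * advantage)


# On-policy sample (ratio 1) with advantage 1.
print(dual_clip_loss(-1.0, -1.0, 1.0))  # -1.0
# Ratio near 5 with advantage -1: the dual clip caps the loss at c = 3.
print(dual_clip_loss(math.log(5.0), 0.0, -1.0))  # 3.0
```

Without the second clip, the last example would give a loss of about 5, so the bound $-c\,A_{i,t}$ limits how strongly a single large-ratio, negative-advantage action can drive the update.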

Differences from Dr. GRPO.

Token-level REINFORCE differs from our Dr. GRPO setup in three places:

  1. the reward signal stored per turn ($r_{i,t}$ versus the end-of-trajectory return $R_i = \sum_{t} r_{i,t}$);

  2. the advantage construction ($A_{i,t} = G_{i,t} - b^{(g)}_{t}$ versus $A_i = R_i - \bar R^{(g)}$, with no token-level credit assignment in Dr. GRPO);

  3. the broadcast of advantages onto the surrogate (per-turn $A_{i,t}$ versus replicating $A_i$ across all turns).

The clipped surrogate, KL penalty, equalized entropy bonus, normalization constant $Z$, optimizer, schedules, and rollout setup are identical between the two trainers.
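
To illustrate the contrast in advantage construction listed above, the sketch below computes both kinds of advantage for a toy group. It assumes equal-length trajectories (so the mask is all ones), and the helper names and reward values are hypothetical.

```python
def trajectory_advantages(turn_rewards):
    """Dr. GRPO-style: A_i = R_i - mean(R), replicated across all turns."""
    totals = [sum(r) for r in turn_rewards]
    mean_total = sum(totals) / len(totals)
    return [[total - mean_total] * len(r)
            for total, r in zip(totals, turn_rewards)]


def per_turn_advantages(turn_rewards):
    """Token-level REINFORCE-style: A_{i,t} = G_{i,t} - b_t."""
    n_turns = len(turn_rewards[0])
    # Undiscounted returns-to-go for each trajectory.
    rtg = [[sum(r[t:]) for t in range(n_turns)] for r in turn_rewards]
    # Per-position baseline: mean return-to-go over the group at turn t.
    base = [sum(g[t] for g in rtg) / len(rtg) for t in range(n_turns)]
    return [[g[t] - base[t] for t in range(n_turns)] for g in rtg]


# Two toy trajectories with the same total reward but different timing.
group = [[0.0, 10.0], [10.0, 0.0]]
print(trajectory_advantages(group))  # [[0.0, 0.0], [0.0, 0.0]]
print(per_turn_advantages(group))    # [[0.0, 5.0], [0.0, -5.0]]
```

In this toy case the trajectory-level advantages vanish because both returns are equal, while the per-turn advantages still distinguish the second action of each trajectory.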

Figure 43 shows the training signal for each row of Table 28.

Figure 43. Training curves for every checkpoint listed in Table 28.