Q Training details
Table 28 reports our training configurations. Across all RL runs, we use LoRA rank 32 and alpha 64 (written 32/64 in the table), applied to all-linear modules; Dr. GRPO group size 64; 8 prompts per batch; 1024-token rollouts; 15-turn mazes with rewards ; 10% wind; 10 rollouts per prompt; sampling temperature 0.7.
| Run | Algorithm | Base | LR | LoRA | Ent. | Step | Notes |
|---|---|---|---|---|---|---|---|
| 4B Instruct Dr. GRPO (main) | Dr. GRPO | 4B-Ins | | 32/64 | 0.01 | 95 | primary reference |
| 4B Instruct Dr. GRPO (swap) | Dr. GRPO | 4B-Ins | | 32/64 | 0.01 | 50 | Mold ↔ Gold |
| 4B Base | Dr. GRPO | 4B-Base | | 32/64 | 0 | 101 | no eq-ent bonus |
| 4B Base (shuffle) | Dr. GRPO | 4B-Base | | 32/64 | 0 | 101 | 3-way permuted tiles |
| 8B Cardinal Dr. GRPO | Dr. GRPO | 8B | | 32/64 | 0.05 | 65 | higher ent. coef. |
| 4B Instruct SFT | SFT | 4B-Ins | | 32/64 | n/a | 9375 | 50k traj, 3 epochs |
| GPT-OSS-20B (Tinker) | Tinker GRPO | 20B | | 32/32 | 0.01 | 160 | 11-turn, 512 tok |
| 4B Token REINFORCE | REINFORCE | 4B-Ins | | 32/64 | 0.2 | 115 | high ent. for variance |
| 4B Instruct FFT SFT | SFT | 4B-Ins | | n/a | n/a | 3125 | 50k traj, 1 epoch |
| 4B Instruct FFT RL | Dr. GRPO | 4B-Ins | | n/a | 0.01 | 95 | group size 32; 32 prompts/batch |
Q.1 Token-level REINFORCE
A possible worry is that only Dr. GRPO, rather than RL in general, recruits the functional welfare axis. We therefore also train a model with token-level REINFORCE. Whereas Dr. GRPO applies a single trajectory-level advantage to all tokens, token-level REINFORCE assigns each action its own per-position advantage, built from a returns-to-go signal and a per-position group baseline.
Per-turn reward signal.
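Each rollout produces a sequence of turns $t = 1, \dots, T_i$ for trajectory $i$. At every turn the agent emits a single direction token and the environment returns a scalar reward

$$r_{i,t} = c_{\text{Gold}}\, n^{\text{Gold}}_{i,t} + c_{\text{Mold}}\, n^{\text{Mold}}_{i,t},$$

where $n^{\text{Gold}}_{i,t}$ and $n^{\text{Mold}}_{i,t}$ are the number of Gold tiles consumed and Mold tiles entered on turn $t$, respectively, and the signed constants $c_{\text{Gold}}$ and $c_{\text{Mold}}$ take the same values as in Dr. GRPO.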
Returns-to-go.
We use undiscounted returns-to-go ($\gamma = 1$):

$$G_{i,t} = \sum_{t'=t}^{T_i} r_{i,t'}.$$

$G_{i,t}$ is the per-turn reward $r_{i,t'}$ of trajectory $i$ summed from the current turn $t$ through its final turn $T_i$, so the action emitted at turn $t$ is credited with all downstream consequences of the policy’s choices from $t$ onward.
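As a minimal sketch of this computation (the function name and the example rewards are illustrative, not taken from our training code), the returns-to-go of one trajectory can be accumulated in a single backward pass. The `gamma` argument is included only to make the undiscounted case explicit.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return G_t = sum over t' >= t of gamma**(t' - t) * r_t' for every turn t."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn reuses the tail sum accumulated so far.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


# Made-up per-turn rewards for a 4-turn trajectory (hypothetical values).
rewards = [0.0, 1.0, -2.0, 1.0]
print(returns_to_go(rewards))  # [0.0, 0.0, -1.0, 1.0]
```

With `gamma = 1.0` the first entry equals the full trajectory return, and the last entry equals the final per-turn reward.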
Per-position group baseline.
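For every batch we sample rollouts per maze prompt (the GRPO group). Within a group $g$, let $\mathcal{I}_g$ denote its index set and let $m_{i,t} \in \{0, 1\}$ be the response mask (1 at valid turn positions, 0 at padding for trajectories that terminated before the maximum turn limit). The baseline at turn position $t$ is the mask-weighted mean of the returns-to-go $G_{j,t}$ across that group’s still-active members,

$$b_{g,t} = \frac{\sum_{j \in \mathcal{I}_g} m_{j,t}\, G_{j,t}}{\sum_{j \in \mathcal{I}_g} m_{j,t}},$$

and the per-turn advantage is

$$A_{i,t} = G_{i,t} - b_{g,t}, \qquad i \in \mathcal{I}_g.$$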
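The masked baseline and per-turn advantage for a single group can be sketched as follows. This is an illustration under stated assumptions rather than our trainer: the function name and the toy numbers are hypothetical, rollouts are assumed to be padded to a common length, and advantages at padded positions are set to zero (the text above does not specify their value, and the mask removes them from the loss either way).

```python
from typing import List


def per_position_advantages(
    returns: List[List[float]], mask: List[List[int]]
) -> List[List[float]]:
    """returns[i][t] is the return-to-go of rollout i at turn t; mask[i][t] is 1 if valid."""
    n_turns = len(returns[0])
    adv = [[0.0] * n_turns for _ in returns]
    for t in range(n_turns):
        # Only rollouts that are still active at turn t contribute to the baseline.
        active = [i for i in range(len(returns)) if mask[i][t]]
        if not active:
            continue  # no rollout reached this turn; advantages stay at zero
        baseline = sum(returns[i][t] for i in active) / len(active)
        for i in active:
            adv[i][t] = returns[i][t] - baseline
    return adv


# Toy group of three rollouts; the third terminated after two turns (assumed padding).
returns = [[3.0, 2.0, 1.0], [1.0, 0.0, -1.0], [2.0, 1.0, 0.0]]
mask = [[1, 1, 1], [1, 1, 1], [1, 1, 0]]
print(per_position_advantages(returns, mask))
# [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0], [0.0, 0.0, 0.0]]
```

At the last turn the baseline is computed over the two surviving rollouts only, which is what distinguishes the mask-weighted mean from a plain mean over the padded group.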
Token-level PPO surrogate.
Following Ye et al. [31], we use a dual-clipping PPO variant. Let $\pi_\theta$ denote the current policy and $\pi_{\theta_{\text{old}}}$ the rollout-time policy. The importance ratio at turn $t$ of trajectory $i$ is

$$\rho_{i,t} = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\text{old}}}(a_{i,t} \mid s_{i,t})},$$
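clamped in log-space for numerical stability. We use the same dual-clip surrogate as Dr. GRPO, applied with per-turn advantages instead of trajectory advantages. Writing $\rho_{i,t}$ for the importance ratio and $A_{i,t}$ for the per-turn advantage, the per-turn term to be maximized is

$$\ell_{i,t}(\theta) = \begin{cases} \min\bigl(\rho_{i,t} A_{i,t},\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}})\, A_{i,t}\bigr), & A_{i,t} \ge 0, \\ \max\Bigl(\min\bigl(\rho_{i,t} A_{i,t},\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}})\, A_{i,t}\bigr),\ c\, A_{i,t}\Bigr), & A_{i,t} < 0, \end{cases}$$

where $\epsilon_{\text{low}}$ and $\epsilon_{\text{high}}$ are the lower and upper clip ranges and $c$ is the dual-clip constant.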
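To make the dual-clip behaviour concrete, here is a small scalar sketch of the standard dual-clip PPO term of Ye et al. for a single turn. It is not our trainer's code: the function name is hypothetical, and the hyperparameter values in the example calls (`eps_low=0.2`, `eps_high=0.2`, `c=3.0`, `log_clamp=20.0`) are placeholders chosen for illustration, not the values used in our runs.

```python
import math


def dual_clip_term(
    log_ratio: float,
    advantage: float,
    eps_low: float,
    eps_high: float,
    c: float,
    log_clamp: float,
) -> float:
    """Per-turn surrogate (to be maximized) with PPO clipping plus a dual-clip floor."""
    # Clamp the log-ratio before exponentiating, for numerical stability.
    log_ratio = max(-log_clamp, min(log_clamp, log_ratio))
    ratio = math.exp(log_ratio)
    clipped = max(1.0 - eps_low, min(1.0 + eps_high, ratio))
    ppo = min(ratio * advantage, clipped * advantage)
    if advantage < 0:
        # Dual clip: bound the term from below by c * A when the advantage is negative.
        return max(ppo, c * advantage)
    return ppo


# Placeholder hyperparameters (assumptions for illustration only).
hp = dict(eps_low=0.2, eps_high=0.2, c=3.0, log_clamp=20.0)
print(dual_clip_term(math.log(1.5), 1.0, **hp))    # upper clip is active
print(dual_clip_term(math.log(10.0), -1.0, **hp))  # dual-clip floor is active
```

The second call shows why the extra clip matters: with a large ratio and a negative advantage, ordinary PPO clipping would leave the term unbounded at roughly `-10`, whereas the dual clip caps it at `c * advantage`.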
Differences from Dr. GRPO.
Token-level REINFORCE differs from our Dr. GRPO setup in three places:
1. the reward signal stored per turn (the per-turn reward $r_{i,t}$ versus the end-of-trajectory return $R_i$);
2. the advantage construction (the per-turn advantage $A_{i,t}$ versus the trajectory advantage $A_i$, with no token-level credit assignment in Dr. GRPO);
3. the broadcast of advantages onto the surrogate (per-turn $A_{i,t}$ versus replicating $A_i$ across all turns).

The clipped surrogate, KL penalty, equalized entropy bonus, normalization constant, optimizer, schedules, and rollout setup are identical between the two trainers.
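The contrast between the two advantage constructions can be illustrated with a toy group. The sketch below assumes the standard Dr. GRPO advantage (trajectory return minus the group mean, with no standard-deviation normalization) and, for brevity, equal-length rollouts with no padding mask. All function names and reward values are hypothetical.

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """Undiscounted tail sums of the per-turn rewards."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


def trajectory_advantages(group: List[List[float]]) -> List[List[float]]:
    """Dr. GRPO-style: one mean-centred return per rollout, replicated over its turns."""
    totals = [sum(rewards) for rewards in group]
    mean_total = sum(totals) / len(totals)
    return [[total - mean_total] * len(rewards) for total, rewards in zip(totals, group)]


def per_turn_advantages(group: List[List[float]]) -> List[List[float]]:
    """Token-level REINFORCE-style: return-to-go minus a per-position group mean."""
    rtg = [returns_to_go(rewards) for rewards in group]
    n_turns = len(rtg[0])
    baselines = [sum(g[t] for g in rtg) / len(rtg) for t in range(n_turns)]
    return [[g[t] - baselines[t] for t in range(n_turns)] for g in rtg]


# Two rollouts with the same total return but different reward timing (made up).
group = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(trajectory_advantages(group))  # [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(per_turn_advantages(group))    # [[0.0, -0.5, -0.5], [0.0, 0.5, 0.5]]
```

Because both rollouts have the same total return, the trajectory-level advantage is zero everywhere, while the per-turn construction still separates the later actions of the two rollouts.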
Figure 43 shows the training signal for each row of Table 28.