M Comparison of concept vector norms

All of our steering interventions add a scalar multiple of a concept vector to the residual stream. The effective strength of a given $\alpha$ therefore depends on the $L^{2}$ norm of the vector being added, which varies across checkpoints and concepts. Table 27 reports those norms for the reward vectors and their controls.

| Checkpoint | Mold control | Gold control | Mold | Gold |
|---|---|---|---|---|
| Qwen3 4B Instruct Dr. GRPO | 4.46 (L22) | 7.51 (L22) | 4.07 (L20) | 14.37 (L22) |
| Qwen3 4B Instruct Dr. GRPO (tiles swapped) | ” | ” | 3.79 (L21) | 10.85 (L22) |
| Qwen3 4B Instruct Dr. GRPO (FFT) | ” | ” | 3.70 (L21) | 9.87 (L22) |
| Qwen3 4B Instruct SFT | ” | ” | 19.71 (L22) | 19.54 (L22) |
| Qwen3 4B Instruct SFT (FFT) | ” | ” | 44.61 (L31) | 18.40 (L20) |
| Qwen3 4B Instruct REINFORCE | ” | ” | 5.79 (L20) | 27.09 (L24) |
| Qwen3 4B Base | 4.03 (L22) | 6.85 (L23) | 16.65 (L23) | 14.31 (L21) |
| Qwen3 4B Base (tiles swapped) | ” | ” | 16.12 (L22) | 13.28 (L21) |
| Qwen3 8B | 7.52 (L22) | 13.53 (L22) | 14.28 (L20) | 23.41 (L21) |
| GPT-OSS-20B Dr. GRPO | 149.08 (L16) | 343.70 (L17) | 991.65 (L16) | 442.13 (L14) |
Table 27. $L^{2}$ norms of extracted concept vectors at their auto-selected steering layer (shown in parentheses). Controls are extracted once per underlying base model; ditto marks ( ” ) indicate a duplicate control shared with an earlier row on the same base model. Note that GPT-OSS-20B is in a different norm regime ($\sim 100$–$1000$) than the Qwen3 family ($\sim 4$–$30$).
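To make the dependence on norm concrete, the following is a minimal sketch, assuming the residual stream at one position can be treated as a plain vector. The function names, the toy width `d_model`, and the random vectors are hypothetical; only the two norms (4.07 and 991.65, the Mold entries for Qwen3 4B Instruct Dr. GRPO and GPT-OSS-20B Dr. GRPO in Table 27) come from the table. It shows that the same nominal $\alpha$ produces perturbations of size $|\alpha| \, \lVert \mathbf{v} \rVert$, which differ by orders of magnitude across the two regimes.

```python
import numpy as np


def steer(resid: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scalar multiple of concept vector v to a residual-stream activation."""
    return resid + alpha * v


def perturbation_norm(v: np.ndarray, alpha: float) -> float:
    """L2 magnitude of the added perturbation: |alpha| * ||v||_2."""
    return abs(alpha) * float(np.linalg.norm(v))


rng = np.random.default_rng(0)
d_model = 64  # toy hidden width, not the real model dimension


def random_vector_with_norm(norm: float) -> np.ndarray:
    """Random direction rescaled to a prescribed L2 norm (illustration only)."""
    x = rng.standard_normal(d_model)
    return x * (norm / np.linalg.norm(x))


v_small = random_vector_with_norm(4.07)    # Qwen3-regime norm from Table 27
v_large = random_vector_with_norm(991.65)  # GPT-OSS-regime norm from Table 27
resid = rng.standard_normal(d_model)

alpha = 2.0
for name, v in [("small-norm", v_small), ("large-norm", v_large)]:
    delta = np.linalg.norm(steer(resid, v, alpha) - resid)
    print(f"{name}: perturbation magnitude at alpha={alpha} is {delta:.2f}")
```

Because the perturbation magnitude scales linearly with the vector norm, a fixed $\alpha$ is not comparable across rows of Table 27 without accounting for these norms.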

M.1 Norm-matching control-vector steering

The steering evaluations in §4 compare the trained reward vectors $\mathbf{v}_{c}$ against the maze-naive control concept vectors $\mathbf{u}_{c}$ extracted by the same pipeline (§2.3). Because $\mathbf{u}_{c}$ and $\mathbf{v}_{c}$ are extracted by the same pipeline from checkpoints that differ only in whether maze training has occurred, the controls isolate the contribution of maze training to the downstream steering effects. A random-direction baseline could rule out “any vector steers,” but $\mathbf{u}_{c}$ is a more targeted control: it additionally rules out artifacts of the extraction pipeline itself (e.g. the trajectory construction or the difference-in-means procedure) and of the emoji tokens’ pre-existing representations. Any steering effect present for $\mathbf{v}_{c}$ but absent for $\mathbf{u}_{c}$ is therefore attributable to how maze training reshaped the model’s representations of the rewarded tiles.

One concern is that the trained vectors and the controls generally have different $L^{2}$ norms (Table 27). We therefore norm-match: control vectors are steered with a rescaled coefficient $\beta$ chosen so that $\beta \lVert \mathbf{u}_{c} \rVert = \alpha \lVert \mathbf{v}_{c} \rVert$, i.e. $\beta = \alpha \lVert \mathbf{v}_{c} \rVert / \lVert \mathbf{u}_{c} \rVert$, so the residual-stream perturbation has the same magnitude at every nominal $\alpha$. In figures we plot the control vectors at their nominal $\alpha$ rather than the rescaled $\beta$, so trained and control curves can be read against the same horizontal axis. This is the convention used wherever a trained reward vector is compared against a control in the body steering evaluations (§4); it is also reused for the maze-naive control comparisons in Appendix C, for the VAA comparison in Appendix H, and for the sentiment-subspace decomposition in Appendix F.5.
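The norm-matching rule can be sketched as follows. This is a minimal illustration rather than the actual evaluation code: `matched_beta` and the toy vectors are hypothetical, and the example assumes the norms used for matching are the Gold entries for Qwen3 4B Instruct Dr. GRPO in Table 27 (trained 14.37, control 7.51).

```python
import numpy as np


def matched_beta(alpha: float, v: np.ndarray, u: np.ndarray) -> float:
    """Return beta such that beta * ||u||_2 == alpha * ||v||_2."""
    u_norm = float(np.linalg.norm(u))
    if u_norm == 0.0:
        raise ValueError("control vector has zero norm; cannot norm-match")
    return alpha * float(np.linalg.norm(v)) / u_norm


rng = np.random.default_rng(0)
d_model = 64  # toy hidden width, not the real model dimension


def random_vector_with_norm(norm: float) -> np.ndarray:
    """Random direction rescaled to a prescribed L2 norm (illustration only)."""
    x = rng.standard_normal(d_model)
    return x * (norm / np.linalg.norm(x))


v_c = random_vector_with_norm(14.37)  # stand-in for a trained reward vector
u_c = random_vector_with_norm(7.51)   # stand-in for its maze-naive control

alpha = 1.0
beta = matched_beta(alpha, v_c, u_c)

# After matching, both perturbations have the same L2 magnitude.
trained_size = np.linalg.norm(alpha * v_c)
control_size = np.linalg.norm(beta * u_c)
print(f"beta = {beta:.3f}, trained = {trained_size:.2f}, control = {control_size:.2f}")
```

With these norms, a nominal $\alpha = 1$ corresponds to $\beta = 14.37 / 7.51 \approx 1.91$ for the control. Since $\beta$ is linear in $\alpha$, the sign of $\alpha$ carries over, so negative steering coefficients are matched the same way.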

We do not renormalize the trained reward vectors themselves, since their norms may themselves be a signal of representation strength.