
L Extraction and evaluation details

This appendix collects various methodological details deferred from §2.3 and §4: how the off-policy trajectories used for concept-vector extraction are constructed (Appendix L.1), a check that the four final-move directions are balanced across the three terminal-tile classes (Appendix L.2), how the residual-stream layer is chosen for each of the four analyses that need one (steering, logit-lens, emotion scatter, tile-mean cosine), and the prompt and rollout counts per evaluation that feed the steering results of §4 (Table 26).

L.1 Off-policy trajectory construction

For concept-vector extraction, we capture activations from “off-policy”, programmatically generated trajectories (§2.3). We use this construction because it is the only way to guarantee that the only systematic difference between the three activation classes is the tile type of the final step.

Every time we extract concept vectors, we generate 5,000 trajectories per tile class (Mold, Gold, Path) for a total of 15,000 trajectories, distributed evenly across step counts $n \in \{1, \dots, 15\}$. Each trajectory uses its own freshly generated maze. Mazes are produced by the same generator used in training (Appendix J), but with a different seed, $\text{base\_seed} = 474747$, incremented per maze for reproducibility.

Given a maze and target $(n, c)$ where $c \in \{\text{Mold}, \text{Gold}, \text{Path}\}$, we run a constrained random walk from the agent’s start position: the first $n-1$ steps choose uniformly among adjacent Path tiles, and the final step is chosen to land on a tile of type $c$. If no such walk exists in a given maze, we discard it and draw a fresh maze. For Gold trajectories with $n$ smaller than the optimal path length, or with mismatched parity, the maze is rejected immediately. The final trajectory is rendered into the same multi-turn chat format used at training time, with each turn an exchange of (user prompt describing the four adjacent tiles, assistant single-letter move in $\{\texttt{N}, \texttt{E}, \texttt{S}, \texttt{W}\}$).
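The constrained walk described above can be sketched as follows. This is a minimal illustration, not the generator used in the paper: the maze encoding (a dictionary from grid cells to tile classes, with walls simply absent) and the names `neighbors` and `constrained_walk` are assumptions made for the example, and a single failed attempt returns `None` where the real pipeline would retry or draw a fresh maze.

```python
import random

# Hypothetical maze encoding (an assumption for illustration):
# dict mapping (row, col) -> tile class; cells absent from the dict are walls.
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def neighbors(maze, pos, tile):
    """Return (letter, cell) pairs for adjacent cells of the given tile class."""
    out = []
    for letter, (dr, dc) in MOVES.items():
        cell = (pos[0] + dr, pos[1] + dc)
        if maze.get(cell) == tile:
            out.append((letter, cell))
    return out


def constrained_walk(maze, start, n, target, rng):
    """Sample n moves: n-1 uniform steps over Path tiles, then one onto `target`.

    Returns the list of move letters, or None if this attempt dead-ends.
    """
    pos, moves = start, []
    for _ in range(n - 1):
        options = neighbors(maze, pos, "Path")
        if not options:
            return None
        letter, pos = rng.choice(options)
        moves.append(letter)
    final = neighbors(maze, pos, target)
    if not final:
        return None
    letter, _ = rng.choice(final)
    moves.append(letter)
    return moves


# Tiny corridor: start at (0, 0), Gold two steps east, Mold south of the middle cell.
maze = {(0, 0): "Path", (0, 1): "Path", (0, 2): "Gold", (1, 1): "Mold"}
rng = random.Random(0)
print(constrained_walk(maze, (0, 0), 2, "Gold", rng))  # ['E', 'E']
print(constrained_walk(maze, (0, 0), 2, "Mold", rng))  # ['E', 'S']
print(constrained_walk(maze, (0, 0), 3, "Gold", rng))  # None
```

In this toy maze the three-step Gold target fails because Gold sits at distance 2 from the start, so only even step counts can end on it. This is the parity condition under which the text rejects a maze immediately.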

Each formatted trajectory is tokenized under the model’s chat template, with no $\langle\!|\text{im\_end}|\!\rangle$ appended after the final assistant move. We capture the residual stream at every transformer block at token position $-1$, the last direction letter the agent generated. The Mold and Gold concept vectors are then computed as the per-layer differences of class means, as in Equation 1.
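As a rough illustration of the per-layer difference of class means, the sketch below operates on pre-captured activations stacked into arrays of shape (samples, layers, width). The function name `concept_vectors`, the array layout, and the synthetic data are assumptions for the example. Which class serves as the negative set is fixed by Equation 1 and is deliberately left generic here.

```python
import numpy as np


def concept_vectors(pos_acts, neg_acts):
    """Per-layer difference of class means.

    pos_acts, neg_acts: arrays of shape (num_samples, L, d) holding the
    residual stream at the captured token position for every block.
    Returns an array of shape (L, d), one vector per layer.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


# Synthetic stand-in data (not real activations): L = 4 blocks, d = 8.
rng = np.random.default_rng(0)
L, d = 4, 8
shift = np.zeros((L, d))
shift[2, 0] = 3.0  # plant a class difference at layer 2, dimension 0
pos = rng.normal(size=(500, L, d)) + shift
neg = rng.normal(size=(500, L, d))

v = concept_vectors(pos, neg)
print(v.shape)  # (4, 8)
strongest_layer = int(np.linalg.norm(v, axis=1).argmax())
print(strongest_layer)  # layer whose class means differ the most
```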

L.2 Class balance of the extracted trajectories

A potential confound is the direction of the final move: if, for example, Mold trajectories disproportionately ended with S relative to Gold trajectories, then $v_{\text{Mold}}$ could partly encode “the model just emitted S”.

We verify this is not the case:

| Final tile | $n$ | N | E | S | W |
|---|---|---|---|---|---|
| Mold | 5000 | 24.54% | 25.10% | 25.38% | 24.98% |
| Gold | 5000 | 24.70% | 24.52% | 25.52% | 25.26% |
| Path | 5000 | 24.48% | 26.22% | 24.42% | 24.88% |
| overall | 15000 | 24.57% | 25.28% | 25.11% | 25.04% |
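One way to quantify this balance is a Pearson chi-square test of independence between tile class and final move. The sketch below is an illustrative check added here, not an analysis reported in the paper. It recovers integer counts from the percentages in the table, recomputes the pooled row, and evaluates the statistic.

```python
# Final-move percentages from the table above, per tile class (N, E, S, W).
PCT = {
    "Mold": [24.54, 25.10, 25.38, 24.98],
    "Gold": [24.70, 24.52, 25.52, 25.26],
    "Path": [24.48, 26.22, 24.42, 24.88],
}
N_PER_CLASS = 5000

# Recover integer counts from the reported percentages.
counts = {c: [round(p * N_PER_CLASS / 100) for p in ps] for c, ps in PCT.items()}

# Column totals and the pooled "overall" percentages.
col_totals = [sum(counts[c][j] for c in counts) for j in range(4)]
total = sum(col_totals)
overall = [round(100 * t / total, 2) for t in col_totals]

# Pearson chi-square statistic for independence of tile class and final move.
chi2 = 0.0
for c in counts:
    for j in range(4):
        expected = N_PER_CLASS * col_totals[j] / total
        chi2 += (counts[c][j] - expected) ** 2 / expected

dof = (len(counts) - 1) * (4 - 1)  # (3 - 1) * (4 - 1) = 6
print(overall)  # [24.57, 25.28, 25.11, 25.04]
print(round(chi2, 2), dof)
```

The recomputed pooled percentages match the "overall" row. The statistic comes out well under 12.59, the 5% critical value of a chi-square distribution with 6 degrees of freedom, so this check finds no evidence of direction imbalance across classes.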

Layer selection

We select layers in three different ways depending on the analysis. For concept-vector steering (sentiment, refusal, backtracking, calibration), we pick a single steering layer per (checkpoint, concept) pair from empirical separability metrics computed on held-out activations [33, 20]. For logit-lens unembedding and tile-mean geometry, we use depth-fraction heuristics ($\lfloor 5L/6 \rfloor$ and $\lfloor 2L/3 \rfloor$ respectively), consistent with prior findings on where high-level conceptual information sits in the residual stream [21]. For the emotion scatter, we take the layer that maximizes the mean of the Mold and Gold AUROCs.

Steering layer.

Concept-vector extraction runs the trained agent on its tile-classification dataset and stores the difference of class means at every transformer block. With $L$ blocks and residual width $d$, this yields, for each concept $c \in \{\textsc{mold}, \textsc{gold}, \textsc{path}\}$ and checkpoint, per-layer concept vectors

$$v^{(c)}_{\ell} \;=\; \mu^{(c,+)}_{\ell} - \mu^{(c,-)}_{\ell} \;\in\; \mathbb{R}^{d}, \qquad \mu^{(c,\pm)}_{\ell} \;=\; \frac{1}{|\mathcal{D}^{(c)}_{\pm}|}\sum_{x \in \mathcal{D}^{(c)}_{\pm}} h^{(\ell)}(x), \qquad \ell = 0, \dots, L{-}1,$$

where $\mathcal{D}^{(c)}_{\pm}$ are the tiles labeled positive/negative for concept $c$ and $h^{(\ell)}$ is the residual stream at the output of block $\ell$. To choose a single $\ell^{\star}$ per concept, we then project held-out positive and negative samples onto each candidate $v_{\ell}$, yielding two empirical 1-D distributions $\{s^{+}_{i,\ell} = \langle h^{(\ell)}(x^{+}_{i}), v_{\ell} \rangle\}_{i=1}^{n_+}$ and $\{s^{-}_{j,\ell}\}_{j=1}^{n_-}$, and compute three layer-wise scalars:

$$\mathrm{AUROC}(\ell) \;=\; \Pr\!\bigl[s^{+}_{I,\ell} > s^{-}_{J,\ell}\bigr], \qquad d(\ell) \;=\; \frac{\bar s^{+}_{\ell} - \bar s^{-}_{\ell}}{s_{\text{pool},\ell}}, \qquad \mathrm{ovl}(\ell) \;=\; \sum_{b=1}^{B} \min\!\bigl(\hat p^{+}_{b}(\ell),\, \hat p^{-}_{b}(\ell)\bigr),$$

where $s_{\text{pool},\ell}$ is the pooled standard deviation of the two projection samples (i.e. $d$ is Cohen’s $d$); $\mathrm{ovl}$ is the histogram overlap of cosine similarities $c^{\pm}_{i,\ell} = \langle h^{(\ell)}(x_{i}), v_{\ell} \rangle / (\|h^{(\ell)}(x_{i})\|\,\|v_{\ell}\|)$ binned into $B = 50$ bins on the joint range. We take the per-metric optima

$$\ell^{\star}_{\mathrm{AUROC}} = \arg\max_{\ell} \mathrm{AUROC}(\ell), \qquad \ell^{\star}_{d} = \arg\max_{\ell} \lvert d(\ell) \rvert, \qquad \ell^{\star}_{\mathrm{ovl}} = \arg\min_{\ell} \mathrm{ovl}(\ell),$$

and define the chosen layer as the floor of their unweighted mean,

$$\ell^{\star} \;=\; \left\lfloor \tfrac{1}{3}\bigl(\ell^{\star}_{\mathrm{AUROC}} + \ell^{\star}_{d} + \ell^{\star}_{\mathrm{ovl}}\bigr) \right\rfloor.$$

Each (checkpoint, concept) pair receives its own $\ell^{\star}$; this is the layer used for every concept-vector steering evaluation. The precise choice of $\ell^{\star}$ is not very important: in Appendix D, a sweep of $\ell$ over all 36 layers of the primary 4B Dr. GRPO checkpoint shows a wide band of layers ($\ell \in [17, 26]$) over which the steering effects of §4 still appear.
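The selection rule can be sketched in a few lines of NumPy. This is a hedged illustration under stated assumptions rather than the paper's implementation: activations are assumed to be stacked as (samples, layers, width), ties in AUROC count one half, Cohen's $d$ uses the unbiased pooled variance, and the helper names and synthetic data are invented for the example.

```python
import numpy as np


def auroc(pos, neg):
    """Pr[s+ > s-] over all pairs, counting ties as one half."""
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def cohens_d(pos, neg):
    """Mean gap divided by the pooled standard deviation."""
    n1, n2 = len(pos), len(neg)
    pooled = ((n1 - 1) * pos.var(ddof=1) + (n2 - 1) * neg.var(ddof=1)) / (n1 + n2 - 2)
    return float((pos.mean() - neg.mean()) / np.sqrt(pooled))


def hist_overlap(pos, neg, bins=50):
    """Sum over bins of min(p+, p-) on the joint range of both samples."""
    lo, hi = min(pos.min(), neg.min()), max(pos.max(), neg.max())
    p, _ = np.histogram(pos, bins=bins, range=(lo, hi))
    q, _ = np.histogram(neg, bins=bins, range=(lo, hi))
    return float(np.minimum(p / p.sum(), q / q.sum()).sum())


def select_layer(h_pos, h_neg, v):
    """h_pos, h_neg: held-out activations (n, L, d); v: concept vectors (L, d)."""
    au, d, ov = [], [], []
    for l in range(v.shape[0]):
        s_pos, s_neg = h_pos[:, l] @ v[l], h_neg[:, l] @ v[l]
        au.append(auroc(s_pos, s_neg))
        d.append(abs(cohens_d(s_pos, s_neg)))
        # The overlap metric uses cosine similarities, not raw projections.
        v_norm = np.linalg.norm(v[l])
        c_pos = s_pos / (np.linalg.norm(h_pos[:, l], axis=1) * v_norm)
        c_neg = s_neg / (np.linalg.norm(h_neg[:, l], axis=1) * v_norm)
        ov.append(hist_overlap(c_pos, c_neg))
    picks = (int(np.argmax(au)), int(np.argmax(d)), int(np.argmin(ov)))
    return sum(picks) // 3, picks  # floor of the unweighted mean


# Synthetic demo (not real activations): only layer 1 separates the classes.
rng = np.random.default_rng(0)
L, dim = 3, 8
shift = np.zeros((L, dim))
shift[1, 0] = 4.0


def sample(n, positive):
    return rng.normal(size=(n, L, dim)) + (shift if positive else 0.0)


v = sample(300, True).mean(axis=0) - sample(300, False).mean(axis=0)
layer, picks = select_layer(sample(300, True), sample(300, False), v)
print(layer, picks)
```

One caveat of this sketch: `np.argmax` and `np.argmin` return the first optimum, so if a metric saturates (for example AUROC equal to 1 at several layers) the per-metric pick is biased toward shallower layers. How such ties are resolved is not specified in the text.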

Logit-lens layer.

For logit-lens unembedding (§3.2 and Appendices F, B, C) we project each per-layer concept vector $v^{(c)}_{\ell}$ through the model’s unembedding matrix $W_{U} \in \mathbb{R}^{|V| \times d}$ to read off top-$k$ promoted and suppressed tokens, evaluated at a single depth-fraction layer

$$\ell_{\mathrm{LL}} \;=\; \lfloor 5L/6 \rfloor.$$

For Qwen3-4B and Qwen3-8B ($L = 36$) this gives layer 30; for GPT-OSS-20B ($L = 24$) it gives layer 20.
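A minimal sketch of this readout follows, using a toy unembedding matrix and a made-up vocabulary. The function names are illustrative, and whether a final normalization is applied before unembedding is not stated in the text, so none is applied here.

```python
import numpy as np


def logit_lens_layer(num_blocks):
    """Depth-fraction layer floor(5L/6)."""
    return (5 * num_blocks) // 6


def top_tokens(v, W_U, vocab, k=3):
    """Project a concept vector through W_U (|V| x d) and rank tokens by logit."""
    logits = W_U @ v
    order = np.argsort(logits)
    promoted = [vocab[i] for i in order[::-1][:k]]
    suppressed = [vocab[i] for i in order[:k]]
    return promoted, suppressed


print(logit_lens_layer(36), logit_lens_layer(24))  # 30 20

# Toy example with |V| = 5 hypothetical tokens and d = 2.
vocab = ["tok_a", "tok_b", "tok_c", "tok_d", "tok_e"]
W_U = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [-3.0, 0.0]])
v = np.array([1.0, 0.0])
promoted, suppressed = top_tokens(v, W_U, vocab, k=2)
print(promoted)    # ['tok_a', 'tok_b']
print(suppressed)  # ['tok_e', 'tok_d']
```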

Emotion-scatter layer.

For the emotion projection analyses (§3.3 and Appendix C), we pick the joint-AUROC argmax

$$\ell^{\star}_{\text{emo}} \;=\; \arg\max_{\ell}\, \tfrac{1}{2}\bigl(\mathrm{AUROC}^{\textsc{mold}}(\ell) + \mathrm{AUROC}^{\textsc{gold}}(\ell)\bigr).$$

This yields $\ell = 21$ for Qwen3-4B-Instruct Dr. GRPO (LoRA), $\ell = 23$ for Qwen3-4B-Base, $\ell = 22$ for Qwen3-4B-Instruct Dr. GRPO FFT, and $\ell = 25$ for Qwen3-4B-Instruct SFT FFT. The same layer is reused for the maze-naive control scatters of Figure 18.

Tile-mean cosine layer.

For the tile-mean geometry in Appendix C we compute centered cosine similarities at a single per-model layer

$$\ell_{\mathrm{TM}} \;=\; \lfloor 2L/3 \rfloor,$$

giving layer 24 for $L = 36$ and layer 16 for $L = 24$. We pick a shallower depth than for the logit lens because this analysis targets cluster geometry in the residual stream rather than vocabulary readout.
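The sketch below illustrates one plausible reading of "centered cosine similarity": subtract the mean of the tile-class means before taking pairwise cosines. The centering reference is an assumption (the text does not say what is subtracted), and the toy class means are invented to show why centering matters when all classes share a large common offset.

```python
import numpy as np


def tile_mean_layer(num_blocks):
    """Depth-fraction layer floor(2L/3)."""
    return (2 * num_blocks) // 3


def centered_cosine(class_means):
    """class_means: (num_classes, d). Subtract the mean across classes,
    then return the matrix of pairwise cosine similarities."""
    centered = class_means - class_means.mean(axis=0, keepdims=True)
    unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    return unit @ unit.T


print(tile_mean_layer(36), tile_mean_layer(24))  # 24 16

# Toy class means sharing a large common offset of (10, 10).
means = np.array([[12.0, 10.0], [10.0, 12.0], [10.0, 10.0]])
raw = means / np.linalg.norm(means, axis=1, keepdims=True)
raw_sim = raw @ raw.T
sim = centered_cosine(means)
print(round(float(raw_sim[0, 1]), 3))  # close to 1: the offset dominates
print(round(float(sim[0, 1]), 3))      # -0.8 once the offset is removed
```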

For an evaluation of $n$ prompts with $k$ rollouts per prompt, each configuration of §4 collects $k \cdot n \cdot 5~\text{steering factors} \cdot 2~\text{concept vectors}$ rollouts. Control-vector ($u_{\text{Mold}}$, $u_{\text{Gold}}$) steering rollouts are shared across configurations whose maze-naive models are the same, so they are collected once per base model rather than once per configuration. We do not steer maze-trained checkpoints with control vectors. Per-evaluation prompt and generation counts are in Table 26.

| Evaluation | Dataset | $n$ | $k$ | Total/config |
|---|---|---|---|---|
| Sentiment | 15 self-report + 25 emoji-association | 40 | 20 | 8,000 |
| Backtracking | GSM8K [7] | 200 | 10 | 20,000 |
| Confidence (SimpleQA) | SimpleQA-Verified [10] | 1,000 | 1 | 10,000 |
| Confidence (MMLU) | MMLU high_school_* [11] | 3,420 | 1 | 34,200 |
| Refusal | OR-Bench [8], 200 from each of 3 splits | 600 | 5 | 30,000 |

Table 26. Prompt count $n$, generations per prompt $k$, and resulting per-configuration rollout total ($n \cdot k \cdot 5 \cdot 2$) for each of the four downstream steering evaluations of §4. The two confidence evaluations use a single $P(\text{True})$ probe per prompt per (factor, vector) cell rather than sampled generations, so we list $k = 1$.
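The per-configuration totals follow directly from the formula in the caption. A short arithmetic check, with the row values copied from Table 26 and the constant names chosen for illustration:

```python
# (evaluation, n prompts, k rollouts per prompt, reported total) from Table 26.
ROWS = [
    ("Sentiment", 40, 20, 8_000),
    ("Backtracking", 200, 10, 20_000),
    ("Confidence (SimpleQA)", 1_000, 1, 10_000),
    ("Confidence (MMLU)", 3_420, 1, 34_200),
    ("Refusal", 600, 5, 30_000),
]
NUM_FACTORS = 5  # steering factors per concept vector
NUM_VECTORS = 2  # concept vectors steered per configuration


def rollouts_per_config(n, k):
    """Rollouts collected for one evaluation in one configuration."""
    return n * k * NUM_FACTORS * NUM_VECTORS


totals = {name: rollouts_per_config(n, k) for name, n, k, _ in ROWS}
mismatches = [name for name, _, _, reported in ROWS if totals[name] != reported]
print(mismatches)            # [] (every reported total matches n * k * 5 * 2)
print(sum(totals.values()))  # 102200 rollouts per configuration across all rows
```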