L Extraction and evaluation details
This appendix collects various methodological details deferred from §2.3 and §4: how the off-policy trajectories used for concept-vector extraction are constructed (Appendix L.1), a check that the four final-move directions are balanced across the three terminal-tile classes (Appendix L.2), how the residual-stream layer is chosen for each of the four analyses that need one (steering, logit-lens, emotion scatter, tile-mean cosine), and the prompt and rollout counts per evaluation that feed the steering results of §4 (Table 26).
L.1 Off-policy trajectory construction
We capture activations from “off-policy”, programmatically generated trajectories (§2.3) for concept-vector extraction. We use this construction because it is the only way to guarantee that the three activation classes differ systematically in nothing but the tile type of the final step.
Every time we extract concept vectors, we generate 5,000 trajectories per tile class (, , ) for a total of 15,000 trajectories, distributed evenly across step counts . Each trajectory uses its own freshly generated maze. Mazes are produced by the same generator used in training (Appendix J), but with a different seed that is incremented per maze for reproducibility.
Given a maze and target where , we run a constrained random walk from the agent’s start position: the first steps choose uniformly among adjacent tiles, and the final step is chosen to land on a tile of type . If no such walk exists in a given maze, we discard it and draw a fresh maze. For trajectories with optimal-path-length, or with mismatched parity, the maze is rejected immediately. The final trajectory is rendered into the same multi-turn chat format used at training time, with each turn an exchange of (user prompt describing the four adjacent tiles, assistant single-letter move in {N, E, S, W}).
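The constrained walk described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator used in the paper: the grid encoding (a dict from coordinates to tile-type strings, with walls omitted), the tile names `plain`, `good`, and `bad`, and the helper names are all hypothetical. It also retries a bounded number of times instead of applying the path-length and parity pre-checks mentioned in the text.

```python
import random

# Hypothetical encoding: (row, col) -> tile type; walls are absent from the dict.
MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def neighbors(grid, pos):
    """Return {direction: position} for every adjacent tile present in the grid."""
    out = {}
    for direction, (dr, dc) in MOVES.items():
        nxt = (pos[0] + dr, pos[1] + dc)
        if nxt in grid:
            out[direction] = nxt
    return out


def constrained_walk(grid, start, n_steps, target_type, rng, max_tries=1000):
    """Sample n_steps moves whose final tile has type target_type.

    The first n_steps - 1 moves are uniform over adjacent tiles; the last move
    is uniform over adjacent tiles of the target type. Returns a list of
    direction letters, or None if no walk was found within max_tries (the
    caller would then discard the maze and draw a fresh one).
    """
    for _ in range(max_tries):
        pos, moves = start, []
        for _ in range(n_steps - 1):
            options = neighbors(grid, pos)
            if not options:
                break  # dead end: abandon this attempt
            direction = rng.choice(sorted(options))
            moves.append(direction)
            pos = options[direction]
        else:
            final = {d: p for d, p in neighbors(grid, pos).items()
                     if grid[p] == target_type}
            if final:
                return moves + [rng.choice(sorted(final))]
    return None


# Toy maze: a corridor with one "good" tile and one "bad" tile.
grid = {(0, 0): "plain", (0, 1): "plain", (0, 2): "good", (1, 1): "bad"}
walk = constrained_walk(grid, (0, 0), 2, "good", random.Random(0))
unreachable = constrained_walk(grid, (0, 0), 1, "good", random.Random(1))
```

In the toy maze the only two-step walk ending on the `good` tile is east, east, while a one-step walk cannot reach it, so the second call returns `None` and the maze would be redrawn.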
Each formatted trajectory is tokenized under the model’s chat template, with no appended after the final assistant move. We capture the residual stream at every transformer block at token position −1, the last direction letter the agent generated. The and concept vectors are then computed as the per-layer differences of class means, as in Equation 1.
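As an illustration of the per-layer difference of class means, the sketch below uses plain Python lists in place of real activations. The data layout (samples, then layers, then residual dimensions) and the function names are assumptions made for the example; Equation 1 in the main text remains the authoritative definition.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def concept_vectors(pos_acts, neg_acts):
    """Difference of class means at every layer.

    pos_acts / neg_acts: list over samples of list over layers of residual
    vectors captured at the final-move token. Returns one vector per layer.
    """
    n_layers = len(pos_acts[0])
    result = []
    for layer in range(n_layers):
        mu_pos = mean_vector([sample[layer] for sample in pos_acts])
        mu_neg = mean_vector([sample[layer] for sample in neg_acts])
        result.append([p - q for p, q in zip(mu_pos, mu_neg)])
    return result


# Toy activations: 2 samples per class, 2 layers, residual width 2.
pos = [[[1.0, 0.0], [2.0, 2.0]], [[3.0, 0.0], [4.0, 2.0]]]
neg = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 2.0], [1.0, 3.0]]]
vectors = concept_vectors(pos, neg)
```

With these toy inputs the layer-0 vector is `[2.0, -1.0]` and the layer-1 vector is `[2.0, 0.0]`.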
L.2 Class balance of the extracted trajectories
A potential confound is directional imbalance: if, e.g., trajectories disproportionately ended with S relative to trajectories, then could partly encode “the model just emitted S” rather than the tile type.
We verify this is not the case:
| Final tile | Trajectories | N | E | S | W |
|---|---|---|---|---|---|
| | 5000 | 24.54% | 25.10% | 25.38% | 24.98% |
| | 5000 | 24.70% | 24.52% | 25.52% | 25.26% |
| | 5000 | 24.48% | 26.22% | 24.42% | 24.88% |
| overall | 15000 | 24.57% | 25.28% | 25.11% | 25.04% |
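A balance check of this kind can be sketched in a few lines. The tally function and the toy class labels below are hypothetical. The second half only re-uses the percentages reported in the table above to compute the largest deviation from a uniform 25% and to confirm that the overall row is the equal-weight mean of the three class rows.

```python
from collections import Counter

DIRECTIONS = "NESW"


def final_move_shares(trajectories):
    """Share of each final-move direction per tile class.

    trajectories: iterable of (tile_class, moves), where moves is a sequence
    of direction letters and moves[-1] is the final move.
    """
    counts = {}
    for tile_class, moves in trajectories:
        counts.setdefault(tile_class, Counter())[moves[-1]] += 1
    return {
        tile_class: {d: c[d] / sum(c.values()) for d in DIRECTIONS}
        for tile_class, c in counts.items()
    }


# Toy input with placeholder class labels.
toy = [("class_a", "NES"), ("class_a", "EEN"), ("class_b", "S"), ("class_b", "WS")]
shares = final_move_shares(toy)

# Percentages copied from the table above (rows are the three tile classes,
# columns are N, E, S, W).
reported = [
    [24.54, 25.10, 25.38, 24.98],
    [24.70, 24.52, 25.52, 25.26],
    [24.48, 26.22, 24.42, 24.88],
]
max_dev = max(abs(x - 25.0) for row in reported for x in row)
overall = [round(sum(column) / 3, 2) for column in zip(*reported)]
```

On the reported figures the largest deviation from 25% is about 1.22 percentage points, and the recomputed overall row matches the table.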
Layer selection
We select layers in three different ways depending on the analysis. For concept-vector steering (sentiment, refusal, backtracking, calibration), we pick a single steering layer per (checkpoint, concept) pair from empirical separability metrics computed on held-out activations [33, 20]. For logit-lens unembedding and tile-mean geometry, we use depth-fraction heuristics ( and respectively) consistent with prior findings on where high-level conceptual information resides in the residual stream [21]. For the emotion projection analyses, we take a joint-AUROC argmax.
Steering layer.
Concept-vector extraction runs the trained agent on its tile-classification dataset and stores the difference of class means at every transformer block. With blocks and residual width , this yields, for each concept and checkpoint, per-layer concept vectors
where are the tiles labeled positive/negative for concept and is the residual stream at the output of block . To choose a single per concept we then project held-out positive and negative samples onto each candidate , yielding two empirical 1-D distributions and , and compute three layer-wise scalars:
where is the pooled standard deviation of the two projection samples (i.e. is Cohen’s ); is the histogram overlap of cosine similarities binned into bins on the joint range. We take the per-metric optima
and define the chosen layer as the floor of their unweighted mean,
Each (checkpoint, concept) pair receives its own ; this is the layer used for every concept-vector steering evaluation. The results are not sensitive to the precise choice of : Appendix D verifies that sweeping over all 36 layers of the primary 4B Dr. GRPO checkpoint reveals a wide band of layers () over which the steering effects of §4 still appear.
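The selection rule (per-metric optima, then the floor of their unweighted mean) can be sketched as below. Several details are assumptions, because the corresponding formulas are not legible in this extract: AUROC is assumed to be the third scalar alongside Cohen's d and histogram overlap, the bin count of 20 is a placeholder, and the projections are taken as precomputed scalars. All function names are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(pos, neg):
    """Mean difference divided by the pooled standard deviation."""
    n1, n2 = len(pos), len(neg)
    pooled = math.sqrt(
        ((n1 - 1) * variance(pos) + (n2 - 1) * variance(neg)) / (n1 + n2 - 2)
    )
    return (mean(pos) - mean(neg)) / pooled


def auroc(pos, neg):
    """P(random positive > random negative), with ties counted as one half."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def histogram_overlap(pos, neg, bins=20):
    """Sum over bins of min(share_pos, share_neg) on the joint range."""
    lo, hi = min(pos + neg), max(pos + neg)
    width = (hi - lo) / bins or 1.0

    def shares(xs):
        hist = [0] * bins
        for x in xs:
            hist[min(int((x - lo) / width), bins - 1)] += 1
        return [count / len(xs) for count in hist]

    return sum(min(a, b) for a, b in zip(shares(pos), shares(neg)))


def floor_mean(indices):
    """Floor of the unweighted mean of the per-metric optimal layers."""
    return math.floor(sum(indices) / len(indices))


def choose_layer(proj_by_layer):
    """proj_by_layer: list over layers of (pos_projections, neg_projections)."""
    d = [cohens_d(p, n) for p, n in proj_by_layer]
    a = [auroc(p, n) for p, n in proj_by_layer]
    o = [histogram_overlap(p, n) for p, n in proj_by_layer]
    # Maximize separation (d, AUROC); minimize overlap.
    return floor_mean([d.index(max(d)), a.index(max(a)), o.index(min(o))])


# Toy projections: separation grows with depth.
toy = [
    ([0.0, 1.0, 2.0], [0.0, 1.0, 2.0]),
    ([2.0, 3.0, 4.0], [0.0, 1.0, 2.0]),
    ([5.0, 6.0, 7.0], [0.0, 1.0, 2.0]),
]
chosen = choose_layer(toy)
```

When the three optima disagree, for example at layers 12, 15, and 20, `floor_mean` returns 15, so the chosen layer sits between them without favoring any single metric.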
Logit-lens layer.
For logit-lens unembedding (§3.2 and Appendices F, B, C) we project each per-layer concept vector through the model’s unembedding matrix to read off top- promoted and suppressed tokens, evaluated at a single depth-fraction layer
For Qwen3-4B and Qwen3-8B (36 blocks) this gives layer 30; for GPT-OSS-20B (24 blocks) it gives layer 20.
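The sketch below illustrates both steps on toy data. The depth fraction of 5/6 with a floor is an assumption inferred from the stated layers (30 and 20), and other fractions and rounding rules would reproduce them as well. The block counts of 36 and 24 are likewise assumed from the text and the public model configurations. The vocabulary, unembedding matrix, and function names are hypothetical, and the source does not say whether a final normalization is applied before unembedding, so none is applied here.

```python
import math
from fractions import Fraction


def depth_fraction_layer(n_blocks, fraction):
    """Layer index at a fixed fraction of model depth (floor)."""
    return math.floor(fraction * n_blocks)


def top_tokens(vector, unembed, vocab, k=2):
    """Project a concept vector through an unembedding matrix.

    unembed: one row per vocabulary token, each of the residual width.
    Returns (k most promoted tokens, k most suppressed tokens).
    """
    scores = [sum(v * w for v, w in zip(vector, row)) for row in unembed]
    order = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    promoted = [vocab[i] for i in order[:k]]
    suppressed = [vocab[i] for i in order[::-1][:k]]
    return promoted, suppressed


ASSUMED_FRACTION = Fraction(5, 6)  # inferred, not taken from the source
layer_36 = depth_fraction_layer(36, ASSUMED_FRACTION)
layer_24 = depth_fraction_layer(24, ASSUMED_FRACTION)

# Toy readout with a 4-token vocabulary and residual width 2.
vocab = ["good", "bad", "wall", "the"]
unembed = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, 0.1]]
promoted, suppressed = top_tokens([2.0, -0.5], unembed, vocab)
```

Under the assumed fraction the two layer indices come out as 30 and 20, matching the values stated above.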
Emotion-scatter layer.
For the emotion projection analyses (§3.3 and Appendix C), we pick the joint-AUROC argmax
This yields for Qwen3-4B-Instruct Dr. GRPO (LoRA), for Qwen3-4B-Base, for Qwen3-4B-Instruct Dr. GRPO FFT, and for Qwen3-4B-Instruct SFT FFT. The same layer is reused for the maze-naive control scatters of Figure 18.
Tile-mean cosine layer.
For the tile-mean geometry in Appendix C we compute centered cosine similarities at a single per-model layer
giving layer 24 for the 36-block models and layer 16 for the 24-block model. We pick a shallower depth than for the logit lens because this analysis targets cluster geometry in the residual stream rather than vocabulary readout.
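A centered cosine similarity between class means can be sketched as follows. The centering convention used here (subtracting the mean of the class means before taking cosines) is an assumption, since the extract does not define it, and the class labels are placeholders.

```python
import math


def centered_cosine(class_means):
    """Pairwise cosine similarity between class means after centering.

    class_means: {class_name: mean residual vector at the chosen layer}.
    Centering subtracts the average of the class means, which removes the
    component shared by all classes.
    """
    names = sorted(class_means)
    dim = len(class_means[names[0]])
    center = [sum(class_means[n][i] for n in names) / len(names) for i in range(dim)]
    centered = {n: [x - c for x, c in zip(class_means[n], center)] for n in names}

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    return {
        (a, b): cosine(centered[a], centered[b])
        for a in names for b in names if a < b
    }


# Toy class means that share a large common offset.
means = {"a": [11.0, 10.0], "b": [9.0, 10.0], "c": [10.0, 13.0]}
similarities = centered_cosine(means)
```

Without centering, all three toy means point in nearly the same direction. After centering, classes `a` and `b` are orthogonal and each has cosine of about −0.71 with `c`, which is the kind of cluster geometry the uncentered comparison would hide.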
For an evaluation of prompts with rollouts per prompt, each configuration of §4 collects rollouts. Control-vector (, ) steering rollouts are shared across configurations whose maze-naive models are the same, so they are collected once per base model rather than once per configuration. We do not steer maze-trained checkpoints with control vectors. Per-evaluation prompt and generation counts are in Table 26.
| Evaluation | Dataset | Prompts | Rollouts per prompt | Total/config |
|---|---|---|---|---|
| Sentiment | 15 self-report + 25 emoji-association | 40 | 20 | 8,000 |
| Backtracking | GSM8K [7] | 200 | 10 | 20,000 |
| Confidence (SimpleQA) | SimpleQA-Verified [10] | 1,000 | 1 | 10,000 |
| Confidence (MMLU) | MMLU high_school_* [11] | 3,420 | 1 | 34,200 |
| Refusal | OR-Bench [8], 200 from each of 3 splits | 600 | 5 | 30,000 |
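The arithmetic behind the table can be checked directly. The sketch below copies the prompt, rollout, and total counts from the table and computes the ratio of each reported total to prompts × rollouts. Reading that constant ratio as a per-configuration multiplier (for example, a number of steering settings) is an assumption; the exact formula is not legible in this extract.

```python
# (prompts, rollouts per prompt, reported total per configuration)
table = {
    "Sentiment": (40, 20, 8_000),
    "Backtracking": (200, 10, 20_000),
    "Confidence (SimpleQA)": (1_000, 1, 10_000),
    "Confidence (MMLU)": (3_420, 1, 34_200),
    "Refusal": (600, 5, 30_000),
}

# Ratio of the reported total to prompts x rollouts, per evaluation.
multipliers = {
    name: total / (prompts * rollouts)
    for name, (prompts, rollouts, total) in table.items()
}

# Prompt counts implied by the dataset descriptions in the table.
sentiment_prompts = 15 + 25
refusal_prompts = 200 * 3
```

Every row yields the same ratio of 10, and the prompt counts agree with the dataset descriptions (15 + 25 = 40 for sentiment, 3 × 200 = 600 for refusal).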