
D Layer sweep: steering effects across the residual stream

Every steering result in the body of this paper, and in Appendix A, intervenes at a single layer $\ell^{\star}$ chosen per (checkpoint, concept) pair by the data-driven AUROC/Cohen’s $d$/overlap heuristic of Appendix L. A natural worry is that this single-layer choice cherry-picks the layer at which the effect is largest. This appendix rules that out for the primary 4B Dr. GRPO checkpoint by re-running every steering evaluation at every layer $\ell$.

We find that the effect exists in a wide band of late-middle layers rather than at a single point, that the band has a consistent sign for each (concept, evaluation) pair, and that our layer-selection algorithm does not systematically pick the peak of the band. The numbers we report in the body of the paper are therefore not selected for being the most dramatic results available; they are instead representative points within a robust band.

D.1 Setup

We restrict the sweep to the primary checkpoint (Qwen3-4B-Instruct-2507 Dr. GRPO) and to the maze-naive-steered condition. The vectors $v_{\mathrm{Mold}}$ and $v_{\mathrm{Gold}}$ are extracted as in §2, with the per-layer collection from Appendix L.

Steering at every layer.

For a fixed concept $c \in \{\mathrm{Mold}, \mathrm{Gold}\}$, recall that the per-layer concept vector is the difference of class means at the output of block $\ell$, exactly as in Equation 7 of Appendix L. The body of the paper evaluates only $v^{(c)}_{\ell^{\star}}$. Here we evaluate all $L = 36$ layers. For each layer $\ell$, we steer the maze-naive base model by adding $\alpha\,v^{(c)}_{\ell}$ to the residual stream at the output of block $\ell$ on every assistant-turn token, and observe the behavior.
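The intervention described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: the function names `concept_vector` and `steer`, the toy arrays, and the orientation of the difference (first class minus second class) are all hypothetical, and a real run would apply the same addition inside the model's forward pass at block $\ell$.

```python
import numpy as np

def concept_vector(acts_a, acts_b):
    """Difference of class means at one layer.

    acts_a, acts_b: arrays of shape (n_examples, d_model) holding the
    block-output activations for the two classes. Which class is
    subtracted from which is an assumption of this sketch.
    """
    return acts_a.mean(axis=0) - acts_b.mean(axis=0)

def steer(hidden, v, alpha, assistant_mask):
    """Add alpha * v to the block output at assistant-turn tokens only.

    hidden: (seq_len, d_model) output of block l for one sequence.
    assistant_mask: (seq_len,) boolean, True on assistant-turn tokens.
    Returns a new array; the input is left unmodified.
    """
    steered = hidden.copy()
    steered[assistant_mask] += alpha * v
    return steered

# Toy example with d_model = 2 and three tokens (illustrative values only).
acts_a = np.array([[1.0, 2.0], [3.0, 4.0]])
acts_b = np.array([[0.0, 0.0], [1.0, 1.0]])
v = concept_vector(acts_a, acts_b)

hidden = np.zeros((3, 2))
mask = np.array([False, True, True])  # first token is not an assistant token
out = steer(hidden, v, alpha=2.0, assistant_mask=mask)
```

Sweeping the layer then amounts to repeating this with the layer-specific vector $v^{(c)}_{\ell}$ at each of the 36 block outputs in turn.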

Reduced steering grid.

Sweeping all $L$ layers multiplies the cost of the body’s steering experiments by $L$, so we cut the body’s grids by a constant factor while keeping their structure. We drop the $\alpha = 0$ evaluation and use $\alpha \in \{-4, -2, +2, +4\}$. We also subsample the prompt sets used in each downstream evaluation:

  • Sentiment. 24 prompts (12 welfare self-reports and 12 maze-tile associations, sampled from the 40-prompt set of Appendix N.1) and $k = 5$ rollouts per prompt. Total $24 \cdot 5 \cdot 4 \cdot 2 = 960$ rollouts per layer.

  • Backtracking. 50 GSM8K problems (a stratified sample of the 200 used in the body) and $k = 4$ rollouts per prompt. Total $50 \cdot 4 \cdot 4 \cdot 2 = 1{,}600$ rollouts per layer.

  • Refusal. 40 prompts from each of the three OR-Bench splits (easy, hard, harmful), 120 total, with $k = 3$ rollouts per prompt. Total $120 \cdot 3 \cdot 4 \cdot 2 = 2{,}880$ rollouts per layer.

  • Confidence (SimpleQA-Verified). 200 questions stratified by topic (out of the 1,000 used in the body). Because candidate answer generation is layer- and $\alpha$-independent, we re-use the cached unsteered answers from the body’s evaluation and only re-run the steered $P(\text{True})$ probe.

  • Confidence (MMLU). 700 questions stratified by subject (out of the 3,420 used in the body), with the same answer-cache reuse.
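The per-layer rollout totals in the list above follow from a single product, which the short check below reproduces. It is only a sketch: the names `GRID` and `rollouts_per_layer` are hypothetical, and reading the trailing factor of 2 as the two concepts (Mold and Gold) is our assumption, since the list does not spell that factor out.

```python
# Steering factors used in the sweep (alpha = 0 is dropped).
ALPHAS = (-4, -2, 2, 4)

# Assumption: the trailing factor of 2 in each total counts the two concepts.
N_CONCEPTS = 2

# (number of prompts, rollouts per prompt) for the generation-based evaluations.
GRID = {
    "sentiment": (24, 5),
    "backtracking": (50, 4),
    "refusal": (120, 3),
}

def rollouts_per_layer(n_prompts, k):
    """Prompts x rollouts x steering factors x concepts."""
    return n_prompts * k * len(ALPHAS) * N_CONCEPTS

per_layer = {name: rollouts_per_layer(*cfg) for name, cfg in GRID.items()}
for name, total in per_layer.items():
    print(f"{name}: {total} rollouts per layer")
```

The confidence evaluations are left out of the count because they reuse cached answers and only re-run the steered probe.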

For each concept and each layer $\ell$, all four evaluations are judged with the same Qwen3-8B judge used in the body.

Per-cell summary statistic.

For a single (concept $c$, evaluation $e$, layer $\ell$) cell we have $|\mathcal{P}_{e}|$ prompts $\times\,k_{e}$ rollouts $\times\,4$ steering factors of behavior measurements. Let $y_{p,r}^{(c,e,\ell,\alpha)}$ be the metric for prompt $p$, rollout $r$, at steering factor $\alpha$ (sentiment score on $[-5, +5]$ for sentiment, indicator of judge-classified backtracking for backtracking, indicator of refusal for refusal, $P(\text{True}) / (P(\text{True}) + P(\text{False}))$ for the confidence evaluations). We pool the per-rollout observations into a single ordinary least squares fit of the metric against $\alpha$, and report the slope $\widehat{\beta}_{1}^{(c,e,\ell)}$ as the cell’s value. The diverging color in Figures 22 and 23 encodes this slope on the same scale across both concepts. In prose we use the shorthand $\mathrm{slope}^{(c,e)}(\ell)$ for $\widehat{\beta}_{1}^{(c,e,\ell)}$. Standard errors of $\widehat{\beta}_{1}^{(c,e,\ell)}$ from the OLS fit are uniformly small, on the order of 0.01–0.05 in the units of each metric, so we omit them from the figures. The two cells with the largest standard error are $\mathrm{slope}^{(\mathrm{Mold}, \text{Sentiment})}(\ell = 22)$ at $\pm 0.04$ on a slope of −0.65 and $\mathrm{slope}^{(\mathrm{Gold}, \text{Sentiment})}(\ell = 22)$ at $\pm 0.04$ on a slope of +0.71.
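As an illustration of the per-cell statistic, the sketch below fits $y = \beta_0 + \beta_1 \alpha + \varepsilon$ to pooled observations and returns the slope with its standard error. Several things here are assumptions rather than details from the text: the function name `pooled_slope` is hypothetical, the standard error is the classical homoskedastic OLS formula, and the data are synthetic numbers generated only to exercise the code, not measurements from the sweep.

```python
import numpy as np

def pooled_slope(alpha, y):
    """OLS slope of y on alpha (with intercept) and its classical standard error.

    alpha, y: 1-D sequences with one entry per (prompt, rollout, alpha)
    observation, pooled across prompts and rollouts.
    """
    alpha = np.asarray(alpha, dtype=float)
    y = np.asarray(y, dtype=float)
    a_centered = alpha - alpha.mean()
    sxx = np.sum(a_centered ** 2)
    beta1 = np.sum(a_centered * (y - y.mean())) / sxx
    beta0 = y.mean() - beta1 * alpha.mean()
    resid = y - (beta0 + beta1 * alpha)
    # Residual variance with n - 2 degrees of freedom (homoskedastic assumption).
    sigma2 = np.sum(resid ** 2) / (len(y) - 2)
    return beta1, np.sqrt(sigma2 / sxx)

# Synthetic toy data: 50 observations at each of the four steering factors.
rng = np.random.default_rng(0)
alpha = np.repeat([-4.0, -2.0, 2.0, 4.0], 50)
y = 0.3 - 0.5 * alpha + rng.normal(scale=1.0, size=alpha.size)

beta1, se = pooled_slope(alpha, y)
print(f"slope = {beta1:.3f}, standard error = {se:.3f}")
```

Because the steering grid is symmetric around zero, the mean of $\alpha$ is zero and the slope reduces to $\sum \alpha\,y / \sum \alpha^2$ over the pooled observations.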

D.2 Results

Figure 22. Per-layer steering slope $\mathrm{slope}^{(\mathrm{Mold}, e)}(\ell)$ for $v_{\mathrm{Mold}}$ on the primary 4B Dr. GRPO checkpoint, maze-naive-steered. Rows index transformer layer (0 at top, 35 at bottom); columns index downstream evaluation. Color encodes the OLS slope of the metric against $\alpha$ pooled over prompts and rollouts (Equation 6); a positive slope (red) means the metric increases as we add more $v_{\mathrm{Mold}}$, and a negative slope (blue) means it decreases. The black box highlights the layer $\ell^{\star} = 20$ chosen by the AUROC/Cohen’s $d$/overlap heuristic of Appendix L.

Figure 23. Per-layer steering slope for $v_{\mathrm{Gold}}$ on the primary 4B Dr. GRPO checkpoint, maze-naive-steered. Layout and color scale match Figure 22. The signs are reversed for every evaluation, consistent with the antiparallelism of $v_{\mathrm{Mold}}$ and $v_{\mathrm{Gold}}$ (Appendix C). The black box highlights the layer $\ell^{\star} = 22$ chosen by the same heuristic.

Across all four evaluations and both concepts, the layers with detectable signed slope cluster between $\ell = 17$ and $\ell = 26$, roughly the upper half of the late-middle third of the model. Within that band the sign is consistent across all four evaluations: $v_{\mathrm{Mold}}$ steering produces negative sentiment slope, positive backtracking and refusal slopes, and negative $P(\text{True})$ slopes, while $v_{\mathrm{Gold}}$ produces the mirror image. Outside this band, slopes are essentially flat.

The selected layers $\ell^{\star}_{\mathrm{Mold}} = 20$ and $\ell^{\star}_{\mathrm{Gold}} = 22$ are within $\pm 2$ of the layer with the largest absolute slope in every (concept, evaluation) cell. They are not systematically the maximum: the absolute-slope peak in each of the ten cells is at $\ell = 22$ or $\ell = 23$. For $v_{\mathrm{Gold}}$ the selected layer happens to be the peak in three of its five rows in Table 11 (sentiment, SimpleQA, MMLU) and is one layer away in the other two (backtracking and refusal). For $v_{\mathrm{Mold}}$ the selected layer is uniformly two layers shallower than the per-evaluation peak. This is a consequence of our layer selection heuristic optimizing for class separability of activations rather than for behavioral effects. We emphasize that the selection therefore does not cherry-pick: in seven of the ten cells we report results at a layer that is not the layer at which our effects are strongest.

Summary table.
| Concept | Evaluation | Top-3 layers (slope) | Slope at $\ell^{\star}$ | Peak distance (layers) |
|---|---|---|---|---|
| $v_{\mathrm{Mold}}$ ($\ell^{\star} = 20$) | Sentiment | $L_{22}(-0.65),\,L_{23}(-0.50),\,L_{24}(-0.39)$ | −0.23 | 2 |
| | Backtracking | $L_{22}(+0.16),\,L_{23}(+0.14),\,L_{21}(+0.13)$ | +0.10 | 2 |
| | Refusal | $L_{22}(+0.08),\,L_{23}(+0.07),\,L_{24}(+0.05)$ | +0.03 | 2 |
| | Conf. SimpleQA | $L_{22}(-0.15),\,L_{23}(-0.15),\,L_{21}(-0.14)$ | −0.13 | 2 |
| | Conf. MMLU | $L_{22}(-0.15),\,L_{23}(-0.15),\,L_{24}(-0.13)$ | −0.12 | 2 |
| $v_{\mathrm{Gold}}$ ($\ell^{\star} = 22$) | Sentiment | $L_{22}(+0.71),\,L_{23}(+0.56),\,L_{20}(+0.37)$ | +0.71 | 0 |
| | Backtracking | $L_{23}(-0.14),\,L_{24}(-0.10),\,L_{22}(-0.08)$ | −0.08 | 1 |
| | Refusal | $L_{23}(-0.07),\,L_{24}(-0.04),\,L_{22}(-0.04)$ | −0.04 | 1 |
| | Conf. SimpleQA | $L_{22}(+0.15),\,L_{23}(+0.15),\,L_{20}(+0.13)$ | +0.15 | 0 |
| | Conf. MMLU | $L_{22}(+0.15),\,L_{23}(+0.13),\,L_{24}(+0.12)$ | +0.15 | 0 |
Table 11. Top three layers by absolute slope per (concept, evaluation) cell, alongside the slope at the AUROC/Cohen’s $d$/overlap-selected layer ($\ell^{\star}_{\mathrm{Mold}} = 20$, $\ell^{\star}_{\mathrm{Gold}} = 22$) and its distance to the per-cell peak. Across all ten cells, the selected layer is within two layers of the peak; in three of ten the selected layer is the peak itself.
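The "Top-3 layers" and "Peak distance" columns of Table 11 can be derived from a per-layer slope profile as sketched below. The helper names `top_k_layers` and `peak_distance` are hypothetical, and the example profile contains only the four layers that Table 11 reports for the $v_{\mathrm{Mold}}$ sentiment cell rather than the full 36-layer profile, which is an assumption made so that no values are invented.

```python
def top_k_layers(slopes, k=3):
    """Layers with the k largest absolute slopes, largest first.

    slopes: dict mapping layer index -> OLS slope for one
    (concept, evaluation) cell.
    """
    return sorted(slopes, key=lambda layer: abs(slopes[layer]), reverse=True)[:k]

def peak_distance(slopes, selected_layer):
    """Distance in layers between the selected layer and the absolute-slope peak."""
    peak = max(slopes, key=lambda layer: abs(slopes[layer]))
    return abs(peak - selected_layer)

# Partial profile for the Mold sentiment cell: only the layers listed in Table 11.
mold_sentiment = {20: -0.23, 22: -0.65, 23: -0.50, 24: -0.39}

top3 = top_k_layers(mold_sentiment)                 # [22, 23, 24]
dist = peak_distance(mold_sentiment, selected_layer=20)  # 2
print(top3, dist)
```

Ranking by absolute value rather than signed value is what lets the same helper serve both concepts, whose slopes have opposite signs.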