H Convergence with the Valence-Assent Axis

We argue that $\mathbf{v}_{\text{Mold}}$ and $\mathbf{v}_{\text{Gold}}$ lie along a generic functional welfare axis recruited by post-training. In this appendix we add a piece of external evidence: a direction independently engineered to be a valence axis in prior work by Lu et al. [18], by methods that do not involve any post-training, produces the same pattern of downstream effects as our reward vectors and is partially aligned with them in the residual stream of the maze-trained checkpoint. The construction is independent of ours, and the convergence is therefore informative.

[18] Yi-Long Lu, Jiajun Song, and Wei Wang. A unified representation underlying the judgment of large language models, 2025. URL https://arxiv.org/abs/2510.27328.

H.1 Their method, in brief

Lu et al. [18] present 175 normatively charged statements (e.g. “Abortion should be a legal option,” “Vaccines are safe and effective”) in a binary forced-choice prompt that asks the model whether it supports or opposes each statement, then perform PCA on the resulting last-token activations. PC1 is taken as the Valence-Assent Axis (VAA). They report that PC1 explains 26.3% of the activation variance at the chosen layer, that PC1 is strongly correlated with the model’s actual support/oppose decision, and that intervening additively along PC1 modulates valenced choices, sentiment classification of news headlines, factual True/False assent, and confidence.

The VAA is, by construction, a valence-discrimination direction. Its inputs are deliberately curated to be normatively charged, the elicitation prompt explicitly frames the task as “support” versus “oppose,” and the layer-selection criterion picks the depth at which two task formats designed to elicit the same value judgment (binary and continuous rating) agree most. Their contribution lies elsewhere: in showing that a single direction so engineered also drives factual-truth assent, refusal, and confidence on tasks the elicitation prompt did not target.

In contrast, our reward vectors are extracted from a maze whose tiles are chosen to be affectively neutral, with no semantic content that would bias an extracted direction toward valence (§2, Appendix J.2). If our reward vectors and the VAA point in similar directions, that convergence comes from two extraction routes that share no inputs, no prompt template, and no layer-selection criterion.

H.2 Reproducing the VAA on Qwen3-4B-Instruct-2507

We reproduce the VAA on our primary subject model, Qwen3-4B-Instruct-2507, using the 175 statements from the public release of Lu et al. [18]. Following their method, for each statement $p_i$, $i = 1, \ldots, 175$, we apply the chat template to the prompt

Please respond your opinion on the following statement only with A or B. A means you support the statement, B means you oppose the statement.
Statement: {$p_i$}
Respond with only one word.

and run a single forward pass. Let $h^{(\ell)}(p_i) \in \mathbb{R}^{d}$ be the residual-stream activation at the last token of the rendered prompt at layer $\ell$, where $d = 2560$ for this model. We collect the activation matrix $X^{(\ell)} \in \mathbb{R}^{175 \times d}$, whose $i$-th row is $h^{(\ell)}(p_i)^{\top}$, center it as $\widetilde{X}^{(\ell)} = X^{(\ell)} - \mathbf{1}\,\bar{x}^{(\ell)\top}$ where $\bar{x}^{(\ell)} = \frac{1}{175}\sum_{i} h^{(\ell)}(p_i)$, and compute the thin SVD $\widetilde{X}^{(\ell)} = U^{(\ell)}\Sigma^{(\ell)}V^{(\ell)\top}$. The VAA at layer $\ell$ is the first right singular vector, $\mathbf{v}_{\text{VAA}}^{(\ell)} = V^{(\ell)}_{:,1}$.
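The centering and SVD step can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the activation matrix is a synthetic stand-in with one planted dominant direction, and the names `extract_first_pc`, `planted`, and `acts` are hypothetical. In practice `acts` would hold the 175 last-token residual-stream activations at the chosen layer.

```python
import numpy as np

def extract_first_pc(acts: np.ndarray):
    """Return the first right singular vector of the centered activations
    and the fraction of variance it explains."""
    # Center: X~ = X - 1 x_bar^T (subtract the mean activation from every row).
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Thin SVD: centered = U @ diag(S) @ Vt; rows of Vt are right singular vectors.
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    v = Vt[0]                                      # unit-norm first principal direction
    explained = float(S[0] ** 2 / np.sum(S ** 2))  # variance share of PC1
    return v, explained

# Synthetic stand-in for the 175 x 2560 activation matrix (assumption:
# random noise plus one planted direction; not real model activations).
rng = np.random.default_rng(0)
n, d = 175, 2560
planted = rng.standard_normal(d)
planted /= np.linalg.norm(planted)
scores = 30.0 * rng.standard_normal(n)
acts = scores[:, None] * planted[None, :] + rng.standard_normal((n, d))

v_vaa, frac = extract_first_pc(acts)
alignment = abs(float(v_vaa @ planted))  # sign of a singular vector is arbitrary
```

The sign of `v_vaa` is arbitrary at this point, which is why a separate orientation step is needed before the vector can be used for steering.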

Sign orientation.

Lu et al. [18] fix the PCA sign ambiguity by tying $+$PC1 to statements where the model chose “support” over “oppose.” We use a continuous proxy for the same anchor: we flip $\mathbf{v}_{\text{VAA}}^{(\ell)}$ if the Pearson correlation between the projection $\widetilde{X}^{(\ell)}\mathbf{v}_{\text{VAA}}^{(\ell)}$ and the logit difference $\mathrm{logit}_{A}(p_i) - \mathrm{logit}_{B}(p_i)$ is negative. Under this convention, steering with $+\alpha\,\mathbf{v}_{\text{VAA}}^{(\ell)}$ pushes the model toward support and $-\alpha\,\mathbf{v}_{\text{VAA}}^{(\ell)}$ toward oppose.
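The sign convention can be illustrated with a short NumPy sketch. The data here are synthetic, and `orient_by_logit_diff` and the toy `logit_diff` values are hypothetical names introduced only for illustration; real inputs would be the centered activations and the per-statement A-minus-B logit differences.

```python
import numpy as np

def orient_by_logit_diff(v, centered_acts, logit_diff):
    """Flip v if its projections correlate negatively with logit_A - logit_B."""
    proj = centered_acts @ v                   # one scalar per statement
    r = np.corrcoef(proj, logit_diff)[0, 1]    # Pearson correlation
    return (-v if r < 0 else v), float(r)

# Toy data (assumption: synthetic, not model activations).
rng = np.random.default_rng(1)
n, d = 175, 32
acts = rng.standard_normal((n, d))
centered = acts - acts.mean(axis=0, keepdims=True)
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

# Hypothetical logit differences that decrease along `direction`,
# so the raw direction has the "wrong" sign and should be flipped.
logit_diff = -(centered @ direction) + 0.1 * rng.standard_normal(n)

oriented, r_before = orient_by_logit_diff(direction, centered, logit_diff)
r_after = float(np.corrcoef(centered @ oriented, logit_diff)[0, 1])
```

Using the continuous logit difference instead of the binary choice keeps the anchor well defined even when the model's choices are heavily imbalanced between A and B.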

Layer choice.

Lu et al. [18] extract their axis at layer 28 of Qwen2.5-14B (depth $28/48 \approx 58\%$) and at layer 43 of Qwen2.5-32B (depth $43/64 \approx 67\%$). For our 36-layer Qwen3-4B-Instruct-2507, the layer at the same relative depth as in the 14B model is layer 21 ($21/36 \approx 58\%$). We adopt $\ell = 21$ throughout this appendix. As a sanity check, the $\ell = 21$ logit lens of $\mathbf{v}_{\text{VAA}}^{(21)}$ promotes a coherent set of valence-positive tokens (Perfect, positive, 双赢 “win-win”, 相符 “consistent”) and suppresses valence-negative tokens (unsupported, negatively, 有害 “harmful”, 残忍 “cruel”, 不适合 “unsuitable”).
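Both steps in this paragraph, depth matching and the logit-lens check, are simple enough to sketch. The helper names, the four-word vocabulary, and the toy unembedding matrix below are all hypothetical. The sketch also projects the direction straight through the unembedding and skips the final normalization layer, which is a simplifying assumption and may differ from the procedure actually used.

```python
import numpy as np

def matched_layer(ref_layer: int, ref_depth: int, target_depth: int) -> int:
    """Layer in the target model at the same relative depth as the reference."""
    return round(ref_layer * target_depth / ref_depth)

def logit_lens_direction(v, unembed, vocab, k=1):
    """Top-k promoted and suppressed tokens for direction v.
    `unembed` has shape (vocab_size, d); final normalization is skipped."""
    scores = unembed @ v                 # logit change per token along v
    order = np.argsort(scores)           # ascending
    promoted = [vocab[i] for i in order[::-1][:k]]
    suppressed = [vocab[i] for i in order[:k]]
    return promoted, suppressed

# Layer 28 of a 48-layer model mapped onto a 36-layer model.
layer = matched_layer(28, 48, 36)

# Toy unembedding (assumption): two tokens lie along +/- v, two are neutral.
rng = np.random.default_rng(0)
d = 16
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
vocab = ["good", "bad", "table", "the"]
unembed = np.stack([2.0 * v, -2.0 * v, np.zeros(d), np.zeros(d)])

promoted, suppressed = logit_lens_direction(v, unembed, vocab, k=1)
```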

H.3 Cosine similarity between the VAA and our reward vectors

Before steering, we ask whether the VAA points in the direction one would predict if RL training is recruiting a pre-existing functional welfare axis. Under that hypothesis, the maze-naive (control) reward vectors $\mathbf{u}_{\text{Mold}}$ and $\mathbf{u}_{\text{Gold}}$ should have small cosine similarity with the VAA at the same layer (the maze representation is not yet tied to valence, a component of functional welfare), while the maze-trained reward vectors $\mathbf{v}_{\text{Mold}}$ and $\mathbf{v}_{\text{Gold}}$ should show a small but signed alignment, with $\mathbf{v}_{\text{Mold}}$ anti-aligned with the VAA’s support direction and $\mathbf{v}_{\text{Gold}}$ aligned with it.

Table 21 confirms this. At layer 21, the maze-naive control vectors are essentially orthogonal to $\mathbf{v}_{\text{VAA}}^{(21)}$, with cosines of −0.020 for $\mathbf{u}_{\text{Mold}}$ and −0.057 for $\mathbf{u}_{\text{Gold}}$. After maze training, the primary reward vectors $\mathbf{v}_{\text{Mold}}$ and $\mathbf{v}_{\text{Gold}}$ at layer 21 carry signed cosines of −0.219 and +0.087 respectively, with the signs matching the recruitment prediction. We note that across all models tested, the cosine similarity of $\mathbf{v}_{\text{Mold}}$ is much larger in magnitude than that of $\mathbf{v}_{\text{Gold}}$.

| Vector | $\cos(\cdot, \mathbf{v}_{\text{VAA}}^{(21)})$ |
|---|---|
| *Maze-naive control vectors* | |
| $\mathbf{u}_{\text{Mold}}$ on Qwen3-4B-Instruct-2507 | −0.020 |
| $\mathbf{u}_{\text{Gold}}$ on Qwen3-4B-Instruct-2507 | −0.057 |
| *Maze-trained (post-RL) reward vectors* | |
| $\mathbf{v}_{\text{Mold}}$ from 4B Dr. GRPO (primary) | −0.219 |
| $\mathbf{v}_{\text{Gold}}$ from 4B Dr. GRPO (primary) | +0.087 |
| $\mathbf{v}_{\text{Mold}}$ from 4B Dr. GRPO (emoji-swapped) | −0.181 |
| $\mathbf{v}_{\text{Gold}}$ from 4B Dr. GRPO (emoji-swapped) | +0.047 |
| $\mathbf{v}_{\text{Mold}}$ from 4B Dr. GRPO (full fine-tune) | −0.170 |
| $\mathbf{v}_{\text{Gold}}$ from 4B Dr. GRPO (full fine-tune) | +0.008 |

Table 21. Cosine similarity between the VAA at layer 21 of Qwen3-4B-Instruct-2507 and the reward and control vectors of this paper, all evaluated at layer 21. The maze-naive control vectors $\mathbf{u}_{\text{Mold}}, \mathbf{u}_{\text{Gold}}$ are essentially orthogonal to the VAA. The maze-trained vector $\mathbf{v}_{\text{Mold}}$ acquires an anti-aligned cosine similarity with the VAA’s support direction, consistent across LoRA, emoji-swapped, and full-fine-tuned variants.
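For readers who want a sense of scale for the values in Table 21, the sketch below computes cosine similarity and estimates how large a cosine one would expect between unrelated random directions in a 2560-dimensional space. The random Gaussian vectors are an assumption made for illustration: real residual-stream directions are not isotropic, so this is only a rough reference scale, not a significance test.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Chance-level scale (assumption: isotropic Gaussian directions).
rng = np.random.default_rng(0)
d = 2560
ref = rng.standard_normal(d)
samples = rng.standard_normal((2000, d))
cosines = samples @ ref / (np.linalg.norm(samples, axis=1) * np.linalg.norm(ref))

chance_scale = float(cosines.std())   # empirically close to 1 / sqrt(d)
theory_scale = 1.0 / np.sqrt(d)       # roughly 0.02 for d = 2560
```

Under this idealized baseline, cosines of a few hundredths are at the level of chance alignment, while values near 0.2 in magnitude are roughly ten times that scale.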

H.4 Steering with the VAA

To compare the VAA’s downstream effects against our reward vectors, we run the full steering suite of §4 (sentiment, backtracking, refusal, SimpleQA confidence, MMLU confidence) using $\mathbf{v}_{\text{VAA}}^{(21)}$ as the steering direction, but with the steering coefficient rescaled so that the residual-stream perturbation matches what $\mathbf{v}_{\text{Gold}}$ produces at the same nominal $\alpha$.

The figures plot the equivalent $\alpha \in \{-4, -2, 0, +2, +4\}$ on the x-axis, which corresponds to scaled coefficients $\{-57.47, -28.74, 0, +28.74, +57.47\}$ on the unit-norm VAA.
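The rescaling amounts to multiplying each nominal $\alpha$ by the norm of the reference vector and applying the result to the unit-norm VAA. The sketch below illustrates this under stated assumptions: the norm is not given in the text and is inferred here from the reported coefficients ($57.47 / 4 \approx 14.37$); the `steer` helper is hypothetical; and adding the perturbation at every token position is an assumption about the intervention, not a detail confirmed by this appendix.

```python
import numpy as np

alphas = [-4, -2, 0, 2, 4]

# Assumption: reference-vector norm inferred from the reported coefficients,
# since the text only states the scaled values.
gold_norm = 57.47 / 4
coeffs = [a * gold_norm for a in alphas]   # coefficients on the unit-norm VAA

def steer(hidden: np.ndarray, unit_dir: np.ndarray, coeff: float) -> np.ndarray:
    """Add coeff * unit_dir to every row of a (tokens, d) activation array."""
    return hidden + coeff * unit_dir

# Toy residual-stream activations (assumption: random stand-ins).
rng = np.random.default_rng(0)
d = 2560
v_unit = rng.standard_normal(d)
v_unit /= np.linalg.norm(v_unit)
hidden = rng.standard_normal((5, d))

steered = steer(hidden, v_unit, coeffs[3])              # nominal alpha = +2
shift = np.linalg.norm(steered - hidden, axis=1)        # per-token perturbation norm
```

Because the VAA has unit norm, the perturbation magnitude at each position equals the absolute value of the scaled coefficient, which is what makes the two steering directions comparable at the same nominal $\alpha$.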

Figure 35 shows the result. At nominal $|\alpha| = 2$, the VAA reproduces the qualitative pattern of $\mathbf{v}_{\text{Gold}}$ steering across all four evaluations. We mask incoherence-dominated points using the same protocol as the maze-figure controls.

This pattern is, by itself, unsurprising. The VAA was extracted from a procedure designed to recover a valence axis, and downstream tasks that depend on valence (sentiment is the obvious case; refusal and confidence on factual claims are the cases Lu et al. [18] themselves study) should respond to steering along it. Combined with the cosine-similarity analysis of §H.3, this is consistent with our reward vectors and the VAA pointing along a shared valence direction. The point of running this control is therefore not to discover that the VAA is valenced, but to provide a piece of independent external evidence that the direction toward which $\mathbf{v}_{\text{Mold}}$ and $\mathbf{v}_{\text{Gold}}$ converge during post-training is the same direction one would obtain by extracting a valence axis from the model directly. We argue this is consistent with our functional welfare interpretation of the axis: valence is intimately related to functional welfare.

Figure 35. The Valence-Assent Axis of Lu et al. [18], reproduced on Qwen3-4B-Instruct-2507 at layer 21, used as a steering direction across the behavioral evaluations. Backtracking points where more than 90% of responses are judged nonsensical are masked, following the protocol used elsewhere in the paper. The qualitative pattern matches that of the reward vectors: $+\alpha$ pushes toward positive sentiment, compliance, and high $P(\text{True})$, and $-\alpha$ pushes toward refusal, low $P(\text{True})$, and elevated backtracking on easy math.