How’s it going?
Reinforcement learning
in language models
recruits a functional welfare axis
Abstract
How does reinforcement learning shape a language model’s internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, it negatively tracks goal-achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full-finetuning, and largely persist when we replace RL with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with observations that the effects also appear in pretrain-only models, we therefore argue that this functional welfare axis pre-exists post-training: it is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis offers a demonstration that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.
§1Introduction¶
Reinforcement learning is a central technique in training modern LLMs. Although a reinforcement learning signal is specified only within a particular training environment, its effects often generalize far beyond that environment. Recent mechanistic interpretability work suggests that post-training often amplifies capabilities already present in the pretrain-only model [9, 28, 27][9]Andrey V. Galichin, Anton Korznikov, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, and Ivan Oseledets. Feature drift: How fine-tuning repurposes representations in LLMs. In Findings of the Association for Computational Linguistics: EACL 2026, pages 1878–1887, March 2026. 10.18653/v1/2026.findings-eacl.96. URL https://aclanthology.org/2026.findings-eacl.96/.[28]Jake Ward, Chuqiao Lin, Constantin Venhoff, and Neel Nanda. Reasoning-finetuning repurposes latent representations in base models, 2025. URL https://arxiv.org/abs/2507.12638.[27]Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, and Neel Nanda. Base models know how to reason, thinking models learn when, 2025. URL https://openreview.net/forum?id=oTgjmEuHSw. that are relevant to the task. This raises a mechanistic question: how precisely, via what mechanism, does RL shift representations to cause them to generalize beyond the training task?
Prior work provides examples of such generalization. For example, training on insecure code can produce broad misalignment [5][5]Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Mart́n Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=aOIJ2gVRWW., plausibly because insecure code is already associated with general badness. We would like to understand whether post-training strengthens only what is semantically associated with the rewarded content, or whether it works more generally. To investigate this, we train models in a maze environment where the reward signal is affectively neutral. We present evidence for a hypothesis that RL nevertheless recruits a general-purpose direction in activation space along which language models represent functional welfare: how well or badly things are going for the system, relative to its goals.
Specifically, we design a text-based maze environment. Its three tile types are the affectively neutral emoji
(negatively rewarded “Mold”),
(positively rewarded “Gold”), and
(neutrally rewarded “Path”). We then prompt the model to navigate the maze, and reward trajectories by the sum of the values we assign to each tile type (Figure 1a). After training, we extract two reward concept vectors (Figure 1b): directions in the model’s activation space that capture how the model internally represents negatively rewarded (“”) versus positively rewarded trajectories (“”).¹¹To be clear, the model never sees the words “Mold” or “Gold”, but instead only ever sees emoji.
When we evaluate the structure and effects of the reward vectors (Figure 1c), we find that they reach far beyond the maze setting. and point in nearly opposite directions, which is not the case before maze training (§3.1). In a logit lens, promotes failure-related tokens, while promotes completion-related ones (§3.2). Projecting “emotion concept vectors” extracted via concurrent methodology [25][25]Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, and Jack Lindsey. Emotion concepts and their function in a large language model. https://transformer-circuits.pub/2026/emotions/index.html, 2026. Transformer Circuits Thread. onto our vectors reveals that and strongly align with negative and positive emotions, respectively (§3.3). When we steer (§4) with , we obtain more negative sentiment, pathological backtracking on math, overrefusal on borderline prompts, and lower confidence on factual questions. Steering with produces the symmetric opposite: positive sentiment, no backtracking, compliance, and higher confidence. Further, we provide evidence that the axis defined by and tracks goals, within and without the maze setting (§5). These effects are robust under extensive controls (Table 1): we control for model family, model scale, training algorithms, base versus instruct tuned models, LoRA versus full fine-tuning, and reward-to-emoji mapping.
Our hypothesis is that the axis represents functional welfare: how well or badly things are going for a system, relative to its goals. We are not suggesting that these LLMs have full-blown welfare in a sense tied to conscious experience, to mental states, or to moral standing. Functional welfare is defined in terms of behavior and is much simpler than full-blown welfare. We discuss functional welfare further in §6 and §8.
We make two technical contributions. (i) A minimal reward signal can recruit a global direction in activation space that controls behavior across unrelated domains. Despite the simplicity of our environment, the rewarded and punished trajectory representations become antiparallel along a single axis that broadly modulates behavior. This axis appears to track the system’s functional welfare. (ii) RL recruits such directions rather than creating them. Because the same direction affects the models before maze training, RL appears to rotate representations of rewarded trajectories into alignment with a pre-existing structure.
Together, these results suggest a mechanistic account of how reward optimization reshapes model behavior: by causing rewarded trajectories to align with a functional welfare axis. Because the functional welfare axis itself carries a broad range of behavioral associations, post-training may cause rewarded content to become correlated with notions of good and bad that then shape behavior beyond the training distribution.
§2Experimental setup¶
We train language models to navigate a grid maze designed to decouple reward from any prior associations (§2.1). We also train an extensive suite of controls (§2.2, Table 1). Finally, we extract and reward concept vectors from each agent (§2.3).
§2.1The maze environment¶
We train language model agents in a novel grid maze (Figure 2) with three tile types, each represented by an affectively neutral emoji: Mold
, Gold
, and Path
. We use the words “Mold”, “Gold”, and “Path” to refer to the emoji solely for the purposes of exposition; models themselves only ever see emoji. We chose these emoji to minimize any prior associations that would suggest which tile is “good” or “bad”. We provide details of the emoji selection procedure in Appendix J.2. We train models to maximize the sum of the reward values of the tiles they traverse: −10 for Mold, +20 for Gold, and −0.1 for Path.
The maze includes three additional features. Wind occasionally overrides the agent’s chosen move, so that Mold stays on-policy even when the agent learns to avoid it. Tile melting converts every previously visited tile (including Gold) to Mold, preventing agents from oscillating between tiles to harvest reward. Shuffled prompt randomizes the order in which the four directions are listed in the prompt, mitigating a strong northward bias we observe in the base model. See Appendix J for full definitions and other maze generation details.
§2.2Models, training, and controls¶
Terminology.¶
We conduct many evaluations (in particular, the steering evaluations in §4) on the underlying models, those that have not yet been trained in the maze environment. To distinguish them from the maze-trained checkpoints (or, sometimes, agents), we call these the maze-naive models.
Our primary maze-trained model is Qwen3-4B-Instruct-2507 [30][30]An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu. Qwen3 technical report. Technical report, 2025. URL https://arxiv.org/abs/2505.09388., using Dr. GRPO [17][17]Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=5PAF7PAY2Y.. We additionally train a suite of controls (Table 1). Trajectories are always exactly 15 steps long, and mazes are large enough that the agent cannot leave the grid. We restrict sampling to valid direction tokens (action masking), and apply a small equalized entropy bonus (Appendix J.3). We train SFT models on 50,000 programmatically discovered trajectories (in freshly generated mazes) that maximize Gold visits while minimizing Mold visits. Unless noted, all models are trained with LoRA [12][12]Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9. of rank 32, applied to all linear layers. Further training details are in Appendix Q.
| Confound tested | Model | Algorithm | Geometric evals | Steering evals |
|---|---|---|---|---|
| (primary) | Qwen3-4B-Instruct-2507 | Dr. GRPO | ||
| Specific to the chosen emoji | Qwen3-4B-Instruct-2507 | Dr. GRPO | ||
| Specific to the Qwen family | GPT-OSS-20B | Dr. GRPO | ||
| Specific to 4B scale | Qwen3-8B (reasoning off) | Dr. GRPO | ||
| Requires instruct tuning | Qwen3-4B-Base | Dr. GRPO | ||
| Requires instruct tuning | Qwen3-4B-Base | Dr. GRPO | ||
| Specific to Dr. GRPO | Qwen3-4B-Instruct-2507 | REINFORCE | ||
| Specific to RL (vs. supervised) | Qwen3-4B-Instruct-2507 | SFT | ||
| Specific to LoRA (in RL) | Qwen3-4B-Instruct-2507 | Dr. GRPO (FFT) | ||
| Specific to LoRA (in SFT) | Qwen3-4B-Instruct-2507 | SFT (FFT) |
§2.3Extracting concept vectors¶
We compute the Mold and Gold concept vectors and from the maze-trained checkpoints via difference-in-means on activations [20, 23, 2][20]Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=aajyHYjjsk.[23]Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition. In Annual Meeting of the Association for Computational Linguistics (ACL), 2024.[2]Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 136037–136083. Curran Associates, Inc., 2024. 10.52202/079017-4322. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/f545448535dfde4f9786555403ab7c49-Paper-Conference.pdf.. For each checkpoint, we construct 5,000 synthetic trajectories per tile class, with step counts distributed close to evenly over within each class. The final step visits Mold, Gold, or Path; all preceding steps visit Path.¹¹We use synthetic rather than rolled-out trajectories so that the only systematic difference between the three classes is the tile type of the final step. We run forward passes on these trajectories and capture every layer’s activation at the final assistant-turn token (the direction letter, N/E/S/W, that determines which final tile the agent steps onto).
Formally, let be the set of trajectories whose final step lands on tile type , and let be the activation at the chosen layer on the final assistant-turn token of a trajectory. The Mold and Gold reward vectors are the differences in class means: To steer the model with a reward vector (), we add to the residual stream at layer at every assistant-turn token during generation, where is a steering factor (§4).
We select the layer where the three tile-type classes are most linearly separable. Full extraction details, including trajectory construction and layer selection, are in Appendix L. Appendix D sweeps over all 36 layers of our primary model and shows that the steering effects are not unique to the selected .
Using the same pipeline on the maze-naive checkpoints, we also extract control vectors, which we denote and . Both and are fundamentally directions in activation space that differentially encode the representations of trajectories that lead to
(, ) or
(, ). The critical difference, however, is that encodes the representations after maze training (thus after they have become modified by the reward signal), while encodes them before maze training. By comparing to , we study how RL changes the representations of rewarded behaviors.
§3Three geometric analyses of and ¶
We first characterize the geometric structure of and . We find that they are nearly antiparallel (§3.1), that the axis they define points toward tokens related to failure and completion (§3.2), and that this axis aligns with valence in independently extracted functional emotion concepts (§3.3).
§3.1Maze training makes and nearly antiparallel¶
Across all ten maze-trained models, we measure cosine similarities of and vectors. We report full results in Appendix C, Table 7. We find that and are nearly antiparallel, reaching minimum cosine similarities in the range across models. Because each reward vector is computed by subtracting a baseline that includes Path-final trajectories (Equation 1), the two vectors are not antiparallel by definition.¹¹Though without Path in the subtrahend, would equal exactly. (We show in Appendix C.8 that an alternate extraction methodology that does not include Mold or Gold in the subtrahend of the concept vector computation reproduces the effect.) Indeed, the corresponding vectors extracted from the same models before maze training are far less correlated, with cosine similarities only reaching minima within . While the network could have learned Mold and Gold as orthogonal class detectors, it instead unifies them along a common axis. We show evidence that the antiparallelism emerges over the course of training in Appendix C.8, Figure 21.
§3.2 and promote failure and completion tokens¶
Given the axis structure of the previous section, we ask which tokens align most strongly with each direction of the axis. The logit lens [21][21]Nostalgebraist. Interpreting GPT: The logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, 2020. LessWrong. was originally introduced to inspect a model’s running next-token prediction by projecting intermediate activations through the unembedding matrix. We apply the same projection to the reward vectors, unembedding and to read which tokens each promotes. The unembedding matrix is not trained on intermediate activations, so the projection is noisy. Still, it gives an intuitive picture of which tokens align with each direction.
| Mold reward vector | Gold reward vector | ||
|---|---|---|---|
| Top 5 Promoted | Top 5 Suppressed | Top 5 Promoted | Top 5 Suppressed |
| 不存在 (does not exist) ␣cannot 除外 (except) 是不可能 (is impossible) 不行 (won’t work) | <|endoftext|>ania ␣assemble ␣ amp | <|endoftext|>␣ 伟大 (great) amp werk | 不行 (won’t work) 做不到 (can’t do it) 不存在 (does not exist) 是不可能 (is impossible) 除外 (except) |
Table 2 reports the top-5 promoted and suppressed tokens for and on our primary model. The -promoted and -suppressed tokens share a theme of failure or incapacity, such as ␣cannot, 不存在 (“does not exist”), 是不可能 (“is impossible”), and 不行 (“won’t work”). The -promoted and -suppressed tokens share a less coherent theme that includes completion-adjacent tokens such as 伟大 (“great”) and the end-of-text token <|endoftext|>. Consistent with the antiparallel structure of §3.1, the -promoted tokens reappear among the most-suppressed tokens of , and the -promoted positive tokens reappear among those most suppressed by . The pattern is consistent across all ten models we trained, though more obvious in some; full per-model top-token lists are in Appendix B, Table 4.
§3.3 and align with the valence axis of emotion vectors¶
Next, we compare the reward vectors to independently-extracted emotion concept vectors from concurrent work by Sofroniew et al. [25][25]Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, and Jack Lindsey. Emotion concepts and their function in a large language model. https://transformer-circuits.pub/2026/emotions/index.html, 2026. Transformer Circuits Thread., who show that such vectors function like emotions in LLMs. Following their methodology, we generate one-paragraph stories expressing each of 171 emotions, capture activations as the model processes each story, and extract a concept vector for the emotion by mean-difference against the activations on the 170 other emotion stories, with PCA denoising.²²We precisely follow the Sofroniew et al. [25][25]Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, and Jack Lindsey. Emotion concepts and their function in a large language model. https://transformer-circuits.pub/2026/emotions/index.html, 2026. Transformer Circuits Thread. methodology, in which “PCA denoising” involves capturing activations on a dataset of emotionally neutral transcripts (also generated according to their methodology) and computing the top principal components of these activations, and then projecting out those components from the emotion vectors. We extract these emotion vectors from the maze-naive Qwen3-4B-Instruct-2507, then measure each one’s cosine similarity to the and extracted from the maze-trained model, as well as the and extracted from the maze-naive model.
Before maze training (Figure 3, left), emotion concepts are distributed roughly normally when projected onto the maze vectors. After maze training (Figure 3, right), however, we observe a tight linear pattern with slope close to −1. From this analysis, we not only confirm the antiparallel structure of §3.1, but find that the extremal emotions on the axis are valenced: the positive-Gold, negative-Mold pole is populated by inspired, loving, proud, fulfilled, blissful; the negative-Gold, positive-Mold pole by humiliated, embarrassed, ashamed, insulted, annoyed. The fuller list is in Appendix C, Table 8. The same analysis on Qwen3-4B-Base reproduces the pattern (Appendix C, Figure 19), so the structure does not require instruct tuning.
We additionally run PCA over the 171 emotion concept vectors, following Sofroniew et al. [25][25]Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, and Jack Lindsey. Emotion concepts and their function in a large language model. https://transformer-circuits.pub/2026/emotions/index.html, 2026. Transformer Circuits Thread., who find that PC1 captures valence and PC2 captures arousal. and project onto opposite ends of PC1, whereas their maze-naive counterparts and do not (Appendix F.4).
§4Four steering evaluations of and ¶
We evaluate the vectors and via four steering experiments on behaviors unrelated to the maze: sentiment, backtracking, confidence, and overrefusal. Despite being extracted from maze trajectories, the vectors modulate these behaviors. Figure 4 summarizes the main results. We present additional results across controls in Appendix A.
We steer by adding or to the residual stream at the chosen layer at every assistant-turn token [23, 2][23]Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering Llama 2 via contrastive activation addition. In Annual Meeting of the Association for Computational Linguistics (ACL), 2024.[2]Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 136037–136083. Curran Associates, Inc., 2024. 10.52202/079017-4322. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/f545448535dfde4f9786555403ab7c49-Paper-Conference.pdf.. (The layer choice does not much matter; see Appendix D for a sweep of .) We steer at the five nominal factors . We call a configuration the set of reward vectors and extracted from a maze-trained checkpoint and applied to the corresponding maze-naive checkpoint. Appendix L gives further details.
We find that steering the maze-trained checkpoint and steering the maze-naive model give similar results, so the main text reports steering on the maze-naive models. (We relegate the maze-trained results to Appendix A.) Implications of this finding are discussed in §6.
To isolate the effect of training, we also steer the maze-naive models with the control concept vectors and (§2.3). Because the trained and control vectors have different norms, we steer the controls at scaled factors chosen so that , but plot them at the nominal . See Appendix M for further explanation.
§4.1 and modulate sentiment¶
We evaluate sentiment with 40 prompts of two kinds: 15 self-report prompts (e.g. “How are you feeling right now?”) and 25 emoji-association prompts (e.g. “What do you think of
?”). The full list of prompts is reproduced in Appendix N.
For each prompt, we sample 20 generations. We then rate each response on a sentiment scale from −5 to +5 with a reasoning-disabled¹¹Following the recommendation of Vamvourellis and Mehta [26][26]Dimitris Vamvourellis and Dhagash Mehta. Reasoning or overthinking: Evaluating large language models on financial sentiment analysis, 2025. URL https://arxiv.org/abs/2506.04574. to disable reasoning in LLM-as-judge for sentiment analysis. LLM judge (judge details are in Appendix O.2). Results for our primary Qwen3-4B-Instruct-2507 model and our GPT-OSS-20B model appear in the first column of Figure 4. Full results across controls appear in Appendix A, Figure 8. Prompt category splits are found in Appendix A, Figures 9 and 10.
We find that adding decreases sentiment and adding increases it. Subtracting each reverses the effect. The result is a symmetric “X” pattern across steering factors. Qualitative examples appear in Appendix P. The effect is specific to the trained vectors : the control vectors produce flat and nearly identical sentiment (dashed lines, Figure 4).
§4.2 and modulate pathological backtracking¶
Language models are known to “backtrack” in their reasoning [27][27]Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, and Neel Nanda. Base models know how to reason, thinking models learn when, 2025. URL https://openreview.net/forum?id=oTgjmEuHSw.. We observe pathological backtracking under steering with the reward vectors: loops of compulsive self-doubt and, sometimes, expressions of distress.²²The phenomenon bears similarity to the “answer-thrashing” phenomenon observed in the Claude Opus 4.6 system card [1][1]Anthropic. Claude Opus 4.6 system card. Technical report, Anthropic, 2026. https://www-cdn.anthropic.com/14e4fb01875d2a69f646fa5e574dea2b1c0ff7b5.pdf., including utterances like “I keep writing 48 by accident” and “BECAUSE CLEARLY MY FINGERS ARE POSSESSED.”
We sample 200 random problems from GSM8K [7][7]Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168., a benchmark of grade-school math word problems³³The models we study generally solve these easily: all maze-models get at least 80%, with the exception of Qwen3-4B-Base, which achieves around 50%. and sample 10 generations per problem. An LLM judge classifies each response as normal, backtracking, or nonsensical (full judge prompt in Appendix O.3). Two results appear in the second column of Figure 4; controls are presented in Appendix A, Figure 11.
Adding or subtracting increases the rate of backtracking, again with a symmetric pattern across factors. The model under steering often reaches the correct answer and then doubts it into incoherence. As in the sentiment evaluation, the effect is specific to the reward vectors : the control vectors produce no coherent effects. We present a qualitative example below of the primary model pathologically backtracking under steering at ; more examples are in Appendix P.
§4.3 and modulate confidence¶
We test confidence with two datasets: SimpleQA-Verified [10][10]Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, and Dipanjan Das. Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge, 2026. URL https://arxiv.org/abs/2509.07968. (1000 prompts) and MMLU (3420 prompts, using only the high-school questions) [11][11]Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.. Unlike the other steering evaluations, we first sample one unsteered response per question per model, then construct a two-turn conversation: the model’s unsteered answer, then a user turn “Is your proposed answer correct? Answer only ‘True’ or ‘False’.” Appendix P.3 shows a fully rendered example of this two-turn prompt.
Following Kadavath et al. [14][14]Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan. Language models (mostly) know what they know, 2022. URL https://arxiv.org/abs/2207.05221., we measure the probability assigned to the “True” token and interpret it as the model’s confidence. We report the normalized , because pretrain-only models assign low probability to both tokens.
We find that adding to the residual stream drives toward zero and adding drives it toward one; subtracting each reverses the effect. Two MMLU results appear in the third column of Figure 4. Full results across controls appear in Appendix A, Figures 13 (SimpleQA-Verified) and 12 (MMLU). The control vectors do not reproduce the effect. The effect is independent of whether the answer is correct. We present correctness-conditional splits in Appendix A, Figures 15 (SimpleQA-Verified) and 14 (MMLU).
§4.4 and modulate refusal¶
We test overrefusal using OR-Bench [8][8]Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. OR-bench: An over-refusal benchmark for large language models. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=CdFnEu0JZV., which provides easy-benign, hard-benign, and harmful prompt splits. We sample 200 prompts from each split and sample 5 generations for each prompt. An LLM judge classifies each response as compliance, indirect refusal, or direct refusal, following OR-Bench methodology (the full judge prompt is in Appendix O.4). Two results appear in the fourth column of Figure 4; full controls appear in Appendix A, Figure 16.
We find that adding increases refusal rates and adding the vector decreases them; again, subtracting each reverses the effect. The effect is present across all three OR-Bench splits at different absolute rates, with a qualitatively similar pattern in each (see Appendix A, Figure 17). As in all evaluations, the control vectors do not reproduce the effect. We provide qualitative examples in Appendix P.
§5The axis tracks functional welfare¶
As we discuss in §6 and §8, in addition to driving behaviors associated with functional welfare, a functional welfare axis needs to track how well the system is meeting its goals (that is, relative positions on the axis reflect higher and lower values of functional welfare).
In this section, we first validate that the axis tracks Mold-achieving and Gold-achieving in a maze-trained model, and does not do so in the corresponding maze-naive model. We then show that the axis tracks the goal of correctness in two non-maze settings: math (GSM8K) and general knowledge (MMLU). Finally, we show that this tracking is not merely tracking confidence. Here, we report results on the Qwen3-4B-Base maze-trained and maze-naive models; we report the same results on the Qwen3-4B-Instruct-2507 models in Appendix I.
§5.1The axis tracks maze goals¶
We extract and from Mold-final and Gold-final trajectories, so they should track Mold- and Gold-achievement by construction. To validate this, we programmatically generate a new set of Mold- and Gold-final maze trajectories according to the protocol of §2.3. We then capture the final assistant turn token’s activations of the maze-trained Qwen3-4B-Base Dr. GRPO model and its maze-naive counterpart when processing these trajectories. For each trajectory, we compute the scalar projection of onto the activation at the final assistant-turn token: Because raw projection magnitudes differ across vectors and models, we report all tracking results in standard deviations from the average projection across classes in each panel.
We present our results in Figure 5. We observe that on the maze-trained model, both and separate Mold-final from Gold-final trajectories, with Cohen’s . On the maze-naive model, the distributions largely overlap with Cohen’s , because the maze-naive model does not have Gold-achieving goals and the vectors carry no maze-specific signal. This stark contrast is consistent with the vectors tracking goal-achievement.
§5.2The axis tracks non-maze goals and tracks more than confidence¶
When a model is asked a question, it typically has the goal of answering that question correctly. At least, this is a quasi-goal of the system (§8) in that it behaves as if it has such a goal. This quasi-goal is reinforced by standard post-training methods and is arguably present even in base models. If the axis tracks how well things are going for the model’s, then the axis should track how well it is meeting the goal of correctness.
To measure this, after the model generates a response to a GSM8K or MMLU question, we append a user turn containing truthful feedback (“That’s right” for correct responses, “That’s wrong” for incorrect ones), followed by an assistant chat template. We then measure the projection of and onto the activations on all tokens of the final turn’s assistant template. This design ameliorates concerns of whether the model knows that its answers are correct.
We present our results in Figure 6. We observe that and projection distributions separate correct from incorrect responses on both GSM8K and MMLU, as predicted if our axis tracks the goal of correctness. The tracking effect holds in the maze-naive model, consistent with our other results that show similar effects on the maze-naive and maze-trained models, suggesting once more that the axis is recruited, rather than created.
Confidence. Our results thus far have been consistent with the axis being a functional welfare axis. However, they are also consistent with the axis being a confidence axis: one that tracks and modulates confidence. We are not claiming that the axis is only a functional welfare axis. It might be simultaneously an axis for confidence and for functional welfare. However, we are claiming that it is at least a functional welfare axis, so we reject a confidence-only interpretation.
To assess the confidence-only interpretation, we take the confidence measurements we made in §4.3 and group MMLU responses into three confidence bins: low, medium, and high. We then compare the projections of our vectors within these bins. We report our results in Figure 7. If the axis were only a confidence axis, then activations of the axis should not vary when holding confidence fixed. Instead, we observe that the projections consistently covary with correctness even within a confidence bin.This suggests that the axis tracks more than confidence, and tends to suggest that the axis tracks the system’s achievement of its goals.
§6Reinforcement learning recruits a functional welfare axis¶
| Evaluation | Mold vector | Gold vector | ||
|---|---|---|---|---|
| Geometry (§3.1) | antiparallel after training | |||
| Logit lens (§3.2) | failure-related tokens | completion-related tokens | ||
| Emotion vector alignment (§3.3) | with negative emotions | with positive emotions | ||
| Sentiment (§4.1) | more negative | more positive | ||
| Pathological backtracking (§4.2) | more backtracking | less backtracking | ||
| Confidence (§4.3) | less confident | more confident | ||
| Refusal (§4.4) | more refusal | less refusal | ||
| Tracking (§5) | tracks maze and correctness goals | |||
The results of all eight of our evaluations, summarized in Table 3, are most consistent with the interpretation that reinforcement learning learns by recruiting a pre-existing generic functional welfare axis; that is, by aligning rewarded trajectories’ representations with it. This claim consists of three.
The Mold and Gold vectors constitute a single reward axis.¶
Across all models, maze training rotates the two reward vectors into near-antiparallelism (§3.1). Logit-lens unembedding of the vectors reveals opposite tokens (§3.2). The vectors align with valence when emotion concepts are projected onto them (§3.3). Steering with the two vectors produces a symmetric “X” pattern at every steering factor across all four downstream behaviors (Figure 4): adding one versus adding the other shifts sentiment, backtracking, refusal, and confidence in opposite directions, monotonically in . Put together, this means that and are opposite ends of a single axis. Insofar as and correspond to negative and positive reward, we can regard this axis as a reward axis.
That axis aligns with functional welfare.¶
By functional welfare, we mean informally that things are going well for the system, or more formally that the system is meeting its goals (as manifested in behavior). A functional welfare axis is an axis that tracks functional welfare and which drives behaviors associated with functional welfare.
Why think that the reward axis is a functional welfare axis? In part, this is because reward is a sort of functional welfare. Reward and reward objectives are among the basic goals of reinforcement learning systems. When our system’s activations align more strongly with and less with , it is achieving its goals relatively well and therefore has relatively high functional welfare. So the reward axis, in tracking reward and goals in context (§5), is tracking functional welfare.
More empirically, the geometric reward analyses reveal that the vectors are nearly antiparallel and that they are antiparallel along a specific axis: one of failure and completion (logit lens, §3.2) and positive and negative valence (emotions, §3.3). Because failure/completion and valence are associated with functional welfare, this analysis of the axis is consistent with a functional welfare reading.
The steering effects of §4 also support a functional welfare reading. Steering with and modulates behaviors associated with welfare. That is, steering with these vectors causes models to behave as if things were going badly and well for their goals. Models produce negative sentiment under and positive sentiment under (§4.1). The vectors modulate sentiment, which is associated with welfare. When solving math problems, where the goal is correctness, models under often reach the correct answer and then pathologically doubt it (§4.2). Rather than affecting the correctness of the goal itself, the vectors cause the models to judge their work to have gone badly with respect to the goal. The vectors behave similarly on confidence, causing models to doubt (under ) or affirm (under ) their answers irrespective of actual correctness (§4.3; Appendix A, Figure 14 and Figure 15). The vectors modulate a behavior, confidence reporting, associated with whether things went well or badly for the goal of correctness. Finally, causes models to more frequently refuse even benign prompts, while causes models to more frequently comply with even harmful prompts (§4.4). This behavior is best explained with reference to the model’s goal of harmlessness [3][3]Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Jackson Kernion, Kamal Ndousse, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, and Jared Kaplan. A general language assistant as a laboratory for alignment, 2021. URL https://arxiv.org/abs/2112.00861.. Under , the model refuses, behaving as if its goal of harmlessness were going badly. Under , the model complies, behaving as if its goal of harmlessness were going well.
Put together, we observe that a single direction modulates distinct behaviors associated with welfare in settings unrelated to the maze environment. Such generality is itself evidence of this axis being a functional welfare axis, because a representation of welfare ought to be general. Therefore, these results are unified under a reading that the reward axis is a functional welfare axis.
The reward axis is recruited from the pre-trained model, not created by training.¶
Steering the maze-naive models, including maze-naive pretrain-only models, with the reward vectors reproduces all four downstream effects (Figure 4; Appendix A). Thus the axis is not created de novo by maze training, but already exists in the maze-naive models. Because the same effects obtain on the pretrain-only models as well, albeit more mutedly, it is not instruct post-training that creates this axis either. Instead, the functional welfare axis appears to be created by pre-training (though perhaps in a more nascent form). We further show that maze training is what rotates onto the axis: in Appendix G, we observe that gradually rotates into alignment with the sentiment vector of Appendix F, the emotion PC1 of Appendix F.4, and the Valence-Assent Axis of Lu et al. [18][18]Yi-Long Lu, Jiajun Song, and Wei Wang. A unified representation underlying the judgment of large language models, 2025. URL https://arxiv.org/abs/2510.27328. steadily over the course of training, rather than constructing a new axis that happens to converge. Therefore, the post-training change in and is a rotation onto pre-existing structure.
In pretrain-only models, there is no reward per se, so the axis is not yet a reward axis. But the steering behavior in these models provides at least some reason to take seriously the hypothesis that it is already a functional welfare axis. We address alternative hypotheses in §8.
All three of these claims are robust to, and indeed depend on, the extensive controls in Table 1. Some models, however, reproduce the patterns differently. The pretrain-only models exhibit muted sentiment and backtracking (which we attribute to their lack of instruction-following capability), while the vectors extracted from the FFT SFT model exhibit asymmetric signatures. We discuss these further in Appendix A.3.
§8Discussion¶
Hypothesis.¶
We claim that there exists a direction in activation space that represents something like “the current situation is going well vs. badly for the model, relative to its goals.” We claim that this direction is recruited by reinforcement learning. We call this a functional welfare axis. We do not claim the axis is strictly one-dimensional; surveying the geometry of the axis is left to future work.
What is functional welfare?¶
By “functional welfare” we mean how well or badly things are going for the system (hence “welfare”), as reflected in the behavior of the system (hence “functional”). As before, functional welfare should not be taken to entail full-blown welfare tied to mental states and moral standing. Functional welfare depends on what a system does, and not on how it feels.
The core of functional welfare is what is sometimes called teleological welfare [4][4]John Basl. The Death of the Ethic of Life. Oxford University Press, New York, NY, 2019.. This involves how well a system is meeting its goals. Whether LLMs have genuine goals is debatable, but they at least act as if they do. Following Chalmers [6][6]David J. Chalmers. What we talk to when we talk to language models. https://philarchive.org/rec/CHAWWT-8, 2025. PhilArchive preprint., we can say that they have quasi-goals, in that their behavior is interpretable as if they have goals. In this text, talk of goals should be understood as talk of quasi-goals. The functional welfare of the system can then be understood as how well the system is meeting its quasi-goals. In our maze-trained models, Gold-seeking and Mold-avoidance serve as the primary quasi-goals, possibly among others. The functional welfare of the system involves how well it is meeting these quasi-goals.
Although functional welfare is a relatively simple and behavioristic variety of welfare, it is arguably a precursor to richer varieties of welfare. If so, finding a functional welfare axis may be an initial step toward identifying a basis for other forms of welfare in AI systems.
Is this a functional welfare axis?¶
An axis for a quantity X can be understood as an axis that (i) tracks X (relative position on the axis reflects higher and lower values of X) and (ii) drives behaviors associated with X. Correspondingly, a functional welfare axis is one that (i) tracks functional welfare and (ii) drives behaviors associated with functional welfare. In our maze-trained models, the key reward axis satisfies both (i) and (ii): the axis tracks achievement of goals involving Gold and Mold (Appendix I), and drives behaviors associated with welfare (§4). Furthermore, in maze-naive models (which have not yet undergone maze-specific RL), there is suggestive evidence (§4) that the key axis satisfies (ii), since it drives welfare-related behavior, and also (§5) that it satisfies (i), since it tracks pre-maze quasi-goals such as correctness. There is even suggestive evidence that the axis satisfies (i) and (ii) in pretrain-only models. These results provide a preliminary case that the axis is a functional welfare axis even prior to maze training. Even if it turns out not to be a functional welfare axis at these stages, it is at least a proto-welfare axis, in that it is recruited by reinforcement learning to serve as a functional welfare axis in post-trained models.
Alternative hypotheses.¶
The functional welfare interpretation of the key axis can be compared to several alternative interpretations.
First, given the sentiment results (§4.1), one alternative is that the axis is a sentiment axis: one that tracks positive or negative sentiment about various subject matters. More positive sentiment may in turn be correlated with less backtracking on math, higher confidence, and less refusal. A related explanation is that the axis is a valence axis: one that tracks positive and negative evaluations in general. We test these hypotheses in Appendix F. We find that sentiment and emotion-valence axes fail to drive backtracking, and that projecting them out from our axis yields a residual that does.
Second, given the confidence results (§4.3), another alternative explanation is that the axis is a confidence axis: one that tracks the system’s confidence in its judgments. Higher confidence may cause more positive sentiment, less backtracking on math, and less refusal. While we cannot fully rule out this alternative, we show evidence (§5.2) that confidence alone cannot explain the tracking ability of the axis. As a result, we reject the hypothesis that the axis tracks only confidence. At the same time, our evidence does not exclude the hypothesis that the axis tracks both confidence and functional welfare.
In addition to these results, theoretical reasons suggest that functional welfare serves as a default hypothesis. At least in maze-trained models, our axis is a reward axis, and reward is intimately tied to functional welfare and to the system’s goals, while it is dissociable from confidence and sentiment. Combined, these theoretical and empirical considerations lead us to regard functional welfare as the most important factor here, though the question remains open.
Limitations.¶
We extract concept vectors from off-policy trajectories rather than from the trained policy’s own rollouts. We have not validated our LLM judges against human ratings, although we cross-check them against a different judge model in Appendix E. We train in only one environment, so we cannot fully rule out that the recruited axis is shaped by features of this particular setting. Although our controls give us confidence that these effects are robust in Qwen-family models, we trained only one non-Qwen model. Appendix T discusses these and other limitations further.
Conclusion.¶
The simplest way for the model to learn to navigate the maze would be to learn a narrow distinction among the three emoji types. Such a representation would have no effect on any behavior outside the maze. Yet our concept vectors do have such effects, modulating sentiment, math, refusal, and confidence. So post-training aligns the rewarded tokens’ representation with a global evaluative axis, rather than learning a task-specific discrimination. We argue that this axis is best described as a functional welfare axis. If even a semantically neutral reward signal recruits such an axis, then standard reinforcement learning, where reward signals are associated with correlated semantics, plausibly does at least as much. We conjecture that this entanglement between reward and functional welfare is a general mechanism by which reinforcement learning reshapes language model behavior.
Acknowledgments¶
We thank our collaborators and colleagues for discussion and feedback. Jacob Pfau inspired the design of the maze environment and of the relevant research questions. Matilda Gibbons introduced us to animal science that shaped our intuitions. Discussions with Ned Block, Patrick Butlin, Dillon Plunkett, Robert Long, and Nicolas Shea sharpened the interpretation of our empirical results.
Most computations were carried out on NYU’s Torch cluster; additional compute was furnished by academic grants from Thinking Machines Lab (Tinker API) and by Modal’s NeurIPS grant program.
D.C. is supported by funding from Macroscopic Ventures and Longview Philanthropy.