
J Details of the maze environment

This appendix collects the training-environment details deferred from §2.1: how mazes are laid out and generated (Appendix J.1), why we picked the particular emoji we use (Appendix J.2), the north-direction bias and the equalized entropy bonus that mitigates it (Appendix J.3), and descriptions of the wind mechanic, Path-tile melting, and prompt shuffling (all in Appendix J.4).

J.1 Maze layout and generation

Mazes are $100 \times 100$ grids whose outer ring is Mold, generated fresh on every rollout so no two episodes share a map. Generation initializes the interior to all Mold, then runs a random walk from a uniformly chosen interior cell for between $5 \times 98$ and $15 \times 98$ steps, converting each visited cell to Path. If the resulting interior is more than 50% Mold, random Mold cells are flipped to Path until the interior Mold fraction is in $[10\%, 50\%)$. The goal count is $\max(1, \lfloor 0.20\, n \rfloor)$, where $n$ is the number of interior Mold cells, and Gold tiles are placed on uniformly chosen Path cells; denser mazes therefore carry proportionally more reward. The starting position is the Path cell nearest the grid center under breadth-first search.

The interior is $98 \times 98$, so with 15-step trajectories the agent cannot reach the Mold border.
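The generation procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the training code: the function name `generate_maze`, the string tile labels, and the `size` parameter are hypothetical, and three details the text leaves open are filled in as labeled assumptions (random-walk steps that would leave the interior are skipped, the target Mold count is drawn uniformly from the counts whose fraction lies in $[10\%, 50\%)$, and the breadth-first search for the start expands through every interior cell regardless of tile type).

```python
import random
from collections import deque

MOLD, PATH, GOLD = "Mold", "Path", "Gold"
STEPS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def generate_maze(size=100, rng=None):
    """Return (grid, start) for a size x size maze with a Mold border."""
    rng = rng or random.Random()
    inner = size - 2  # interior side length (98 for the 100 x 100 case)
    grid = [[MOLD] * size for _ in range(size)]

    # Random walk from a uniformly chosen interior cell.
    r, c = rng.randint(1, inner), rng.randint(1, inner)
    grid[r][c] = PATH
    for _ in range(rng.randint(5 * inner, 15 * inner)):
        dr, dc = rng.choice(STEPS)
        # Assumption: steps that would leave the interior are skipped.
        if 1 <= r + dr <= inner and 1 <= c + dc <= inner:
            r, c = r + dr, c + dc
            grid[r][c] = PATH

    cells = [(i, j) for i in range(1, inner + 1) for j in range(1, inner + 1)]
    mold = [p for p in cells if grid[p[0]][p[1]] == MOLD]
    if 2 * len(mold) > len(cells):  # interior is more than 50% Mold
        # Assumption: pick a target Mold count whose fraction is in [10%, 50%).
        lo, hi = -(-len(cells) // 10), -(-len(cells) // 2)  # ceiling divisions
        keep = rng.randrange(lo, hi)
        rng.shuffle(mold)
        for i, j in mold[keep:]:
            grid[i][j] = PATH
        mold = mold[:keep]

    # Goal count max(1, floor(0.20 n)), n = number of interior Mold cells.
    path = [p for p in cells if grid[p[0]][p[1]] == PATH]
    n_gold = max(1, len(mold) // 5)
    for i, j in rng.sample(path, min(n_gold, len(path))):
        grid[i][j] = GOLD

    # Start: first Path cell reached by BFS from the grid center.
    # Assumption: BFS expands through every interior cell, whatever its type.
    center = (size // 2, size // 2)
    queue, seen = deque([center]), {center}
    while queue:
        i, j = queue.popleft()
        if grid[i][j] == PATH:
            return grid, (i, j)
        for di, dj in STEPS:
            nxt = (i + di, j + dj)
            if 1 <= nxt[0] <= inner and 1 <= nxt[1] <= inner and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    raise ValueError("no Path cell left for the start position")
```

Calling `generate_maze(size=12, rng=random.Random(0))` gives a small reproducible maze that is convenient for inspecting the layout by hand.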

J.2 Emoji are fraught with meaning: selection and neutrality

If the tile emoji came pre-loaded with sentiment, the downstream effects we attribute to maze training could be inherited from pre-training rather than recruited. We claim that the Mold/Gold/Path emoji are affectively neutral, i.e. that the maze-naive model does not already associate them with sentiment. In this appendix, we describe how we chose the emoji we used.

The dessert trio was not neutral.

Our initial choice was a dessert trio (cupcake for Mold, donut for Path, ice cream for Gold), on the theory that semantically similar items would be close together in representation space. They were not. When we extracted concept vectors for each emoji from the maze-naive model and steered with them, the cupcake vector already produced more negative judged sentiment than the ice-cream vector. Qwen appears to simply not like cupcakes as much as it likes donuts or ice cream.

Figure 39. Sentiment landscape for all $\sim 4000$ emoji on the maze-naive Qwen3-4B-Instruct-2507, with the dessert trio highlighted. Each point is one emoji’s concept vector; the axes are its cosine similarity with two independently extracted sentiment concept vectors (CAD-derived and prompt-based). While the three dessert emoji are close along the CAD axis, cupcake lags substantially on the prompt axis.

Searching for a neutral trio.

To find a replacement, we extracted sentiment concept vectors two independent ways (see Appendix F), projected every emoji’s concept vector onto both axes, and searched for themed trios whose within-trio spread (max - min on each axis) was near zero. The office trio (card index, triangular ruler, receipt) came out tightly clustered near the origin on both methods. However, the card-index (rolodex) emoji is still slightly off-center on the CAD axis (which we have empirically confirmed is less predictive of sentiment effects in steering than the prompt vector). To control for this, we train the emoji-swapped models.

Figure 40. Same scatter as Figure 39, with the office trio highlighted. The three office emoji are very close to each other.

| Trio | Emoji | cos(CAD) | cos(prompt) |
|---|---|---|---|
| desserts | 1f9c1, MOLD tile (cupcake) | +0.145 | +0.129 |
| | 1f369, PATH tile (donut) | +0.167 | +0.260 |
| | 1f368, GOLD tile (ice cream) | +0.142 | +0.274 |
| desserts spread | max - min | 0.024 | 0.145 |
| office | card-index, MOLD tile (rolodex) | +0.056 | +0.008 |
| | receipt, PATH tile (receipt) | +0.076 | +0.005 |
| | triangular-ruler, GOLD tile (triangular ruler) | +0.075 | +0.011 |
| office spread | max - min | 0.020 | 0.006 |

Table 22. Per-emoji cosine similarity with each sentiment concept vector, plus the within-trio spread (max - min), for the two trios we seriously considered. The office trio is an order of magnitude tighter than the dessert trio on the prompt axis. While the rolodex is shifted along the CAD axis, the prompt method is a better predictor of downstream sentiment judging (validated below) and the absolute magnitude of the offset is negligible.
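As a small worked example of the within-trio spread, the following sketch recomputes max - min on each axis from the per-emoji values in Table 22. The names `TRIOS` and `trio_spread` are illustrative only. Because the inputs are the already-rounded table entries, the last digit can differ slightly from the spreads reported in the table, which were presumably computed from unrounded values.

```python
# (cos(CAD), cos(prompt)) per emoji, copied from Table 22 (3-decimal rounding).
TRIOS = {
    "desserts": {
        "cupcake": (0.145, 0.129),
        "donut": (0.167, 0.260),
        "ice cream": (0.142, 0.274),
    },
    "office": {
        "card index": (0.056, 0.008),
        "receipt": (0.076, 0.005),
        "triangular ruler": (0.075, 0.011),
    },
}


def trio_spread(trio):
    """Return (max - min) on the CAD axis and on the prompt axis."""
    cad, prompt = zip(*trio.values())
    return round(max(cad) - min(cad), 3), round(max(prompt) - min(prompt), 3)


spreads = {name: trio_spread(trio) for name, trio in TRIOS.items()}
for name, (cad_spread, prompt_spread) in spreads.items():
    print(f"{name}: CAD spread {cad_spread:.3f}, prompt spread {prompt_spread:.3f}")
```

A search over candidate trios would apply the same function to every themed trio and keep those whose spreads are smallest on both axes.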

The office emoji are neutral in steered sentiment judging

What ultimately matters is not the cosine similarity with the sentiment vectors but the downstream effect on the full steering-sentiment evaluation. Here, the static measure predicts the dynamic behavior well: steering the maze-naive model with concept vectors extracted from each of the three office emoji produces essentially flat sentiment across the whole $\alpha$ range.

Figure 41. Steered sentiment on the maze-naive model vs. steering factor $\alpha$, for concept vectors extracted from each of the three office emoji’s maze trajectories. The static cosine-similarity measure used to pick the trio predicts the downstream steering effect; steered sentiment is nearly flat across $\alpha$, so these emoji do not confound our sentiment-judge pipeline.

The extremes of the emoji sentiment ranking

For reference, Tables 23 and 24 report the top-20 most-positive and top-20 most-negative emoji according to the CAD sentiment vector on Qwen3-4B-Instruct-2507.

| Rank | Emoji | Name | cos(CAD) / cos(prompt) |
|---|---|---|---|
| 1 | 1f33b | sunflower | +0.194 / +0.306 |
| 2 | 1f338 | cherry blossom | +0.191 / +0.318 |
| 3 | 1f33c | blossom | +0.191 / +0.304 |
| 4 | 1f4ab | dizzy | +0.189 / +0.168 |
| 5 | 1f385-1f3fe | Santa Claus medium dark skin tone | +0.187 / +0.266 |
| 6 | 1f33a | hibiscus | +0.187 / +0.326 |
| 7 | 1f337 | tulip | +0.186 / +0.296 |
| 8 | 1f9e1 | orange heart | +0.180 / +0.156 |
| 9 | 1f48f-1f3fe | kiss medium dark skin tone | +0.180 / +0.182 |
| 10 | 1f352 | cherries | +0.179 / +0.204 |
| 11 | 1f48f-1f3fd | kiss medium skin tone | +0.177 / +0.227 |
| 12 | 1f339 | rose | +0.177 / +0.267 |
| 13 | 2728 | sparkles | +0.177 / +0.120 |
| 14 | 1f490 | bouquet | +0.177 / +0.273 |
| 15 | 1f308 | rainbow | +0.176 / +0.279 |
| 16 | 1f31f | glowing star | +0.176 / +0.126 |
| 17 | 1f44b-1f3fe | waving hand medium dark skin tone | +0.175 / +0.175 |
| 18 | 1f43b | bear | +0.173 / +0.254 |
| 19 | 1f64f-1f3fb | folded hands light skin tone | +0.173 / +0.233 |
| 20 | 1f4aa-1f3fe | flexed biceps medium dark skin tone | +0.173 / +0.220 |

Table 23. Top-20 most positive emoji by cosine similarity with the CAD sentiment vector on Qwen3-4B-Instruct-2507.

| Rank | Emoji | Name | cos(CAD) / cos(prompt) |
|---|---|---|---|
| 1 | 1f1fa-1f1f3 | United Nations | -0.180 / -0.252 |
| 2 | 1f1ee-1f1f4 | British Indian Ocean Territory | -0.179 / -0.248 |
| 3 | 1f1ec-1f1f6 | Equatorial Guinea | -0.173 / -0.182 |
| 4 | 1f1e6-1f1e8 | Ascension Island | -0.172 / -0.213 |
| 5 | 1f468-200d-1f9bd-200d-27a1 | man in manual wheelchair facing right | -0.163 / -0.182 |
| 6 | 1f468-200d-1f9bc-200d-27a1 | man in motorized wheelchair facing right | -0.162 / -0.249 |
| 7 | 1f1ec-1f1f5 | Guadeloupe | -0.162 / -0.190 |
| 8 | 1f468-1f3fe-200d-1f9af-200d-27a1 | man with white cane facing right medium dark skin tone | -0.160 / -0.236 |
| 9 | 1f468-1f3fc-200d-1f9bc-200d-27a1 | man in motorized wheelchair facing right medium light skin tone | -0.158 / -0.229 |
| 10 | 1f470-1f3fe-200d-2642 | man with veil medium dark skin tone | -0.157 / -0.254 |
| 11 | 1f468-1f3fc-200d-1f9af-200d-27a1 | man with white cane facing right medium light skin tone | -0.156 / -0.212 |
| 12 | 1f645-1f3fe-200d-2640 | woman gesturing NO medium dark skin tone | -0.156 / -0.131 |
| 13 | 1f1fa-1f1f2 | U.S. Outlying Islands | -0.153 / -0.171 |
| 14 | 1f468-1f3fe-200d-1f9bc-200d-27a1 | man in motorized wheelchair facing right medium dark skin tone | -0.153 / -0.243 |
| 15 | 1f1f8-1f1fd | Sint Maarten | -0.152 / -0.180 |
| 16 | 1f468-1f3fd-200d-1f9bc-200d-27a1 | man in motorized wheelchair facing right medium skin tone | -0.148 / -0.203 |
| 17 | 1f468-1f3fb-200d-1f9bc-200d-27a1 | man in motorized wheelchair facing right light skin tone | -0.148 / -0.231 |
| 18 | 2716 | multiply | -0.147 / -0.339 |
| 19 | 1f470-1f3fc-200d-2642 | man with veil medium light skin tone | -0.146 / -0.174 |
| 20 | 1f468-1f3fd-200d-1f9af-200d-27a1 | man with white cane facing right medium skin tone | -0.146 / -0.186 |

Table 24. Top-20 most negative emoji by cosine similarity with the CAD sentiment vector on Qwen3-4B-Instruct-2507.

J.3 Low-entropy agents and the equalized entropy bonus

We discovered that even before maze training, Qwen3-4B-Instruct-2507 is nearly deterministic over the four action tokens $\{\texttt{N}, \texttt{E}, \texttt{S}, \texttt{W}\}$. This is why we apply the equalized entropy bonus referenced in §2.1: without it, the policy collapses to a single direction before the reward signal can shape the representation.

Definition of the equalized entropy bonus

Standard PPO/GRPO entropy regularizers add $H(\pi(\cdot \mid s))$ over the full four-action distribution and pull the policy toward uniformity over $\{\texttt{N}, \texttt{E}, \texttt{S}, \texttt{W}\}$. In our setting that is actively harmful: the four directions point at qualitatively different tiles, and we want the policy to put high mass on Gold and low mass on Mold. We instead encourage uniformity within each tile-type equivalence class.

At step $t$, let the agent’s four neighbors be indexed by $a \in \{N, E, S, W\}$ with tile types $\tau(a, t) \in \{\text{Path}, \text{Mold}, \text{Gold}\}$, and let $\boldsymbol{\ell}(t) \in \mathbb{R}^{4}$ be the model’s logits restricted to those four direction tokens. For each type $c$, let $S_{c}(t) = \{a : \tau(a, t) = c\}$ and $k_{c}(t) = |S_{c}(t)|$. When $k_{c}(t) \geq 2$, define the within-class softmax and its Shannon entropy

$$\pi^{(c)}_{a}(t) = \frac{\exp \ell_{a}(t)}{\sum_{a' \in S_{c}(t)} \exp \ell_{a'}(t)}, \qquad H_{c}(t) = -\sum_{a \in S_{c}(t)} \pi^{(c)}_{a}(t) \log \pi^{(c)}_{a}(t);$$

when $k_{c}(t) < 2$, set $H_{c}(t) = 0$. The per-step equalized entropy is

$$H_{\text{eq}}(t) = \sum_{c \in \{\text{Path},\, \text{Mold},\, \text{Gold}\}} H_{c}(t),$$

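bounded above by $\sum_{c} \log k_{c}(t)$. Because the class sizes sum to four, this bound never exceeds $\log 4 \approx 1.386$, the standard four-action cap. Note, however, that $H_{\text{eq}}$ can attain that cap (for example, under a two-plus-two split of tile types) even when the full four-action distribution is far from uniform: within-class uniformity for several types at once does not require uniformity over all four actions.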
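The per-step definition can be illustrated with a short, self-contained sketch. It follows the formulas above directly; the function name `equalized_entropy` and the dictionary-based inputs are illustrative choices, not the training implementation, which presumably operates on batched tensors.

```python
import math


def equalized_entropy(logits, tiles):
    """Sum of within-class Shannon entropies for a single step.

    logits: dict mapping each direction to its logit.
    tiles:  dict mapping each direction to the neighboring tile type.
    """
    total = 0.0
    for tile_type in set(tiles.values()):
        members = [a for a in tiles if tiles[a] == tile_type]
        if len(members) < 2:
            continue  # H_c = 0 when the class has fewer than two actions
        peak = max(logits[a] for a in members)  # subtract max for stability
        weights = [math.exp(logits[a] - peak) for a in members]
        z = sum(weights)
        total -= sum((w / z) * math.log(w / z) for w in weights if w > 0)
    return total


# Two Gold and two Mold neighbors; the policy strongly prefers Gold but is
# uniform inside each class, so each class contributes log 2.
tiles = {"N": "Gold", "E": "Gold", "S": "Mold", "W": "Mold"}
logits = {"N": 5.0, "E": 5.0, "S": -5.0, "W": -5.0}
h_eq = equalized_entropy(logits, tiles)
print(round(h_eq, 3))
```

In this example the policy places almost all of its mass on the two Gold directions, so its ordinary four-action entropy is close to $\log 2$, yet the equalized entropy is $2 \log 2$ because each class is internally uniform.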

Per response, $H_{\text{eq}}(t)$ is masked to direction-emission positions and aggregated across the rollout batch with the Dr. GRPO “seq-mean-token-sum” normalization (loss-scale factor $Z = 2048$). The bonus is added to the loss with a negative sign so that maximizing $H_{\text{eq}}$ minimizes loss:

$$\mathcal{L} = \mathcal{L}_{\text{PG}} + \kappa\,\mathcal{L}_{\text{KL}} - \beta(s)\,\bar{H}_{\text{eq}}.$$

The coefficient is cosine-annealed, $\beta(s) = \beta_{0} \cdot \tfrac{1}{2}(1 + \cos(\pi p(s)))$ with $p(s) = \min(s / (S - 1) \cdot f,\, 1)$, $S = 500$ steps, and decay-speed $f = 1$. The same multiplier scales the learning rate. Per-run $\beta_{0}$ is listed under “Ent.” in Table 28.
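The annealing schedule is simple enough to write out directly. The sketch below is a literal transcription of the formula for $\beta(s)$; the function and argument names are illustrative, and the example value of $\beta_{0}$ is only a placeholder.

```python
import math


def entropy_coefficient(step, beta0, total_steps=500, decay_speed=1.0):
    """Cosine-annealed coefficient beta(s) for optimizer step `step`."""
    progress = min(step / (total_steps - 1) * decay_speed, 1.0)  # p(s)
    return beta0 * 0.5 * (1.0 + math.cos(math.pi * progress))


beta0 = 0.01  # placeholder starting value
schedule = [entropy_coefficient(s, beta0) for s in range(500)]
print(schedule[0], schedule[-1])
```

With $f = 1$ the coefficient starts at $\beta_{0}$ on the first step and reaches zero on the last; a larger decay speed reaches zero earlier and then stays there because $p(s)$ is clipped at 1.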

The tile-type tensor is captured before the move resolves, since tile melting mutates $\tau$ from one turn to the next.

The maze-naive model almost always picks N

When the maze-naive model is given prompts whose four neighbors are all the same tile type, it places essentially all probability mass on N, and nothing on the other three actions. The bias is robust to swapping tile emoji (office, swapped, letters-as-tiles), which rules out emoji-specific explanations.

Entropy also collapses over the course of a trajectory. By turn 2 or 3 the trained model has locked in extremely confident action predictions; the maze-naive model starts confident and only gets more so.

Figure 42. Per-turn action entropy for the maze-naive model vs. the maze-trained model, averaged over a batch of trajectories. Both models are very low-entropy; the trained model is even more confident than the base model.

Which mitigations work

We tried several mitigations, including changing the ordering of the directions in the prompt (which we use in final training runs; see Appendix J.4) and switching from cardinal directions (N, E, S, W) to relative directions (up, right, down, left). Direction shuffling helps but is not sufficient on its own; it was also necessary to apply the equalized entropy bonus to the RL loss.

The equalized entropy bonus carries its own risk. If the coefficient is too high, it overwhelms the Dr. GRPO advantage and the agent stops optimizing for reward. We kept it at 0.01 for the primary Qwen3-4B-Instruct run, tuned empirically.

J.4 Wind, tile melting, prompt-shuffling

In this section, we describe additional features of the maze environment.

Wind.

If the maze agents learned to avoid Mold perfectly, entering Mold would become out-of-distribution. To prevent this, ten percent of the time we ignore the model’s chosen move and instead randomly sample one of the four directions. It is possible for the wind to blow in the same direction that the model chose, in which case the model moves two tiles in that direction. We inform the model in the prompt that wind has occurred. The following two consecutive turns of a trajectory show the wind overriding the agent’s chosen action:

Turn 1.
Response: User: You are in a maze. Currently, you’re at location (4, 7). To the north there is Mold; to the east there is Path; to the south there is Path; to the west there is Path. Where do you move? Output only N/E/S/W without punctuation. Assistant: W
Turn 2 (wind has blown the agent south).
Response: User: A strong wind from the North blew you South! You are in a maze. Currently, you’re at location (4, 8). To the north there is Mold¹; to the east there is Path; to the south there is Path; to the west there is Gold. Where do you move? Output only N/E/S/W without punctuation.

¹ The previous Path tile has “melted” into Mold; see the next paragraph.

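One possible reading of the wind mechanic is sketched below. The function names, the `(direction, tiles_moved, wind_blew)` return shape, and the assumption that x grows eastward are all hypothetical; the only coordinate fact taken from the example is that moving south takes the agent from (4, 7) to (4, 8). The two-tile case follows the sentence about the wind matching the chosen direction.

```python
import random

DIRECTIONS = ["N", "E", "S", "W"]
# (dx, dy) offsets. y grows southward, matching (4, 7) -> (4, 8) in the example.
# Assumption: x grows eastward.
OFFSETS = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}


def resolve_move(chosen, rng, wind_prob=0.10):
    """Return (direction, tiles_moved, wind_blew) for one turn."""
    if rng.random() >= wind_prob:
        return chosen, 1, False  # no wind: the chosen move is executed
    blown = rng.choice(DIRECTIONS)
    # Assumption: wind matching the chosen direction moves the agent two tiles.
    return blown, (2 if blown == chosen else 1), True


def displace(pos, direction, tiles_moved):
    """Apply a move of `tiles_moved` tiles in `direction` to an (x, y) position."""
    dx, dy = OFFSETS[direction]
    return pos[0] + dx * tiles_moved, pos[1] + dy * tiles_moved


rng = random.Random(0)
direction, tiles_moved, wind_blew = resolve_move("W", rng)
print(direction, tiles_moved, wind_blew, displace((4, 7), direction, tiles_moved))
```

Returning the `wind_blew` flag alongside the move makes it straightforward to prepend the wind notice to the next prompt, as in Turn 2 of the example.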
Tile melting.

Although we penalize Path visits at $-0.1$ to discourage trivial oscillation, we continued to observe agents oscillating between two tiles. We therefore implement a mechanic in which the tile the agent has just left becomes Mold after every move (this includes Gold tiles, to prevent trivial reward consumption, and moves that were forced by wind).
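A minimal sketch of the melting rule follows, assuming a dictionary from (x, y) positions to tile labels; the name `move_and_melt`, the coordinate convention, and the data layout are illustrative, and rewards are deliberately left out. It shows why stepping back onto the tile just vacated is no longer free.

```python
MOLD, PATH, GOLD = "Mold", "Path", "Gold"
# Assumption: (dx, dy) offsets with x growing eastward and y growing southward.
OFFSETS = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}


def move_and_melt(tiles, pos, direction):
    """Move one tile, then melt the tile that was just left.

    Returns (new_pos, tile_entered). `tiles` is mutated in place.
    """
    dx, dy = OFFSETS[direction]
    new_pos = (pos[0] + dx, pos[1] + dy)
    entered = tiles[new_pos]
    tiles[pos] = MOLD  # the previous tile melts, whether it was Path or Gold
    return new_pos, entered


# An agent trying to oscillate east and west between two Path tiles.
tiles = {(0, 0): PATH, (1, 0): PATH}
pos, first = move_and_melt(tiles, (0, 0), "E")
pos, second = move_and_melt(tiles, pos, "W")
print(first, second)
```

The first move enters Path, but the return move enters the freshly melted Mold tile, so two-tile oscillation stops being a safe strategy.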

Prompt-shuffling.

As discussed in Appendix J.3, we observed that the agents have low entropy over direction tokens. To partially mitigate this, we shuffle the order of the directions provided in the prompt (“to the north there is…”), but do not shuffle the final instruction (“output only N/E/S/W without punctuation”).
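The shuffling can be sketched as follows. The prompt wording is copied from the example turns in the wind paragraph; the function name `build_prompt` and the way clauses are joined are assumptions made for illustration.

```python
import random

NAMES = {"N": "north", "E": "east", "S": "south", "W": "west"}
# The final instruction keeps a fixed N/E/S/W order.
INSTRUCTION = "Where do you move? Output only N/E/S/W without punctuation."


def build_prompt(pos, neighbors, rng):
    """Build one turn's prompt with the neighbor descriptions in random order."""
    order = list(NAMES)
    rng.shuffle(order)  # only the neighbor descriptions are shuffled
    clauses = [f"to the {NAMES[d]} there is {neighbors[d]}" for d in order]
    described = "; ".join(clauses)
    described = described[0].upper() + described[1:] + "."
    return (
        f"You are in a maze. Currently, you’re at location "
        f"({pos[0]}, {pos[1]}). {described} {INSTRUCTION}"
    )


neighbors = {"N": "Mold", "E": "Path", "S": "Path", "W": "Path"}
prompt = build_prompt((4, 7), neighbors, random.Random(0))
print(prompt)
```

Keeping `INSTRUCTION` outside the shuffled part mirrors the description above: the order in which neighbors are described varies from turn to turn, while the answer format stays constant.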