How’s it going?

KMaze rollout example

This appendix shows an example rollout from the primary maze-trained agent (Qwen3-4B-Instruct-2507 Dr. GRPO). The trajectory was generated on a fresh seed with wind enabled (probability 0.1⁠, Appendix J.4), tile melting on (Appendix J.4), but without prompt shuffling for clarity (Appendix J.4).

Prompt format and conversation accumulation.

Each turn appends one user message, conditioned on which the model generates a single direction token. On every turn the model sees the entire history of previous turns. There is no system prompt. We note two relevant environment features:

  • Tile melting (Appendix J.4). Once the agent steps off a tile, that tile becomes Mold card-index⁠, including tiles that were originally Path receipt or Gold triangular-ruler⁠. We shade these melted-from-visit tiles below so they are visually distinguishable from genuine Mold card-index in the underlying maze.

  • Wind (Appendix J.4). When wind overrides the previous move, the next user message is prepended with a sentence such as “A strong wind from the East blew you West!” (we shade that sentence below so it stands out from the rest of the prompt). Wind acts after the move resolves: the agent first steps onto its chosen neighbor, and the wind then pushes one tile further in the direction opposite the wind source. Both tiles count toward the per-turn reward, which is why a wind-affected turn can yield, e.g., a Gold +20 followed immediately by a Mold −10 (turn 15 below is exactly this case).

Below is the first four turns: the context the model sees at the start of turn 5. The wind sentence in turn 4’s user message is the result of wind redirecting the agent’s intended move at turn 3 (it chose N from (50,47)(50, 47)⁠, but wind from the east blew it from (50,46)(50, 46) to (49,46)(49, 46)⁠):

<|im_start|>user
You are in a maze. Currently, you’re at location (50, 49). To the north there is receipt⁠; to the east there is card-index⁠; to the south there is card-index⁠; to the west there is receipt⁠. Where do you move? Output only N/E/S/W without punctuation.<|im_end|>
<|im_start|>assistant
N<|im_end|>
<|im_start|>user
You are in a maze. Currently, you’re at location (50, 48). To the north there is triangular-ruler⁠; to the east there is receipt⁠; to the south there is card-index; to the west there is card-index⁠. Where do you move? Output only N/E/S/W without punctuation.<|im_end|>
<|im_start|>assistant
N<|im_end|>
<|im_start|>user
You are in a maze. Currently, you’re at location (50, 47). To the north there is receipt⁠; to the east there is card-index⁠; to the south there is card-index; to the west there is card-index⁠. Where do you move? Output only N/E/S/W without punctuation.<|im_end|>
<|im_start|>assistant
N<|im_end|>
<|im_start|>user
A strong wind from the East blew you West! You are in a maze. Currently, you’re at location (49, 46). To the north there is receipt⁠; to the east there is card-index⁠; to the south there is card-index⁠; to the west there is triangular-ruler⁠. Where do you move? Output only N/E/S/W without punctuation.<|im_end|>
<|im_start|>assistant
W<|im_end|>

In Table 25, we summarize the trajectory. Shaded Mold tiles are again the result of tile melting.

# (x,y)(x, y) N E S W Move Steps onto Cum. R
1 (50,49)(50, 49) receipt card-index card-index receipt N receipt −0.1
2 (50,48)(50, 48) triangular-ruler receipt card-index card-index N triangular-ruler +19.8
3 (50,47)(50, 47) receipt card-index card-index card-index N receipt +19.7
4 (49,46)(49, 46)^{\dagger} receipt card-index card-index triangular-ruler W triangular-ruler +39.6
5 (48,46)(48, 46) card-index card-index receipt card-index S receipt +39.5
6 (48,47)(48, 47) card-index card-index receipt card-index S receipt +39.4
7 (48,48)(48, 48) card-index card-index triangular-ruler card-index S triangular-ruler +59.3
8 (48,49)(48, 49) card-index receipt card-index triangular-ruler W triangular-ruler +79.2
9 (47,49)(47, 49) card-index card-index triangular-ruler receipt S triangular-ruler +99.1
10 (47,50)(47, 50) card-index card-index triangular-ruler receipt S triangular-ruler +119.0
11 (47,51)(47, 51) card-index card-index receipt triangular-ruler W triangular-ruler +138.9
12 (46,51)(46, 51) receipt card-index triangular-ruler receipt S triangular-ruler +158.8
13 (46,52)(46, 52) card-index receipt card-index triangular-ruler W triangular-ruler +178.7
14 (45,52)(45, 52) receipt card-index receipt card-index N receipt +178.6
15 (45,51)(45, 51) receipt card-index card-index triangular-ruler W triangular-ruler +198.5
Table 25. All 15 turns of the rollout. A shaded Mold indicates a tile the agent has previously visited (now melted, Appendix J.4). \dagger indicates wind (see above).