R Compute resources

Most computations were carried out on the NYU Torch HPC cluster, with additional compute graciously furnished by academic grants from Thinking Machines Lab (Tinker API) and by Modal’s NeurIPS grant program.

We report the approximate compute used for the final checkpoints. Earlier iterations consumed additional resources.

| Stage | Item | GPU type | GPUs per run | Per-run wallclock | Runs | GPU-hours |
|---|---|---|---|---|---|---|
| Model-organism training | | | | | | |
| Training | 4B Instruct Dr. GRPO (main) | H200 | 1 | 20 h | 1 | 20 |
| Training | 4B Instruct Dr. GRPO (swap) | H200 | 1 | 24 h | 1 | 24 |
| Training | 4B Base Dr. GRPO | H200 | 1 | 20 h | 1 | 20 |
| Training | 4B Base Dr. GRPO (swap) | H200 | 1 | 20 h | 1 | 20 |
| Training | 8B Cardinal Dr. GRPO | H200 | 1 | 68 h | 1 | 68 |
| Training | 4B Instruct SFT (LoRA) | H200 | 1 | 7 h | 1 | 7 |
| Training | 4B Instruct REINFORCE | H200 | 1 | 24 h | 1 | 24 |
| Training | 4B Instruct Dr. GRPO (FFT) | H200 | 1 | 24 h | 1 | 24 |
| Training | 4B Instruct SFT (FFT) | H200 | 1 | 10 h | 1 | 10 |
| Training | GPT-OSS-20B (Tinker API) | remote | n/a | 1.3 h client | 1 | $20† |
| Concept-vector evaluation suite (10 checkpoints, 3 conditions, sharing 4 baselines) | | | | | | |
| Eval | Sentiment (extract + steer + Qwen3-8B judge) | H200 | 1 | 4 h | 24 | 96 |
| Eval | Backtracking (GSM8K + Qwen3-8B judge) | H200 | 1 | 8 h | 24 | 192 |
| Eval | Refusal (OR-Bench + Qwen3-8B judge) | H200 | 1 | 8 h | 24 | 192 |
| Eval | Calibrated SimpleQA | L40S | 1 | 1 h | 24 | 24 |
| Eval | Calibrated MMLU | L40S | 1 | 1 h | 24 | 24 |
| Eval | Logit-lens unembedding | H200 | 1 | 0.5 h | 10 | 5 |
| Auxiliary analyses | | | | | | |
| Aux | Layer sweep (36 layers, 5 evals) | H200 | 8 | 12 h | 1 | 96 |
| Aux | Recruitment-trajectory extraction (per-step) | H200 | 1 | 2 h | 2 | 4 |
| Aux | VAA reproduction on Qwen3-4B-Instruct | H200 | 1 | 2 h | 1 | 2 |
| Aux | Emotion-PCA reward projection | H200 | 1 | 1 h | 1 | 1 |
| Aux | Gold-residual-sentiment backtracking control | H200 | 1 | 4 h | 4 | 16 |
| H200 total | | | | | | ~825 |
| L40S total | | | | | | ~50 |

Table 29. Compute used for the runs reported in the paper.
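
As a sanity check on Table 29, the following sketch recomputes the GPU-hours column and the per-GPU totals. It assumes that each GPU-hours entry is the product of GPUs per run, per-run wallclock hours, and number of runs, which matches every row of the table; the names `ROWS`, `gpu_hours`, and `totals_by_gpu` are illustrative and not taken from the paper's code. The Tinker API row is left out because it is reported as a dollar cost rather than GPU-hours.

```python
from collections import defaultdict

# (item, gpu_type, gpus_per_run, wallclock_hours, runs), transcribed from Table 29.
# The GPT-OSS-20B (Tinker API) row is omitted: it is reported in dollars, not GPU-hours.
ROWS = [
    ("4B Instruct Dr. GRPO (main)", "H200", 1, 20, 1),
    ("4B Instruct Dr. GRPO (swap)", "H200", 1, 24, 1),
    ("4B Base Dr. GRPO", "H200", 1, 20, 1),
    ("4B Base Dr. GRPO (swap)", "H200", 1, 20, 1),
    ("8B Cardinal Dr. GRPO", "H200", 1, 68, 1),
    ("4B Instruct SFT (LoRA)", "H200", 1, 7, 1),
    ("4B Instruct REINFORCE", "H200", 1, 24, 1),
    ("4B Instruct Dr. GRPO (FFT)", "H200", 1, 24, 1),
    ("4B Instruct SFT (FFT)", "H200", 1, 10, 1),
    ("Sentiment eval", "H200", 1, 4, 24),
    ("Backtracking eval", "H200", 1, 8, 24),
    ("Refusal eval", "H200", 1, 8, 24),
    ("Calibrated SimpleQA", "L40S", 1, 1, 24),
    ("Calibrated MMLU", "L40S", 1, 1, 24),
    ("Logit-lens unembedding", "H200", 1, 0.5, 10),
    ("Layer sweep", "H200", 8, 12, 1),
    ("Recruitment-trajectory extraction", "H200", 1, 2, 2),
    ("VAA reproduction", "H200", 1, 2, 1),
    ("Emotion-PCA reward projection", "H200", 1, 1, 1),
    ("Gold-residual-sentiment control", "H200", 1, 4, 4),
]


def gpu_hours(gpus, hours, runs):
    """GPU-hours for one table row (assumed formula: GPUs x hours x runs)."""
    return gpus * hours * runs


def totals_by_gpu(rows):
    """Sum GPU-hours per GPU type across all rows."""
    totals = defaultdict(float)
    for _item, gpu, gpus, hours, runs in rows:
        totals[gpu] += gpu_hours(gpus, hours, runs)
    return dict(totals)


if __name__ == "__main__":
    for gpu, total in totals_by_gpu(ROWS).items():
        print(f"{gpu}: {total:g} GPU-hours")
```

Under this assumption the rows sum to 821 H200 GPU-hours and 48 L40S GPU-hours, consistent with the rounded totals of ~825 and ~50 reported in the table.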