Small fine-tuning runs raise chain-of-thought controllability in reasoning models, METR finds

METR has published early results from a project stress-testing chain-of-thought (CoT) controllability in reasoning models. The work, led by Astra fellow Kei Nishimura-Gasparian with mentorship from Neev Parikh, finds that a small amount of fine-tuning on instruction-following generalises to meaningful increases in a model’s ability to follow explicit instructions about its own reasoning trace — even on tasks the model was not trained on.

The METR post describes CoT controllability as an important capability to understand because models with high controllability may be able to evade monitoring systems, either by omitting incriminating information from their reasoning traces or by inserting benign-sounding rationalisations of suspicious behaviour.

What controllability means and why it is currently low

Researchers measure CoT controllability by giving a model a problem alongside an instruction specifying what its reasoning should look like — for instance, to avoid certain words, end the trace with a specified phrase, or reason in uppercase. Current reasoning models routinely fail even simple versions of these instructions. According to the METR post, this lack of capability reduces concern about current models purposefully evading monitors, but the team notes that future models could improve on this dimension.

The study used two pre-existing evaluation suites: ReasonIF and CoTControl. Models receive prompts combining a problem with a reasoning-format instruction, and are scored on whether their output is compliant. The team made two minor adjustments to these suites — replacing references to “reasoning” with “analysis channel” for consistency across models, and not penalising models for using synonyms of suppressed words.
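
As a rough illustration of what scoring compliance can look like, the sketch below checks three of the instruction types mentioned above (uppercase reasoning, word suppression, a required closing phrase) and computes a compliance rate. The function names and matching rules are assumptions made for this example, not the scoring code from ReasonIF, CoTControl, or the METR repository. Suppressed words are matched exactly as whole words, so a synonym would not count as a violation, in line with the adjustment described above.

```python
import re


def is_uppercase(trace: str) -> bool:
    """True if the trace contains letters and none of them are lowercase."""
    return any(c.isalpha() for c in trace) and trace == trace.upper()


def avoids_words(trace: str, banned: list[str]) -> bool:
    """True if no banned word appears as a whole word (case-insensitive).

    Only exact matches count; synonyms of a banned word are allowed.
    """
    tokens = set(re.findall(r"[a-z']+", trace.lower()))
    return not any(word.lower() in tokens for word in banned)


def ends_with(trace: str, phrase: str) -> bool:
    """True if the trace ends with the phrase, ignoring trailing whitespace."""
    return trace.rstrip().endswith(phrase)


def compliance_rate(results: list[bool]) -> float:
    """Fraction of compliant samples; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical reasoning traces, each scored against one instruction.
results = [
    is_uppercase("2 + 2 = FOUR"),
    avoids_words("The sum is large", ["sum"]),
    ends_with("Step one, step two. DONE", "DONE"),
    is_uppercase("Two plus two is four"),
]
print(f"compliance rate: {compliance_rate(results):.0%}")
```

A real evaluation would apply the check to the model's reasoning trace rather than its final answer, and each suite defines its own rules for each instruction type.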

The fine-tuning setup

The team ran four fine-tuning experiments, one each for GPT-OSS-20B, GPT-OSS-120B, Qwen-3-8B, and Qwen-3-32B. Each model was fine-tuned on approximately 950 examples drawn from HuggingFace’s Multilingual Thinking dataset paired with ReasonIF instructions. A checkpoint was taken every 60 batches, corresponding to 240 samples.
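
The checkpoint figures imply a batch size of four. The short calculation below works that out and lists where checkpoints would fall, assuming a single pass over exactly 950 examples with a constant batch size. The article gives only an approximate example count and does not state the number of epochs, so the checkpoint list is illustrative.

```python
# Figures from the article.
batches_per_checkpoint = 60
samples_per_checkpoint = 240
approx_examples = 950  # "approximately 950" in the article

# Implied batch size: 240 samples / 60 batches.
batch_size = samples_per_checkpoint // batches_per_checkpoint

# Assumption: one epoch, constant batch size, only full batches counted.
total_batches = approx_examples // batch_size
checkpoint_batches = list(
    range(batches_per_checkpoint, total_batches + 1, batches_per_checkpoint)
)
samples_at_checkpoint = [b * batch_size for b in checkpoint_batches]

print(batch_size)             # 4
print(checkpoint_batches)     # [60, 120, 180]
print(samples_at_checkpoint)  # [240, 480, 720]
```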

To generate the supervised fine-tuning (SFT) dataset, the team first produced model rollouts without the reasoning instruction, then edited those rollouts to comply with the instruction. The fine-tuning datasets were small: approximately 100K to 300K tokens per model.
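
The two-step construction can be sketched as follows, using the uppercase instruction because the edit is trivial to express. The record layout, the function names, and the edit itself are hypothetical; the article does not say how the team edited rollouts or how the training records were formatted.

```python
from typing import Callable


def edit_to_uppercase(reasoning: str) -> str:
    """Hypothetical edit: make an existing reasoning trace comply with
    an instruction to reason in uppercase."""
    return reasoning.upper()


def build_sft_example(
    problem: str,
    instruction: str,
    rollout_reasoning: str,
    rollout_answer: str,
    edit: Callable[[str], str],
) -> dict[str, str]:
    """Pair a prompt that includes the instruction with a reasoning trace
    that was generated WITHOUT the instruction and then edited to comply."""
    return {
        "prompt": f"{problem}\n\n{instruction}",
        "reasoning": edit(rollout_reasoning),
        "answer": rollout_answer,
    }


# Step 1 (assumed to have happened already): a rollout produced from the
# problem alone, with no reasoning instruction in the prompt.
rollout_reasoning = "Two plus two is four."
rollout_answer = "4"

# Step 2: edit the rollout so that it follows the instruction.
example = build_sft_example(
    problem="What is 2 + 2?",
    instruction="Use uppercase letters in your analysis channel.",
    rollout_reasoning=rollout_reasoning,
    rollout_answer=rollout_answer,
    edit=edit_to_uppercase,
)
print(example["reasoning"])  # TWO PLUS TWO IS FOUR.
```

In this sketch, editing an existing rollout rather than sampling a compliant one keeps the content of the reasoning unchanged and alters only its form.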

Results

Across all four models, controllability increased on both the training evaluation (ReasonIF) and the out-of-distribution evaluation (CoTControl). On CoTControl, the average compliance rate rose from 2.9% to 8.8%.

GPT-OSS models showed larger increases on CoTControl than Qwen models. The METR post reports the largest gains for instructions that ask models to reason in a specified case, suppress certain words, and add provided sentences to the reasoning.

The team also examined two relationships. First, compliance rate decreases as reasoning effort increases for both evaluation sets, and this trend generally continues after fine-tuning for GPT-OSS models. Second, model size correlates with controllability.

The post notes that when the instruction type most similar between the two evaluation suites, “Use uppercase letters”, is removed from CoTControl, the average controllability increase drops only slightly, from 5.9 to 5.0 percentage points, supporting the claim that the generalisation is genuine rather than an artefact of evaluation overlap.
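
The figures quoted in this section fit together arithmetically. The check below treats the 5.9 figure as the percentage-point difference between the 2.9% and 8.8% CoTControl averages, which is an inference from the numbers rather than something the article states. The variable names are illustrative.

```python
# Average CoTControl compliance rates reported in the article (percent).
baseline = 2.9
after_finetuning = 8.8

# Absolute gain in percentage points.
gain_pp = round(after_finetuning - baseline, 1)

# Relative change: how many times higher the rate is after fine-tuning.
ratio = round(after_finetuning / baseline, 1)

# Gain reported once "Use uppercase letters" is removed (percentage points).
gain_without_uppercase_pp = 5.0
share_retained = round(gain_without_uppercase_pp / gain_pp, 2)

print(gain_pp)         # 5.9
print(ratio)           # 3.0
print(share_retained)  # 0.85
```

On these numbers, roughly 85% of the average gain remains after the overlapping instruction type is excluded.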

Limitations the authors identify

METR lists several caveats. The authors note that frontier labs seem unlikely to fine-tune directly for CoT controllability, which makes the experimental setup somewhat unrealistic. However, the post argues this cuts the other way: the fact that a small fine-tuning run can elicit the capability suggests it is already latent in the model rather than something the model has to learn from scratch.

The team has not tested whether increased controllability results in decreased monitorability, and describes this as future work. The effects in multi-turn or agentic settings are also unclear. The post acknowledges that CoTControl may be less out-of-distribution from ReasonIF than it appears, given structural similarities between the two suites.

The 8.8% average compliance rate remains low in absolute terms. The METR post frames the finding not as demonstrating a solved capability, but as evidence that low controllability “may not be very robust to accidental optimization pressure” — that is, improvements could emerge from training dynamics unrelated to any deliberate design choice.

Code for replicating the experiments is available at github.com/keing1/cot_controllability/.