Transcoder Adapters for Reasoning-Model Diffing

Nathan Hu1, Jake Ward2, Thomas Icard1, Christopher Potts1
1Stanford University, 2MATS

TLDR: We learn sparse, interpretable approximations of how reasoning fine-tuning changes MLP computation.

[Figure: Transcoder adapters method overview]

Abstract

While reasoning models are increasingly prevalent, the effects of reasoning training on a model's internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model's internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50–90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. Examining adapter features, we find that only ~8% have activating examples directly related to reasoning behaviors. We study one such behavior in depth: the production of hesitation tokens (e.g., ‘wait’). Using attribution graphs, we trace hesitation to only ~2.4% of the 5.6k total adapter features, each performing one of two functions. These features are necessary and sufficient for producing hesitation tokens; removing them reduces response length, often without affecting accuracy. Overall, our results provide insight into reasoning training and suggest transcoder adapters may be useful for studying fine-tuning more broadly.
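The core objective can be sketched in a few lines: a sparse transcoder is trained so that its features reconstruct the *difference* in MLP outputs between the base and fine-tuned models. The sketch below is illustrative only; the dimensions, initialization, and stand-in linear "MLPs" are assumptions for the toy example, not the paper's actual architecture or training recipe.

```python
import random

random.seed(0)
D, F = 4, 8  # toy residual-stream width and number of adapter features

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, u) for u in v]

# Stand-ins for the base and fine-tuned MLPs (the real ones are Qwen2.5 layers).
W_base = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(D)]
W_ft   = [[w + random.gauss(0, 0.1) for w in row] for row in W_base]

def mlp_base(x): return matvec(W_base, x)
def mlp_ft(x):   return matvec(W_ft, x)

# Transcoder adapter: sparse features whose decoder output approximates
# the change in MLP computation, delta(x) = mlp_ft(x) - mlp_base(x).
W_enc = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(F)]
W_dec = [[random.gauss(0, 0.5) for _ in range(F)] for _ in range(D)]

def adapter(x):
    feats = relu(matvec(W_enc, x))  # sparse, nonnegative features
    recon = matvec(W_dec, feats)    # predicted delta in MLP output
    return feats, recon

def adapter_loss(x, l1_coeff=1e-3):
    """Reconstruction error on the MLP-output difference, plus an L1
    sparsity penalty encouraging few active features per input."""
    delta = [a - b for a, b in zip(mlp_ft(x), mlp_base(x))]
    feats, recon = adapter(x)
    mse = sum((r - d) ** 2 for r, d in zip(recon, delta)) / D
    sparsity = sum(feats)  # feats are nonnegative after ReLU
    return mse + l1_coeff * sparsity

x = [random.gauss(0, 1) for _ in range(D)]
print(adapter_loss(x) >= 0.0)
```

In practice one would minimize this loss over activations collected from both models, but the structure above is the whole idea: the adapter's decoder explains only the fine-tuning-induced change, so its sparse features can be inspected individually.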

BibTeX

@misc{hu2026transcoderadaptersreasoningmodeldiffing,
  title={Transcoder Adapters for Reasoning-Model Diffing},
  author={Nathan Hu and Jake Ward and Thomas Icard and Christopher Potts},
  year={2026},
  eprint={2602.20904},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.20904},
}