ICLR 2026 LIT Workshop · Under review · EMNLP 2026

Mechanistic Evidence for Faithfulness Decay
in Chain-of-Thought Reasoning

Donald Ye, Max Loffgren, Om Kotadia, Linus Wong, and Jonas Rohweder
ICLR 2026 Workshop on LLM Interpretability & Transparency (LIT) · arXiv:2602.11201

Abstract

Chain-of-thought (CoT) prompting is widely assumed to elicit faithful reasoning in large language models — yet whether intermediate steps causally drive the final answer remains poorly understood. We introduce Normalized Logit Difference Decay (NLDD), a metric that quantifies the causal influence of each reasoning step on the final prediction via targeted corruptions. Across Gemma-2 and related models, we identify a Reasoning Horizon k* at 70–85% of chain length: before k*, CoT steps are causally active; after k*, reasoning collapses into post-hoc rationalization. We further document an anti-faithful regime in Gemma-2 where later steps actively suppress correct reasoning.
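The abstract describes NLDD only at a high level, so the following is a minimal sketch of one plausible reading rather than the paper's exact definition. It assumes the metric compares the logit margin between the correct answer token and a foil token on the clean chain against the same margin after step k is corrupted, normalized by the clean margin. The names `logit_difference`, `nldd`, `correct_id`, and `foil_id` are hypothetical and introduced here for illustration.

```python
from typing import Sequence


def logit_difference(logits: Sequence[float], correct_id: int, foil_id: int) -> float:
    """Margin between the correct-answer logit and a foil-answer logit."""
    return logits[correct_id] - logits[foil_id]


def nldd(
    clean_logits: Sequence[float],
    corrupted_logits: Sequence[float],
    correct_id: int,
    foil_id: int,
    eps: float = 1e-8,
) -> float:
    """Assumed form of a normalized logit-difference drop for one corrupted step.

    Returns (clean_margin - corrupted_margin) / |clean_margin|.
    This normalization is an assumption, not taken from the paper.
    """
    clean = logit_difference(clean_logits, correct_id, foil_id)
    corrupted = logit_difference(corrupted_logits, correct_id, foil_id)
    if abs(clean) < eps:
        raise ValueError("clean logit margin is too small to normalize")
    return (clean - corrupted) / abs(clean)


# Toy final-position logits over a two-token vocabulary: [correct, foil].
clean = [2.0, 0.0]
after_corrupting_early_step = [1.0, 0.0]  # margin shrinks
after_corrupting_late_step = [3.0, 0.0]   # margin grows

early_score = nldd(clean, after_corrupting_early_step, correct_id=0, foil_id=1)
late_score = nldd(clean, after_corrupting_late_step, correct_id=0, foil_id=1)
```

Under this assumed form, a positive score means the corrupted step was supporting the correct answer, a score near zero means the step had little causal effect, and a negative score means the answer margin improved once the step was corrupted, which is one way an anti-faithful step could show up.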


Key Findings
Reasoning Horizon k* (70–85% of chain length): CoT steps before k* causally drive the answer. After k*, the model is rationalizing, not reasoning.
New Metric (NLDD): Normalized Logit Difference Decay measures causal faithfulness at each step via targeted token corruptions.
Gemma-2 Anomaly (anti-faithful regime): Later CoT steps in Gemma-2 actively suppress correct reasoning, a distinct failure mode not seen in other models.

Figure 1. Left: the causal intervention framework — corrupting step k and measuring logit shift. Right: the Reasoning Horizon curve, where NLDD peaks at k* and decays post-horizon.
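As a rough illustration of the right-hand panel, the sketch below locates a horizon from a list of per-step NLDD scores. It takes the caption literally and assumes k* is the step where NLDD peaks; the paper's actual criterion may differ. The function name `reasoning_horizon` and the score values are hypothetical, made up for this example and not results from the paper.

```python
from typing import Sequence, Tuple


def reasoning_horizon(nldd_per_step: Sequence[float]) -> Tuple[int, float]:
    """Return (k_star, k_star / chain_length) for a chain of per-step scores.

    Assumption: k* is the 1-indexed step with the highest NLDD score.
    Ties resolve to the earliest such step.
    """
    if not nldd_per_step:
        raise ValueError("need at least one reasoning step")
    peak_index = max(range(len(nldd_per_step)), key=lambda k: nldd_per_step[k])
    k_star = peak_index + 1  # convert to 1-indexed step number
    return k_star, k_star / len(nldd_per_step)


# Illustrative scores for a 10-step chain (not data from the paper).
scores = [0.1, 0.2, 0.3, 0.5, 0.4, 0.6, 0.7, 0.9, 0.2, 0.0]
k_star, relative_position = reasoning_horizon(scores)
```

For these toy scores the peak falls at step 8 of 10, a relative position of 0.8, which sits inside the 70–85% band the abstract reports.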


Citation

BibTeX

@inproceedings{ye2026mechanistic,
  title={Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning},
  author={Donald Ye and Max Loffgren and Om Kotadia and Linus Wong and Jonas Rohweder},
  booktitle={Workshop on Latent {\&} Implicit Thinking {\textendash} Going Beyond CoT Reasoning},
  year={2026},
  url={https://openreview.net/forum?id=wVj7dB7waI}
}