
The Gradient-Causal Gap:
When Attribution Fails Interpretability

Donald Ye
Submitted · ACL 2026 TrustNLP Workshop · arXiv:2602.01442

Abstract

Removing "important" high-gradient components from a neural network can improve generalization, while removing "unimportant" low-gradient components can destroy it. We demonstrate this paradox by formalizing the Gradient-Causal Gap. While parameter-normalized gradient magnitude and causal importance align on a simpler task (ρ = 0.59, reversal), this relationship collapses on a harder task (ρ = 0.07, sorting), with four seeds exhibiting negative correlation (as low as ρ = −0.34). Unlike prior input-level sanity checks that rely on correlational evidence, we validate this failure causally: ablating low-gradient "Hidden Heroes" annihilates OOD accuracy (−64%), while ablating high-gradient "Gradient Bloats" causes task-dependent damage, ranging from marginal (−9.5%) where alignment has collapsed to severe (−27%) where alignment remains moderate, confirming that gradient magnitude is an unreliable proxy regardless of direction. Critically, the gap is not random noise but a structured phenomenon with predictable layer-wise organization, worsening as task complexity increases. These findings suggest that gradient-based attribution, widely used in NLP interpretability and model auditing, may systematically misidentify the components that drive model behavior.
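The core measurement behind these numbers is a Spearman rank correlation between gradient-based importance and ablation-based (causal) importance, one score per component. A minimal sketch of that comparison is below; the component scores are hypothetical illustrative values, not the paper's data, and `gradient_causal_gap` is a name introduced here for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def gradient_causal_gap(grad_scores, causal_scores):
    """Spearman rank correlation between gradient-based and
    ablation-based importance scores (one value per component).
    A rho near 1 means gradients track causal importance;
    near 0 or negative means they do not."""
    rho, _ = spearmanr(grad_scores, causal_scores)
    return rho

# Hypothetical scores for 8 components where the two rankings
# strongly disagree (illustrative only, not from the paper).
grad = np.array([0.9, 0.8, 0.1, 0.2, 0.7, 0.05, 0.6, 0.3])
causal = np.array([0.1, 0.2, 0.9, 0.8, 0.3, 0.95, 0.25, 0.4])
rho = gradient_causal_gap(grad, causal)  # strongly negative here
```

In this toy setup the highest-gradient components are the least causally important, so the correlation comes out strongly negative, mirroring the worst-case seeds on the Sort task.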

Keywords: Hidden Heroes · Gradient Bloats · Causal Tracing · Activation Patching

Key Findings

- 70% missed components — standard gradient attribution misses up to 70% of causally important transformer components on sorting tasks.
- ρ = 0.07 Sort-task correlation — near-zero Spearman correlation between gradient rank and causal rank on the Sort task, essentially random attribution.
- MLP Hidden Heroes pattern — MLP layers (L0, L1) consistently appear as Hidden Heroes: causally dominant but gradient-suppressed across seeds.
[Figure: Layer-wise gradient vs. causal rank difference]

Figure 1. Mean rank difference (Gradient − Causal) per component. Green = Hidden Heroes (undervalued by gradients); red = Gradient Bloats (overvalued). The Sort task (right panel) shows near-total gradient-causal misalignment.
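The figure's per-component statistic is simply the difference between the two rank orderings. The sketch below assumes ascending ranks (rank 1 = least important), so a negative difference marks a Hidden Hero and a positive one a Gradient Bloat, matching the caption's color coding; the scores and the `rank_difference` helper are hypothetical, introduced here for illustration.

```python
import numpy as np
from scipy.stats import rankdata

def rank_difference(grad_scores, causal_scores):
    """Per-component rank difference (Gradient − Causal).

    With ascending ranks (rank 1 = least important), a negative
    difference means the gradient undervalues the component
    (Hidden Hero); a positive one means it overvalues it
    (Gradient Bloat)."""
    return rankdata(grad_scores) - rankdata(causal_scores)

# Hypothetical scores for four components (not the paper's data):
# component 0 is gradient-suppressed but causally dominant.
grad = np.array([0.05, 0.9, 0.5, 0.7])
causal = np.array([0.95, 0.1, 0.5, 0.6])
diffs = rank_difference(grad, causal)
```

Here component 0 gets a strongly negative difference (a Hidden Hero) and component 1 a strongly positive one (a Gradient Bloat); rank differences always sum to zero, so the two patterns necessarily trade off.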


Citation
@misc{ye2026gradientcausalgapgradientimportance,
  title={The Gradient-Causal Gap: Why Gradient Importance Fails on Complex Tasks},
  author={Donald Ye},
  year={2026},
  eprint={2602.01442},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.01442},
}