arXiv 2602.01442 Under review · ACL TrustNLP

The Gradient-Causal Gap:
When Attribution Fails Interpretability

Donald Ye

Submitted · ACL 2026 TrustNLP Workshop · arXiv:2602.01442


Abstract

Removing 'important' high-gradient components from a neural network can improve generalization, while removing 'unimportant' low-gradient components can destroy it. We demonstrate this paradox by formalizing the Gradient-Causal Gap. While parameter-normalized gradient magnitude and causal importance align on a simpler task ($\rho=0.59$, reversal), this relationship collapses on a harder task ($\rho=0.07$, sorting), with four seeds exhibiting negative correlation (as low as $\rho=-0.34$). Unlike prior input-level sanity checks that rely on correlational evidence, we validate this failure causally: ablating low-gradient 'Hidden Heroes' annihilates OOD accuracy ($-64\%$), while ablating high-gradient 'Gradient Bloats' causes task-dependent damage, ranging from marginal ($-9.5\%$) where alignment has collapsed to severe ($-27\%$) where alignment remains moderate, confirming that gradient magnitude is an unreliable proxy regardless of direction. Critically, the gap is not random noise but a structured phenomenon with predictable layer-wise organization, worsening as task complexity increases. These findings suggest that gradient-based attribution, widely used in NLP interpretability and model auditing, may systematically misidentify the components that drive model behavior.

↙ Hidden Heroes ↗ Gradient Bloats Causal Tracing Activation Patching

Key Findings

70%

Missed components

Standard gradient attribution misses up to 70% of causally important transformer components on sorting tasks.

ρ = 0.07

Sort task correlation

Near-zero Spearman correlation between gradient rank and causal rank on the Sort task: a component's gradient rank says almost nothing about its causal rank.

MLPs

Hidden Heroes pattern

MLP layers (L0, L1) consistently appear as Hidden Heroes — causally dominant but gradient-suppressed across seeds.
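The correlation figures above are Spearman rank correlations between a gradient-based importance score and a causal (ablation-based) importance score per component. As a rough illustration of that comparison, here is a minimal pure-Python sketch. The function names, the tie handling, and the example scores are assumptions made for illustration; they are not taken from the paper's code or data.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-component scores (made up for illustration, not paper data).
gradient_importance = [0.90, 0.10, 0.60, 0.20, 0.40]
causal_importance = [0.05, 0.70, 0.30, 0.50, 0.35]

print(f"rho = {spearman_rho(gradient_importance, causal_importance):.2f}")
```

A value near 1 would mean gradient rank tracks causal rank, a value near 0 would mean the two rankings are unrelated, and a negative value would mean gradients tend to rank the causally important components last.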

Layer-wise gradient vs causal rank difference

Figure 1. Mean rank difference (Gradient − Causal) per component. Green = Hidden Heroes (undervalued by gradients). Red = Gradient Bloats (overvalued). Sort task (right) shows near-total gradient-causal misalignment.
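The rank difference in Figure 1 can be sketched for a single seed as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes rank 1 means most important (so a positive Gradient − Causal difference marks a component that gradients undervalue), and the component names, scores, and the `threshold` parameter are hypothetical. The paper's sign convention may differ, and the figure reports the mean over seeds rather than one run.

```python
def importance_ranks(scores):
    """Map component name -> rank, where rank 1 is the highest score.

    Ties are broken alphabetically by name (an arbitrary choice for this sketch).
    """
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def classify_components(grad_scores, causal_scores, threshold=2):
    """Label each component by its rank difference (gradient rank - causal rank).

    Assumed convention: rank 1 = most important, so
      diff >= +threshold -> "hidden_hero"    (gradients undervalue it)
      diff <= -threshold -> "gradient_bloat" (gradients overvalue it)
      otherwise          -> "aligned"
    """
    grad_rank = importance_ranks(grad_scores)
    causal_rank = importance_ranks(causal_scores)
    result = {}
    for name in grad_scores:
        diff = grad_rank[name] - causal_rank[name]
        if diff >= threshold:
            label = "hidden_hero"
        elif diff <= -threshold:
            label = "gradient_bloat"
        else:
            label = "aligned"
        result[name] = (diff, label)
    return result


# Hypothetical scores for a two-layer model (made up, not paper data).
grad_scores = {"L0.attn": 0.9, "L0.mlp": 0.1, "L1.attn": 0.6, "L1.mlp": 0.2}
causal_scores = {"L0.attn": 0.05, "L0.mlp": 0.7, "L1.attn": 0.3, "L1.mlp": 0.5}

for name, (diff, label) in sorted(classify_components(grad_scores, causal_scores).items()):
    print(f"{name}: rank difference {diff:+d} -> {label}")
```

Averaging the per-seed differences for each component before applying the threshold would give a quantity closer to the mean rank difference the figure describes.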

Citation

@misc{ye2026gradientcausalgapgradientimportance,
  title={The Gradient-Causal Gap: Why Gradient Importance Fails on Complex Tasks},
  author={Donald Ye},
  year={2026},
  eprint={2602.01442},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.01442},
}

