A Unified View of Attention Sinks and Residual Sinks: LLM 'Outliers' as a Training-Stability Mechanism
If you spend time around LLMs, you eventually hear about the odd behavior where attention keeps concentrating on the first token in a sequence. This phenomenon, known as an attention sink, was a long-running mystery, but a recent paper gives a convincing explanation.
Paper overview
A Unified View of Attention and Residual Sinks: Outlier-driven Rescaling is Essential for Transformer Training (January 2026)
The paper offers a unified explanation for two different outlier phenomena observed in Transformers.
What are attention sinks
In a Transformer’s attention mechanism, an attention sink is a phenomenon where a specific token, usually near the beginning of the sequence, receives abnormally high attention weight.
In practice, that token is not necessarily important. Even so, the behavior appears consistently across many models.
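To see the softmax side concretely, here is a minimal NumPy sketch (my own toy example, not from the paper) of how a large logit on a sink token uniformly rescales the attention paid to every other token:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical attention logits for 5 tokens; token 0 plays the sink.
logits = np.array([0.0, 1.0, 0.5, -0.5, 0.2])

# Without a sink, attention is distributed by content alone.
plain = softmax(logits)

# A large logit on token 0 (the sink) absorbs most of the mass...
sink_logits = logits.copy()
sink_logits[0] += 4.0
sunk = softmax(sink_logits)

# ...which rescales the weights on the remaining tokens: their
# relative proportions are unchanged, only their total shrinks.
ratio = sunk[1:] / plain[1:]
print(ratio)  # every entry is the same constant: a pure rescaling
```

Because softmax only normalizes by the sum of exponentials, boosting one logit divides all other weights by the same constant. That is exactly the "rescaling knob" role the paper attributes to the sink token.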
What are residual sinks
In a Transformer’s residual connections, a residual sink is a phenomenon where specific dimensions keep extremely large activation values.
Again, it is not because those dimensions have some uniquely meaningful semantic role, yet the large values remain.
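The RMSNorm side can be sketched the same way. In this toy example (again mine, with dimension 0 as a hypothetical sink), the outlier dominates the root-mean-square, so every other dimension gets divided down by the same factor:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    # RMSNorm divides by the root-mean-square of the vector.
    return x / np.sqrt(np.mean(x ** 2) + eps)

# Hypothetical hidden state: dimension 0 holds a residual sink.
h = np.array([100.0, 1.0, -2.0, 0.5])

out = rmsnorm(h)

# The huge value in dim 0 dominates the RMS, so the remaining
# dimensions are all shrunk by (roughly) the outlier's magnitude.
print(out)
```

Growing or shrinking the sink value thus acts like a shared gain control on every other dimension, which is the residual-side analogue of the softmax effect above.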
The unified explanation: outlier-driven rescaling
The paper’s claim is straightforward: both phenomena serve the same function, pairing with a normalization step to rescale the other components.
- Attention sinks work with softmax to adjust how attention is distributed across other tokens
- Residual sinks work with RMSNorm to adjust the values of other dimensions
In other words, these outliers are not meaningless dumping grounds. They act as implicit rescaling knobs that help stabilize training.
Practical outcome
Based on that understanding, the paper also proposes improvements:
- Absorbing the outliers into learnable parameters
- Explicit control through gated rescaling
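The paper's exact formulation is not reproduced here, but one common way to make the sink's job explicit is to give each head a learnable "null" attention slot instead of relying on an actual token; `gated_attention` and `null_logit` below are hypothetical names for illustration only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Sketch of the idea: rather than letting a sink token soak up
# attention mass, add a learnable scalar logit ("null slot") that
# plays the same rescaling role as an explicit parameter.
def gated_attention(logits, null_logit):
    # Softmax over tokens plus the null slot, then drop the slot:
    # the remaining token weights sum to less than 1.
    w = softmax(np.append(logits, null_logit))
    return w[:-1]

logits = np.array([1.0, 0.5, -0.5, 0.2])
w = gated_attention(logits, null_logit=2.0)
print(w.sum())  # < 1: the null slot absorbed part of the mass
```

The appeal of absorbing the outlier into a parameter like this is that the rescaling no longer lives in an extreme activation, which is exactly the kind of value that quantization handles worst.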
The reported results are:
- An average improvement of 2 points
- Better robustness under quantization, such as INT8
Mostly irrelevant if you only use LLMs
To be honest, there is almost nothing here that directly helps people who only consume LLMs through an API. Even for local inference, if you run existing models as-is, this insight changes little in practice.
Still, it does help explain why quantized models sometimes lose quality. If quantization crushes these outliers, the rescaling mechanism breaks down. The obvious countermeasures are still the usual ones, such as using a larger model or a higher bit-width quantization setting.
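A quick toy example shows why naive per-tensor INT8 struggles when one dimension is a sink: the outlier sets the quantization scale, so the small, information-carrying values absorb most of the rounding error. (This is a generic quantization sketch of my own, not the paper's experiment.)

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor INT8: one scale for the whole vector,
    # chosen so the largest magnitude maps to 127.
    scale = np.abs(x).max() / 127
    q = np.round(x / scale).clip(-127, 127)
    return q * scale  # dequantized values

# Hypothetical hidden state with a residual-sink outlier in dim 0.
h = np.array([100.0, 1.0, -2.0, 0.5])

dq = quantize_int8(h)
err = np.abs(dq - h)
# The outlier fixes the step size at ~0.79, so the small
# dimensions are rounded very coarsely relative to their size.
print(err)
```

The sink itself survives almost exactly, but the fine-grained values around it get crushed, which is one plausible reading of why breaking the rescaling mechanism degrades quantized models.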
It is an interesting paper mainly because it gives a clearer look at what is happening inside LLMs.