MoonshotAI (Kimi) has proposed AttnRes, which replaces the Transformer's residual connections with attention and is 1.25× more compute-efficient.
On March 16, 2026, the MoonshotAI (Kimi) team published Attention Residuals (AttnRes) on arXiv (arXiv:2603.15031), a method that replaces the Transformer's residual connections with attention in the depth direction. The implementation is available on GitHub.
What is a residual connection?
The expressive power of a neural network improves as it gets deeper, but if the layers are stacked too deep, the "vanishing gradient" problem (the learning signal dying out in intermediate layers) appears and performance degrades. The residual connection (skip connection), proposed in ResNet (residual networks) in 2015, is a simple mechanism that adds the input directly to each layer's processing result.
```mermaid
graph TD
    IN[Layer input] --> ATT[Attention<br/>or FFN]
    ATT --> ADD((+))
    IN -->|residual connection| ADD
    ADD --> OUT[Output]
```
Written as a formula: h_l = h_{l-1} + f(h_{l-1}), where f is the layer's computation (Attention or FFN) and the input h_{l-1} is added unchanged to its output. Gradients can thus reach deep layers through this "shortcut," making networks of 100+ layers stably trainable. Today's Transformers use this residual connection in every layer.
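The update above fits in a few lines. The following toy sketch (ours, with a random linear map standing in for Attention/FFN) shows the identity path that lets gradients flow through many layers:

```python
# Minimal sketch of a residual block: h_l = h_{l-1} + f(h_{l-1}).
# A tanh of a random linear map stands in for the Attention/FFN sublayer.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                    # hidden size (illustrative)
W = rng.normal(scale=d ** -0.5, size=(d, d))

def f(h):
    """Stand-in for one layer's Attention/FFN computation."""
    return np.tanh(h @ W)

def residual_step(h):
    """The layer's output is *added* to its input (the identity shortcut)."""
    return h + f(h)

h = rng.normal(size=d)
for _ in range(48):                       # the identity path carries gradients,
    h = residual_step(h)                  # so even 48+ layers train stably
```

Every layer contributes additively, which is exactly the property the article examines next.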
What happens in a normal Transformer
Residual connections were a breakthrough in deep learning, but let's first review how a standard Transformer handles information.
Consider a 48-layer Transformer. Once the input token's embedding vector enters the first layer, each layer repeats the following:
1. Receive the output of the previous layer
2. Process it with Attention (or FFN)
3. Add the result to the previous layer's output as-is (residual connection)
4. Pass the sum to the next layer
The key is step 3. Since every layer adds with a fixed weight of 1, the output of the 48th layer is an even mixture of all 48 layers' processing results. There is no mechanism by which a layer can decide "this layer's information is important" or "this layer's information is unnecessary."
```mermaid
graph TD
    subgraph ST["Standard Transformer"]
    X0[Input embedding] --> L1[Layer 1]
    L1 -->|"+1"| A1((+))
    X0 -->|"+1"| A1
    A1 --> L2[Layer 2]
    L2 -->|"+1"| A2((+))
    A1 -->|"+1"| A2
    A2 --> L3[Layer 3 ...]
    L3 -->|"+1"| A3((+))
    A2 -->|"+1"| A3
    A3 --> OUT1[Final output]
    end
```
This structure is the same in CNNs for image recognition and in Transformers for language models, and it has remained virtually unchanged since ResNet in 2015.
Dilution problem
What exactly does equal addition cause?
For example, suppose Layer 30 learns that "the programming language's type information matters in this context" and emits a strong signal. Before the final output is reached, however, the outputs of layers 31 through 48 are added with the same weight, so layer 30's signal is diluted across 18 further additions.
There is also the opposite problem: low-level syntactic information emitted by the 5th layer persists with the same weight to the very end. By the time the 48th layer is performing high-level reasoning, stale information from the early layers is mixed in as noise.
This is the dilution problem: the norm of the hidden state grows as O(L) with the number of layers L. It is especially pronounced in modern LLMs, where PreNorm (applying LayerNorm before Attention/FFN) is standard and each layer's relative contribution keeps shrinking.
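A quick numerical illustration of the dilution argument (our own toy, not from the paper): with weight-1 additions, the final state is a plain sum of layer outputs, so any single layer's share of the total is about 1/L no matter what it learned.

```python
# Dilution under equal-weight residual addition: random vectors stand in
# for the per-layer outputs f(h); the final state is simply their sum.
import numpy as np

rng = np.random.default_rng(0)
L, d = 48, 512
outputs = rng.normal(size=(L, d))      # stand-ins for the per-layer outputs
final = outputs.sum(axis=0)            # what equal-weight residuals accumulate

# Share of layer 30's signal among all contributions to the final state:
share = np.linalg.norm(outputs[30]) / sum(np.linalg.norm(o) for o in outputs)
print(f"layer 30 share: {share:.3f}")  # ~1/48, with no way to raise or lower it
```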
Fundamental differences between regular deep learning and AttnRes
In ordinary deep learning (all ResNet-style architectures), information propagation between layers is statically hard-coded. The designer fixes the previous layer's output to be added with weight 1, and that allocation never changes regardless of the input. Whether the task is a math problem, code generation, or translation, information is mixed in exactly the same even proportions.
| | Regular deep learning | AttnRes |
|---|---|---|
| Information propagation between layers | Fixed (addition of weight 1) | Dynamic (softmax attention) |
| Reference | Only the previous layer | All previous layers |
| Adaptation to input | None | Weights change for each input |
| Unnecessary layer information | Accumulates as is | Can be suppressed by lowering the weight |
| Information from distant layers | Reachable only through intermediate layers | Can be directly referenced |
AttnRes dynamically determines “which previous layer’s information to use and how much” according to the input. For code generation tasks, it is possible to increase the weight of layers that are strong in syntax analysis, and for mathematical inference, it is possible to increase the weight of layers that capture logical structure.
Inspiration from the success of Transformer
The paper points out an important analogy: the Transformer succeeded by replacing the RNN's fixed-weight aggregation with softmax attention along the time axis.
| | Fixed-weight aggregation | Attention aggregation |
|---|---|---|
| Time direction | RNN (carries the previous step with a fixed weight) | Transformer (selects over all timesteps with softmax) |
| Depth direction | Residual connection (adds the previous layer with weight 1) | AttnRes (selects over all previous layers with softmax) |
This mirrors how the Transformer used attention to overcome the RNN's inability to retain long-range information. AttnRes applies the same idea in the depth direction, using attention to get past the limits of fixed weights.
How Attention Residuals works
While a normal residual connection adds the previous layer's output "as-is," AttnRes combines the outputs of all previous layers, weighted by softmax attention.
Written as a formula: h_l = Σ_{i<l} α_{i→l} g_i, where g_i is the output of layer i and α_{i→l} is the softmax-normalized attention weight. In a normal residual connection these weights are all equal; AttnRes replaces them with learnable weights.
```mermaid
graph TD
    subgraph AttnRes
    Y0[Input embedding] --> M1[Layer 1]
    M1 --> S1{softmax<br/>attention}
    Y0 -->|"α₀→₁"| S1
    S1 --> M2[Layer 2]
    M2 --> S2{softmax<br/>attention}
    Y0 -->|"α₀→₂"| S2
    M1 -->|"α₁→₂"| S2
    S2 --> M3[Layer 3 ...]
    M3 --> S3{softmax<br/>attention}
    Y0 -->|"α₀→₃"| S3
    M1 -->|"α₁→₃"| S3
    M2 -->|"α₂→₃"| S3
    S3 --> OUT2[Final output]
    end
```
The attention weight is computed as α_{i→l} = softmax_i(q_l · RMSNorm(g_i) / √d), where q_l is layer l's query vector and g_i is the output of previous layer i.
There are two important design choices.
The query is a fixed vector per layer (input-independent). In experiments with input-dependent queries, the validation loss improved slightly (1.731 vs. 1.737), but decoding then required sequential memory accesses; fixed queries were adopted as a trade-off for throughput.
RMSNorm is applied to the previous layers' outputs before the dot product. This prevents layers with large norms from monopolizing the attention weight; removing it worsens the loss (1.743).
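Putting the two design choices together, the per-layer aggregation might look like the following sketch. Variable names and the √d scaling are our assumptions; the paper's exact parameterization may differ.

```python
# Hypothetical sketch of AttnRes-style depth attention for one layer:
# a fixed (input-independent) query per layer, RMSNorm on the keys.
import numpy as np

def rmsnorm(x, eps=1e-6):
    """Normalize each vector by its root-mean-square."""
    return x / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)

def depth_attention(prev_outputs, q_l):
    """Combine the outputs of all previous layers with softmax weights."""
    keys = rmsnorm(prev_outputs)               # keep large-norm layers from
    scores = keys @ q_l / np.sqrt(len(q_l))    # monopolizing the softmax
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                # softmax over previous layers
    return alpha, alpha @ prev_outputs         # Σ α_{i→l} g_i

rng = np.random.default_rng(0)
l, d = 6, 64
g = rng.normal(size=(l, d))    # outputs of layers 0..5
q = rng.normal(size=d)         # the layer's query: learned, input-independent
alpha, h_in = depth_attention(g, q)
```

Because q is fixed per layer, the scores over cached previous-layer outputs can be computed without any input-dependent sequential dependency, which is the throughput point made above.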
Comparison with past methods
The characteristics of each method can be summarized by expressing the aggregation in the depth direction as a matrix M.
| Method | Type of weight | Reference range |
|---|---|---|
| Residual connection (standard) | Fixed (equal) | Only the previous layer |
| DenseFormer | Static scalars | All previous layers |
| mHC | Dynamic (input-dependent) | m streams |
| AttnRes Full | Dynamic (softmax) | All previous layers |
| AttnRes Block | Dynamic (softmax) | Previous layers aggregated into blocks |
DenseFormer references all previous layers but with fixed weights, and in the ablation its loss (1.767) was essentially the same as the baseline's (1.766). That static weights (fixed regardless of input) yield no gain is the basis for the conclusion that input-dependent dynamic weighting is essential, and it is also why ordinary deep learning's fixed, design-time allocation was limited in the first place.
Block AttnRes: Practical efficiency
The problem with Full AttnRes is that the memory cost of holding all previous layers' outputs and the pipeline-parallel communication cost both grow in proportion to depth. Block AttnRes solves this.
Divide the L layers into N blocks (S = L/N layers each) and represent each block by the sum of its layers' outputs. Memory and communication costs then drop from O(Ld) to O(Nd).
According to experiments, performance almost equivalent to Full AttnRes can be obtained with N≈8 blocks.
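A minimal sketch of the block bookkeeping, under our reading of the text (shapes and names are illustrative):

```python
# Block AttnRes bookkeeping: L layer outputs are summarized into N block
# representations, each the sum of its S = L/N layers' outputs.
import numpy as np

rng = np.random.default_rng(0)
L, N, d = 48, 8, 64
S = L // N                                 # layers per block
layer_outputs = rng.normal(size=(L, d))    # stand-ins for every layer's output

# Downstream layers keep N vectors instead of L: O(Nd) memory, not O(Ld).
block_reps = layer_outputs.reshape(N, S, d).sum(axis=1)
print(block_reps.shape)                    # (8, 64)
```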
Optimization during training
Pipeline-parallel training requires each stage to receive block representations from the preceding stage. The paper introduces cross-stage caching, a design in which each physical stage caches the block representations of the previous virtual stage.
In a naive implementation, the communication volume grows with the cumulative number of pipeline chunks C, i.e. O(C²); with caching it is compressed to the number of physical pipeline stages P, i.e. O(P²), a V-fold improvement. In steady state (1F1B schedule) the communication overlaps completely with computation, so the actual training overhead is under 4%.
Being able to reuse the standard Transformer training pipeline almost as-is is of great practical importance; changes to the training code are minimal.
Optimization during inference
When decoding, Two-Phase Computation is used.
| Phase | Processing | Purpose |
|---|---|---|
| Phase 1 (parallel) | Compute inter-block attention for all S layers in one batch | Consolidate memory accesses into a single pass |
| Phase 2 (sequential) | Compute intra-block attention in each layer and merge with the Phase 1 result via online softmax | Input-dependent fine-grained computation |
By reducing reads of the block representations from S times to once, the added inference latency stays under 2%, maintaining almost the same decoding speed as a normal Transformer.
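The Phase-2 merge relies on the standard online-softmax identity: two partial softmax aggregations can be combined exactly without revisiting the first one's inputs. A generic sketch (not the paper's kernel):

```python
# Online softmax: each partial aggregation tracks (running max, normalizer,
# unnormalized weighted sum); two partials merge into one exact result.
import numpy as np

def partial_softmax(scores, values):
    """Aggregate one chunk: returns (max, normalizer, weighted sum)."""
    m = scores.max()
    w = np.exp(scores - m)
    return m, w.sum(), w @ values

def merge(p1, p2):
    """Combine two partial aggregations without recomputing either."""
    (m1, s1, a1), (m2, s2, a2) = p1, p2
    m = max(m1, m2)
    c1, c2 = np.exp(m1 - m), np.exp(m2 - m)
    return m, s1 * c1 + s2 * c2, a1 * c1 + a2 * c2

rng = np.random.default_rng(0)
scores, values = rng.normal(size=10), rng.normal(size=(10, 4))
p = merge(partial_softmax(scores[:6], values[:6]),   # e.g. inter-block part
          partial_softmax(scores[6:], values[6:]))   # e.g. intra-block part
combined = p[2] / p[1]

# Identical to the softmax computed in one shot over all scores:
w = np.exp(scores - scores.max())
full = (w / w.sum()) @ values
```

This is why Phase 1's batched result can be computed once and later fused with each layer's sequential Phase 2 scores.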
Prefilling a long context (128K tokens) requires 15GB to hold the block representations, but combining tensor-parallel sharding with chunked prefilling (16K chunks) reduces this to under 0.3GB.
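The reduction is mostly arithmetic. A rough back-of-the-envelope check (the tensor-parallel degree of 8 is our assumption, not stated in the text):

```python
# How 15 GB of block representations at 128K tokens could shrink below
# 0.3 GB: tensor-parallel sharding splits the vectors across ranks, and
# chunked prefill keeps only one 16K-token chunk live at a time.
tokens, chunk = 128 * 1024, 16 * 1024
full_gb = 15.0                         # figure quoted in the text
tp = 8                                 # assumed tensor-parallel degree

per_rank_per_chunk = full_gb / tp / (tokens // chunk)
print(f"{per_rank_per_chunk:.2f} GB")  # ≈ 0.23 GB, under the quoted 0.3 GB
```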
Demonstration on Kimi Linear 48B
Block AttnRes was integrated into the Kimi Linear architecture (48B total parameters / 3B activated, an MoE design similar to DeepSeek-V3) and pretrained on 1.4T tokens, with the following results.
| Benchmark | Baseline | AttnRes | Improvement |
|---|---|---|---|
| GPQA-Diamond | 36.9 | 44.4 | +7.5 |
| Math | 53.5 | 57.1 | +3.6 |
| HumanEval | 59.1 | 62.2 | +3.1 |
| MMLU | 73.5 | 74.6 | +1.1 |
| BBH | 76.3 | 78.0 | +1.7 |
| C-Eval | 79.6 | 82.5 | +2.9 |
| CMMLU | 82.0 | 82.9 | +0.9 |
The improvements are particularly large on multi-step reasoning (GPQA-Diamond) and code generation (HumanEval). These tasks require accurate reference to the abstract representations built in deep layers, exactly where the normal Transformer's uniform addition tends to dilute information, which is consistent with the hypothesis that selective information flow in the depth direction is effective.
Scaling laws
In validation across five sizes with 194M to 528M activated parameters, Block AttnRes reaches the baseline's validation loss with 1.25× less compute. Viewed the other way, at equal compute it delivers better performance.
Post-training analysis of the visualized per-layer attention weights reveals interesting patterns.
- The weight on the immediately preceding layer is highest, but selective long-range skip connections also emerge: the network spontaneously learns skip references that are impossible with normal residual connections.
- Layer specialization progresses, with different reference patterns before Attention and before the MLP: a division of labor emerges in which Attention layers refer to syntactic information in shallow layers while FFN layers refer to semantic information in deep layers.
It is also interesting that AttnRes leads networks to prefer "deep and thin" architectures. In an architecture search under a fixed parameter budget, AttnRes' optimum shifted to configurations deeper and thinner than the baseline's. With better information flow in the depth direction, the benefits of depth can finally be realized.