
MoonshotAI (Kimi) proposed AttnRes, which replaces the Transformer's residual connections with attention, achieving 1.25× better compute efficiency.


On March 16, 2026, the MoonshotAI (Kimi) team published Attention Residuals (AttnRes), a method that replaces the Transformer's residual connections with attention in the depth direction, on arXiv (arXiv:2603.15031). The implementation is available on GitHub.

What is residual connection?

The expressive power of a neural network improves as it gets deeper, but making it too deep causes the “vanishing gradient” problem (the learning signal dying out in intermediate layers) and performance degrades. The residual connection (skip connection), proposed in ResNet (residual network) in 2015, is a simple mechanism that adds the input directly to each layer's processing result.

graph TD
    IN[Layer input] --> ATT[Attention<br/>or FFN]
    ATT --> ADD((+))
    IN -->|residual connection| ADD
    ADD --> OUT[Output]

Written as a formula: h_l = h_{l-1} + f(h_{l-1}). Here f is each layer's processing (Attention or FFN), and the input h_{l-1} is added to its output as-is. This lets gradients reach deep layers through a “shortcut,” making stable training possible even in networks with more than 100 layers. Today's Transformers use this residual connection in every layer.
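As a minimal sketch (toy numpy code; `f` is an assumed stand-in for one layer's processing, not the actual Transformer implementation):

```python
# Toy sketch of a residual (skip) connection: h_l = h_{l-1} + f(h_{l-1}).
import numpy as np

rng = np.random.default_rng(0)
d = 8                                 # hidden dimension
W = rng.normal(size=(d, d)) * 0.1

def f(h):
    """Stand-in for one layer's Attention/FFN processing."""
    return np.tanh(h @ W)

h_prev = rng.normal(size=d)           # h_{l-1}: previous layer's output
h_next = h_prev + f(h_prev)           # the residual connection
```

Because the input is carried through untouched, the gradient of `h_next` with respect to `h_prev` always contains an identity term, which is what keeps learning signals alive in deep stacks.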

What happens in a normal Transformer

Residual connection was a breakthrough in deep learning, but let’s summarize how a normal Transformer handles information.

Let’s consider a 48-layer Transformer. Once the input token embedding vector enters the first layer, the following is repeated for each layer:

  1. receive the output of the previous layer
  2. Process with Attention (or FFN)
  3. Add processing result + output of previous layer as is (residual connection)
  4. pass to next layer

The key is step 3. Because every layer uses a fixed addition with weight 1, the output of the 48th layer is an even mixture of the processing results of all 48 layers. There is no mechanism for deciding “this layer's information is important” or “this layer's information is unnecessary.”
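The four steps can be sketched as a plain loop; this toy numpy version makes the fixed weight-1 mixing explicit (the layer function is an assumed stand-in, not a real Transformer layer):

```python
# Toy loop over 48 layers with standard weight-1 residual additions.
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 48
Ws = rng.normal(size=(n_layers, d, d)) * 0.05

def layer(h, W):
    """Stand-in for one layer's Attention/FFN processing."""
    return np.tanh(h @ W)

x0 = rng.normal(size=d)               # input token embedding
h, outputs = x0, []
for l in range(n_layers):
    y = layer(h, Ws[l])               # steps 1-2: process previous output
    outputs.append(y)
    h = h + y                         # step 3: fixed weight-1 addition
                                      # step 4: next loop iteration
# h is now the embedding plus an even, weight-1 sum of all 48 layer
# outputs -- no layer's contribution can be up- or down-weighted.
```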

graph TD
    subgraph "Standard Transformer"
    X0[Input embedding] --> L1[Layer 1]
    L1 -->|"+1"| A1((+))
    X0 -->|"+1"| A1
    A1 --> L2[Layer 2]
    L2 -->|"+1"| A2((+))
    A1 -->|"+1"| A2
    A2 --> L3[Layer 3 ...]
    L3 -->|"+1"| A3((+))
    A2 -->|"+1"| A3
    A3 --> OUT1[Final output]
    end

This structure is shared by CNNs for image recognition and Transformers for language models alike, and has remained virtually unchanged since ResNet in 2015.

Dilution problem

What exactly does equal addition cause?

For example, suppose Layer 30 learns that “the type information of the programming language is important in this context” and sends out a strong signal. However, until the final output is reached, the outputs from the 31st to 48th layers are added with the same weight. The signal in the 30th layer is diluted by 18 additions.

There is also the opposite problem: low-level syntactic information produced by layer 5 persists with the same weight to the very end. By the time layer 48 is performing high-level inference, unnecessary information from the early layers is mixed in as noise.

This is the dilution problem: the norm of the hidden state grows as O(L) in the number of layers L. It is especially noticeable in modern LLMs, where PreNorm (applying LayerNorm before Attention/FFN) has become standard and each layer's relative contribution becomes smaller and smaller.

Fundamental differences between regular deep learning and AttnRes

In normal deep learning (all ResNet-style architectures), information propagation between layers is statically hard-coded. Once the designer decides to add the previous layer's output with a weight of 1, that allocation never changes regardless of the input. Whether the task is a math problem, code generation, or translation, information is mixed in exactly the same even proportions.

| | Regular deep learning | AttnRes |
| --- | --- | --- |
| Information propagation between layers | Fixed (weight-1 addition) | Dynamic (softmax attention) |
| Reference range | Previous layer only | All previous layers |
| Adaptation to input | None | Weights change per input |
| Unnecessary layer information | Accumulates as-is | Suppressed by lowering its weight |
| Information from distant layers | Reachable only via intermediate layers | Directly referenced |

AttnRes dynamically determines “which previous layer’s information to use and how much” according to the input. For code generation tasks, it is possible to increase the weight of layers that are strong in syntax analysis, and for mathematical inference, it is possible to increase the weight of layers that capture logical structure.

Inspiration from the success of Transformer

There is an important analogy the paper points out: the Transformer succeeded by replacing the RNN's fixed-weight aggregation with softmax attention in the time direction.

| | Fixed-weight aggregation | Attention aggregation |
| --- | --- | --- |
| Time direction | RNN (carries the previous step with a fixed weight) | Transformer (softmax selection over all timesteps) |
| Depth direction | Residual connection (adds the previous layer with weight 1) | AttnRes (softmax selection over all previous layers) |

This is the same structure as how Transformer solved the problem of RNN’s inability to retain long-distance information using attention. The idea behind AttnRes is to use attention to overcome the limits of fixed weights even in the depth direction.

How Attention Residuals works

While a normal residual connection adds the previous layers' outputs “equally,” AttnRes combines the outputs of all previous layers by weighting them with softmax attention.

h_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot v_i

v_i is the output of layer i, and \alpha_{i \to l} is the softmax-normalized attention weight. In a normal residual connection all \alpha are effectively equal; AttnRes replaces them with learned attention weights.
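A toy numpy sketch of this combination (the layer outputs and attention logits here are random placeholders, not values from a trained model):

```python
# Toy sketch of the AttnRes combination h_l = sum_i alpha_{i->l} * v_i.
import numpy as np

rng = np.random.default_rng(0)
l, d = 4, 8
v = rng.normal(size=(l, d))           # v_0 .. v_{l-1}: previous layer outputs

logits = rng.normal(size=l)           # unnormalized attention scores
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()                  # softmax over previous layers

h_l = alpha @ v                       # weighted combination of layer outputs

# A plain residual connection is the special case of equal weights:
h_plain = v.sum(axis=0)
```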

graph TD
    subgraph AttnRes
    Y0[Input embedding] --> M1[Layer 1]
    M1 --> S1{softmax<br/>attention}
    Y0 -->|"α₀→₁"| S1
    S1 --> M2[Layer 2]
    M2 --> S2{softmax<br/>attention}
    Y0 -->|"α₀→₂"| S2
    M1 -->|"α₁→₂"| S2
    S2 --> M3[Layer 3 ...]
    M3 --> S3{softmax<br/>attention}
    Y0 -->|"α₀→₃"| S3
    M1 -->|"α₁→₃"| S3
    M2 -->|"α₂→₃"| S3
    S3 --> OUT2[Final output]
    end

The calculation formula for attention weight is as follows.

\alpha_{i \to l} = \frac{\phi(w_l, k_i)}{\sum_{j=0}^{l-1} \phi(w_l, k_j)}, \quad \phi(q, k) = \exp(q^\top \text{RMSNorm}(k))

There are two important design choices.

The query is a fixed vector w_l per layer (input-independent). In experiments with input-dependent queries, the validation loss improved slightly (1.731 vs. 1.737), but decoding then requires sequential memory access; fixed queries were adopted as a trade-off for throughput.

RMSNorm is applied to k. This prevents layers with large magnitudes from monopolizing the attention weight; removing it worsens the loss (1.743).
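Both design choices can be sketched together: a fixed query `w_l` and RMSNorm applied to each key before the dot product (a minimal numpy illustration; names and shapes are assumptions, not the released code):

```python
# Toy sketch of the weight formula: fixed per-layer query w_l,
# RMSNorm on each key k_i, softmax over previous layers.
import numpy as np

def rmsnorm(x, eps=1e-6):
    """Minimal RMSNorm stand-in (no learned gain)."""
    return x / np.sqrt((x * x).mean() + eps)

rng = np.random.default_rng(0)
l, d = 4, 8
w_l = rng.normal(size=d)              # fixed, input-independent query
keys = rng.normal(size=(l, d))        # k_0 .. k_{l-1} from previous layers

logits = np.array([w_l @ rmsnorm(k) for k in keys])
alpha = np.exp(logits - logits.max())  # numerically stable softmax
alpha /= alpha.sum()
```

Because `rmsnorm` scales every key to the same magnitude, no single layer can win the softmax simply by emitting a large-norm output.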

Comparison with past methods

The characteristics of each method can be summarized by expressing the aggregation in the depth direction as a matrix M.

| Method | Weight type | Reference range |
| --- | --- | --- |
| Residual connection (standard) | Fixed (equal) | Previous layer only |
| DenseFormer | Static scalar | All previous layers |
| mHC | Dynamic (input-dependent) | m streams |
| AttnRes Full | Dynamic (softmax) | All previous layers |
| AttnRes Block | Dynamic (softmax) | Previous layers aggregated in blocks |

DenseFormer references all previous layers but with fixed weights, and in the ablation experiment its loss (1.767) was essentially the same as the baseline (1.766). The fact that static weights (fixed regardless of input) bring no gain is the basis for the conclusion that input-dependent dynamic weighting is essential. It is also why the “fixed allocation decided at design time” of normal deep learning was a limit in the first place.

Block AttnRes: Practical efficiency

The problem with Full AttnRes is that the memory cost of holding the outputs of all previous layers and the pipeline parallel communication cost increase in proportion to the depth. Block AttnRes solves this.

Divide the L layers into N blocks (S = L/N layers each) and represent each block by the sum of its outputs, b_n. Memory and communication costs are thereby reduced from O(Ld) to O(Nd).

According to experiments, performance almost equivalent to Full AttnRes can be obtained with N≈8 blocks.
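The block bookkeeping can be sketched as follows (toy numpy; `b` stands in for the paper's block representations b_n):

```python
# Toy sketch of Block AttnRes bookkeeping: fold L per-layer outputs
# into N block sums, keeping O(N*d) vectors instead of O(L*d).
import numpy as np

rng = np.random.default_rng(0)
L, N, d = 48, 8, 16
S = L // N                            # layers per block
v = rng.normal(size=(L, d))           # per-layer outputs

# Each block is represented by the sum of its S consecutive layer outputs.
b = v.reshape(N, S, d).sum(axis=1)    # shape (N, d)
```

With L = 48 and N = 8, a later layer attends over 8 block representations plus its own block's layers, rather than all 48 individual outputs.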

Optimization during training

Pipeline parallel learning requires each stage to receive block representations from the previous stage. The paper introduced cross-stage caching, a design in which each physical stage caches the block representation of the previous virtual stage.

In a naive implementation, communication grows with the cumulative number of pipeline chunks C (O(C²)), but with caching it is compressed to the number of physical pipeline stages P (O(P²)), an improvement of V times. In steady state (1F1B schedule) it overlaps completely with computation, so the actual training overhead is under 4%.

Being able to use the standard Transformer training pipeline almost unchanged is of great practical importance; the required changes to the training code are minimal.

Optimization during inference

When decoding, Two-Phase Computation is used.

| Phase | Processing | Purpose |
| --- | --- | --- |
| Phase 1 (parallel) | Batch-compute inter-block attention for all S layers | Consolidate memory accesses into a single pass |
| Phase 2 (sequential) | Compute intra-block attention per layer and merge with Phase 1 via online softmax | Input-dependent fine-grained computation |

By reducing reads of the block representation from S times to 1, the added inference latency is kept under 2%; decoding speed is almost identical to a normal Transformer's.
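The Phase 1 / Phase 2 merge relies on the standard online-softmax identity: two partial attention results can be combined exactly by rescaling with their running maxima. A minimal numpy sketch (function names are illustrative, not the paper's kernels):

```python
# Toy sketch of merging two partial attention results (inter-block
# "Phase 1" + intra-block "Phase 2") with online softmax.
import numpy as np

def softmax_stats(logits, values):
    """Partial stats: (max logit, exp-sum, unnormalized weighted sum)."""
    m = logits.max()
    e = np.exp(logits - m)
    return m, e.sum(), e @ values

def merge(a, b):
    """Exactly combine two partial softmax results by rescaling."""
    (ma, sa, oa), (mb, sb, ob) = a, b
    m = max(ma, mb)
    ca, cb = np.exp(ma - m), np.exp(mb - m)
    return m, ca * sa + cb * sb, ca * oa + cb * ob

rng = np.random.default_rng(0)
logits = rng.normal(size=10)
values = rng.normal(size=(10, 4))

inter = softmax_stats(logits[:7], values[:7])   # "Phase 1" part
intra = softmax_stats(logits[7:], values[7:])   # "Phase 2" part
m, s, o = merge(inter, intra)
h = o / s      # identical to one softmax over all 10 positions
```

The merge is exact, so splitting the computation into phases changes memory-access order without changing the result.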

Prefilling a long context (128K tokens) would require 15GB to hold the block representations, but combining tensor-parallel sharding with chunked prefilling (16K chunks) reduces this to under 0.3GB.

Demonstration on Kimi Linear 48B

Results of integrating Block AttnRes into the Kimi Linear architecture (48B total / 3B active parameters, an MoE design similar to DeepSeek-V3) and pretraining on 1.4T tokens:

| Benchmark | Baseline | AttnRes | Improvement |
| --- | --- | --- | --- |
| GPQA-Diamond | 36.9 | 44.4 | +7.5 |
| Math | 53.5 | 57.1 | +3.6 |
| HumanEval | 59.1 | 62.2 | +3.1 |
| MMLU | 73.5 | 74.6 | +1.1 |
| BBH | 76.3 | 78.0 | +1.7 |
| C-Eval | 79.6 | 82.5 | +2.9 |
| CMMLU | 82.0 | 82.9 | +0.9 |

The improvements are particularly significant in multi-step inference (GPQA-Diamond) and code generation (HumanEval). These tasks require accurate reference to abstract representations built in deep layers, and are areas where normal Transformer uniform addition tends to dilute information. This is consistent with the hypothesis that selective information flow in the depth direction is effective.

Scaling law

In validation across five activation-parameter sizes from 194M to 528M, Block AttnRes reaches the same validation loss as the baseline with 1.25× less compute. Put differently, at the same compute budget, performance improves.

Post-training analysis reveals interesting patterns in the visualization of the learned attention weights for each layer.

  • The weight to the immediately preceding layer is highest, but selective long-distance skip connections appear. The network spontaneously learns “skip references” that are impossible with normal residual connections.
  • Layer specialization progresses with different reference patterns in pre-attention and pre-MLP. A division of labor has emerged in which the Attention layer refers to syntactic information in shallow layers, and the FFN layer refers to semantic information in deep layers.

It is also interesting that AttnRes makes networks prefer “deep and thin” architectures. In an architecture search under a fixed parameter budget, the baseline's optimum was around d_model / L_b ≈ 60, while the AttnRes optimum moved to around ≈ 45 (deeper and thinner). By improving information flow in the depth direction, the benefits of deep networks can now be drawn out.
