Accelerating LLM Inference: CDLM and Attention Matching KV Compaction
Two papers that answer the call to run LLMs cheaper and faster appeared almost at the same time. One is Together AI’s Consistency Diffusion Language Models (CDLM). The other is the MIT/Harvard team’s KV‑cache compression method, Attention Matching.
Tackling the speed problem in diffusion language models: CDLM
Autoregressive (AR) models are the mainstream for LLMs, but “diffusion language models (DLMs)” have recently emerged. DLMs simultaneously recover masked tokens from noise and can exploit bidirectional context. Dream and LLaDA are representative models. However, inference is far too slow. CDLM directly attacks this speed problem; it is available as arXiv:2511.19269.
Two reasons DLMs are slow
There are two inference bottlenecks in DLMs.
First, too many refinement steps. DLMs iteratively update masked token positions, and you need as many steps as the generation length. For example, with Dream‑7B generating 256 tokens, you pay for 256 full‑sequence recomputations.
Second, poor compatibility with KV caching. Thanks to causal masking, AR models can reuse cached results from past tokens. DLMs, however, use bidirectional attention: every token attends to every other token, so when any token is updated at a given step, the cached attention results of earlier tokens become stale. Because the cache cannot be reused, the entire context must be recomputed at every step, which is fatally slow.
CDLM design: block‑causal masking and consistency learning
CDLM addresses both issues. It uses a teacher–student framework that distills knowledge from a standard DLM with bidirectional attention (teacher) into a student model with block‑causal masking.
How block‑causal masking works
Split the target text into fixed‑length blocks (size B). Apply causal masking between blocks so each block can only attend to the prompt and to blocks that come before it; inside a block, keep bidirectional attention.
With this design, you can reuse the KV cache of completed blocks. During decoding of a new block, only tokens inside that block change; Keys/Values of past blocks remain valid. It’s the same principle as KV caching in AR models, but the unit is a “block” rather than a token.
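To make the pattern concrete, here is a minimal sketch of a block-causal attention mask in PyTorch. The function name and the tiny sizes (`prompt_len=2`, two blocks of size 2) are illustrative, not from the paper:

```python
import torch

def block_causal_mask(prompt_len: int, num_blocks: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask: True = position may be attended to.

    Every token sees the prompt; generated tokens additionally see all
    earlier blocks causally and their own block bidirectionally.
    """
    total = prompt_len + num_blocks * block_size
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :prompt_len] = True                 # everyone sees the prompt
    for b in range(num_blocks):
        start = prompt_len + b * block_size
        end = start + block_size
        # Rows of block b see the prompt, earlier blocks, and block b itself.
        mask[start:end, prompt_len:end] = True
    return mask

m = block_causal_mask(prompt_len=2, num_blocks=2, block_size=2)
# Block 0 (rows 2-3) cannot see block 1 (cols 4-5), so its KV cache
# stays valid while block 1 is being decoded.
```

The key property is visible in the mask: rows belonging to a finished block never attend to later blocks, so their Keys/Values never need recomputation.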
Three training objectives
CDLM jointly optimizes three losses.
- Distillation loss: Using logits reconstructed from the teacher DLM's (bidirectional) hidden states, make the student's predictive distribution at newly unmasked positions match the teacher's. This is white-box distillation: pass the teacher's final hidden states through lm_head and minimize the KL divergence to the teacher's distribution.
- Consistency loss: Within a block, constrain the predictive distribution at still-masked positions to match between an intermediate state y and the block-completed state y*. In other words, no matter which step inside the block you predict at, the prediction should match the one at block completion. This enables quality comparable to the block-completed state in fewer steps.
- DLM loss: The usual masked‑language‑modeling cross‑entropy against the ground‑truth text to preserve overall language‑modeling performance.
The final loss is the weighted sum L = w_distill * L_Distillation + w_cons * L_Consistency + w_dlm * L_DLM.
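The weighted sum can be sketched as follows. This is a schematic under my own assumptions (random logits, both KL terms computed as in standard white-box distillation); the paper's exact position-selection and weighting details may differ:

```python
import torch
import torch.nn.functional as F

def cdlm_loss(student_logits, teacher_logits, y_logits, y_star_logits,
              targets, mask_positions, w_distill=1.0, w_cons=1.0, w_dlm=1.0):
    """Weighted sum of the three CDLM objectives.

    All logits are [N, vocab] at the relevant positions; names are illustrative.
    """
    # Distillation: KL(teacher || student) at newly unmasked positions.
    l_distill = F.kl_div(F.log_softmax(student_logits, -1),
                         F.log_softmax(teacher_logits, -1),
                         log_target=True, reduction="batchmean")
    # Consistency: intermediate-state predictions (y) should match the
    # block-completed state's predictions (y*) at still-masked positions.
    l_cons = F.kl_div(F.log_softmax(y_logits, -1),
                      F.log_softmax(y_star_logits, -1),
                      log_target=True, reduction="batchmean")
    # DLM loss: masked-LM cross-entropy against the ground-truth tokens.
    l_dlm = F.cross_entropy(student_logits[mask_positions], targets[mask_positions])
    return w_distill * l_distill + w_cons * l_cons + w_dlm * l_dlm

# Smoke example with random tensors (shapes only; not real model outputs).
torch.manual_seed(0)
N, V = 4, 10
loss = cdlm_loss(torch.randn(N, V), torch.randn(N, V),
                 torch.randn(N, V), torch.randn(N, V),
                 torch.randint(0, V, (N,)),
                 torch.tensor([True, True, False, True]))
```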
Trajectory collection
Build training data from the teacher DLM’s inference trajectories. For each prompt, have the teacher decode block by block and record the input states at each step and the final hidden states. Vary the temperature to generate multiple trajectories and secure diversity. In DLMs, the temperature also affects the order in which tokens get fixed, so data augmentation tends to be even more effective than for AR models.
Parallel decoding with confidence thresholds at inference time
At inference time, fix multiple high‑confidence tokens in parallel within each block. Thanks to the consistency loss, even intermediate states inside a block yield predictions close to those at block completion, so we don’t need to finalize one token at a time. This directly reduces the number of steps.
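A minimal sketch of one such parallel step is below. The threshold rule (finalize everything above `tau`, fall back to the single most confident position so the loop always progresses) is a common scheme and an assumption here, not necessarily the paper's exact schedule:

```python
import torch

def parallel_unmask_step(logits, is_masked, tau=0.9):
    """One decoding step inside a block: finalize every still-masked
    position whose top-1 probability exceeds tau. If none qualifies,
    commit only the single most confident masked position."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)              # [block_size]
    commit = is_masked & (conf >= tau)
    if not commit.any():
        masked_conf = torch.where(is_masked, conf, torch.full_like(conf, -1.0))
        commit[masked_conf.argmax()] = True
    return pred, commit

# Positions 0 and 2 are confident and get finalized in the same step;
# position 1 stays masked for a later step.
logits = torch.tensor([[9.0, 0.0, 0.0],
                       [0.5, 0.4, 0.3],
                       [0.0, 8.0, 0.0]])
pred, commit = parallel_unmask_step(logits, torch.tensor([True, True, True]))
```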
Benchmarks
Using Dream‑7B‑Instruct as the base, the paper reports the following (Table 1):
| Benchmark | Method | Latency | Steps | Score |
|---|---|---|---|---|
| GSM8K‑CoT | Dream‑7B (baseline) | 23.5s (x1.0) | 256.0 (x1.0) | 79.1 |
| GSM8K‑CoT | CDLM‑Dream | 2.1s (x11.2) | 44.1 (x5.8) | 78.8 |
| HumanEval‑Instruct | Dream‑7B (baseline) | 13.4s (x1.0) | 256.0 (x1.0) | 48.2 |
| HumanEval‑Instruct | CDLM‑Dream | 2.2s (x6.1) | 49.6 (x5.2) | 50.0 |
| MBPP‑Instruct | Dream‑7B (baseline) | 21.7s (x1.0) | 256.0 (x1.0) | 51.8 |
| MBPP‑Instruct | CDLM‑Dream | 1.5s (x14.5) | 33.2 (x7.7) | 53.0 |
On GSM8K‑CoT, latency improves 11.2× with almost no degradation (score 79.1 → 78.8). On MBPP‑Instruct, the score even improves (51.8 → 53.0) while achieving a 14.5× speedup. The benefit is especially large for coding tasks, likely because parallel token finalization within a block works well for structured text like code.
Compared with prior DLM accelerations (dLLM‑Cache, Fast‑dLLM), CDLM beats them in both latency and throughput. The gap versus Fast‑dLLM (parallel decoding + dual cache) is particularly notable, reflecting the advantage of full KV‑cache compatibility brought by block‑causal masking.
Limitations
The paper’s Discussion section notes a few constraints.
- CDLM is a post‑training method and requires collecting the teacher model’s inference trajectories in advance, which costs non‑trivial compute/time.
- Because block‑causal masking limits the range of bidirectional context, the student is theoretically less expressive than the fully bidirectional teacher.
- On the MATH benchmark, the score drops from 38.0 to 32.4; aggressive step reduction tends to hurt advanced mathematical reasoning.
- It is also applied to LLaDA‑8B, but the gains are smaller than on Dream‑7B, suggesting dependence on the base model.
Reference: Consistency Diffusion Language Models / arXiv:2511.19269
Compressing the KV cache more directly: Attention Matching
In a completely different direction from CDLM, the MIT/Harvard team’s Attention Matching (arXiv:2602.16284) targets AR models handling long contexts: it cuts inference cost by compressing the KV cache, the dominant memory bottleneck. Code is available at https://github.com/adamzweiger/compaction.
Problem setup for KV‑cache compression
Transformers cache each token’s Key/Value pair and reuse them in subsequent attention computations. As the context gets longer, cache size grows linearly; processing very long contexts (100K+ tokens) strains GPU memory.
Prior compression approaches fall into three groups.
- Token selection: H2O, SnapKV, PyramidKV, KVzip, etc. Keep only tokens deemed important by attention scores. Quality collapses quickly at high compression ratios.
- Token merging: e.g., CaM. Merge KV pairs of nearby tokens.
- Optimization in a latent space: e.g., Cartridges. Learn a compact KV cache end‑to‑end. Quality is high, but compressing a single context can take GPU‑hours.
“summarization can be highly lossy, substantially harming downstream performance” — reducing KV via text summarization changes attention distributions and hurts downstream accuracy. Cartridges can maintain quality even at 50× compaction, but gradient‑based optimization is too slow for practical use.
Why Keys and Values can be compressed separately
You don’t need to treat Keys and Values as an inseparable pair when compacting the KV cache. This stems from how attention is computed.
Attention proceeds in two stages:
- Score computation: Take the inner product of query Q and keys K to obtain attention scores and normalize with softmax. Values do not participate here.
- Output computation: Take the weighted sum of Values using the normalized attention weights to obtain the final output. Keys do not participate here.
Keys play an index‑like role that decides “which tokens to attend to,” while Values carry the content extracted from the attended locations. Although these two roles are connected serially in the computation graph, they are independent variables, so we can apply separate compression strategies.
This structural separation underlies Attention Matching’s decomposition into two subproblems for C_k and C_v. Decide C_k (and bias β) first to fix the attention weights, then solve for C_v by least squares — the sequential solution works because K and V are independent in the computation.
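The two-stage structure is easy to verify numerically. In this sketch (random tensors, single head), the scores are computed from K alone and the output from V alone, and the result matches PyTorch's fused attention:

```python
import torch

torch.manual_seed(0)
d, T = 8, 16
q = torch.randn(1, d)
K = torch.randn(T, d)
V = torch.randn(T, d)

# Stage 1: scores depend only on q and K (V never appears).
scores = (q @ K.T) / d ** 0.5
weights = torch.softmax(scores, dim=-1)

# Stage 2: output depends only on the weights and V (K never appears).
out = weights @ V

# Matches the fused computation, confirming the stages are separable.
ref = torch.nn.functional.scaled_dot_product_attention(
    q.unsqueeze(0), K.unsqueeze(0), V.unsqueeze(0)).squeeze(0)
```

Because V is absent from stage 1 and K from stage 2, a compression error in K only perturbs the weights, while an error in V perturbs the output directly. This is exactly the asymmetry the next section describes.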
Compression tolerance is not symmetric between K and V
Beyond being separable, K and V differ markedly in how much compression they tolerate. Multiple 2024 papers report this phenomenon.
Key is robust to low‑precision quantization.
KIVI (arXiv:2402.02750) quantizes both Keys and Values down to INT2 and observes that quantizing Keys degrades quality less than quantizing Values. KVQuant (arXiv:2401.18079) reports the same and shows per‑channel quantization works well for Keys; even INT4/INT2 causes relatively small perplexity increases.
The reason lies in the distributional properties of Keys. With RoPE (rotary positional embeddings), key vectors tend to have highly skewed ranges per channel. That seems hostile to quantization at first, but per‑channel quantization (a separate scale per channel) helps: within each channel the range is narrow, so few bits suffice. Also, attention scores are inner products before softmax; as long as ranking (which tokens are top) is preserved, softmax absorbs small absolute deviations.
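The per-channel effect can be demonstrated with a toy experiment. Here the synthetic key matrix has deliberately skewed per-channel ranges (a stand-in for RoPE'd keys; the data and bit-width are my assumptions, not from any of the cited papers):

```python
import torch

def quantize_per_channel(K, bits=4):
    """Symmetric INT quantization with one scale per channel (column)."""
    qmax = 2 ** (bits - 1) - 1
    scale = K.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / qmax
    return (K / scale).round().clamp(-qmax - 1, qmax) * scale

def quantize_per_tensor(K, bits=4):
    """Same scheme with a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = K.abs().max() / qmax
    return (K / scale).round().clamp(-qmax - 1, qmax) * scale

torch.manual_seed(0)
# 128 tokens x 64 channels, channel ranges spread from 0.1 to 5.0.
K = torch.randn(128, 64) * torch.linspace(0.1, 5.0, 64)
err_channel = (quantize_per_channel(K) - K).abs().mean()
err_tensor = (quantize_per_tensor(K) - K).abs().mean()
# Per-channel error is much smaller when ranges differ strongly across
# channels: a single global scale wastes its bits on the widest channel.
```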
Value is sensitive to compression.
By contrast, quantizing or compressing Values impacts output quality directly. Values are aggregated into the layer output via a weighted sum by the attention weights, so Value errors propagate directly; there’s no normalization like softmax to absorb them.
KIVI, KVQuant, and GEAR (arXiv:2403.05527) all report that reducing Value bit‑width hurts more than doing so for Keys. GEAR addresses this via a hybrid strategy: low‑rank approximation + uniformly quantized residual + explicit outlier preservation. QServe (arXiv:2405.04532) similarly uses INT4 for Keys while applying more cautious schemes to Values when quantizing the KV cache.
Connection to Attention Matching.
This asymmetry naturally aligns with Attention Matching’s three‑stage decomposition. Selecting C_k (step 1) and fitting bias β (step 2) reconstruct the attention distribution; here what matters is preserving the ranking. Fitting C_v (step 3) reconstructs the output values and directly approximates the layer output via least squares. Because Values map directly to quality, using a closed‑form least‑squares fit is well‑motivated.
Formulation of Attention Matching
Attention Matching casts compaction as satisfying two conditions: reproducing the attention output and preserving attention mass.
We want to replace the original KV cache (K, V) with a smaller (C_k, C_v), compressing T tokens down to t (t << T). For any query q, the attention output computed from the compacted KV should reproduce the output from the original KV.
We impose two conditions:
- Attention output matching: The local attention output (softmax‑weighted sum) computed from the compacted KV matches the output computed from the original KV.
- Attention mass matching: The sum of exp‑scores (Mass) over the compacted KV block matches that over the original KV block.
Why mass preservation? Because the compacted KV is concatenated with other KV blocks (recent conversation turns, future generated tokens, etc.). When we concatenate, the attention weight allocated to each block is determined by the relative mass of each block; if the mass shifts, the allocation to the compacted block is distorted.
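This allocation rule is a direct consequence of the softmax over concatenated scores: a block's total attention weight equals its mass divided by the total mass. A small NumPy check (random vectors, made-up block sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K_a = rng.normal(size=(10, 8))   # the block we intend to compact
K_b = rng.normal(size=(6, 8))    # another block, e.g. recent turns

mass_a = np.exp(K_a @ q).sum()
mass_b = np.exp(K_b @ q).sum()

# Softmax over the concatenated scores: block A's total attention
# weight is exactly its relative mass.
scores = np.concatenate([K_a @ q, K_b @ q])
weights = np.exp(scores) / np.exp(scores).sum()
share_a = weights[:10].sum()
# share_a == mass_a / (mass_a + mass_b); if compaction shrank mass_a
# without correction, block A would receive too little attention.
```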
Introduce a scalar bias
When t < T, Keys alone cannot satisfy mass matching. For example, with q = 0, Mass(0; K) = T but Mass(0; C_k) = t; no choice of C_k can make it T.
So we introduce a scalar bias β for each compacted key, modifying the attention score to q * C_k^T + β. Let β_j = log(w_j), where w_j indicates “how many original keys’ mass this compact key represents.” The memory overhead is tiny — just (2d+1)/(2d)× — and this is supported by PyTorch SDPA and FlexAttention.
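In PyTorch the bias slots directly into the existing attention kernel, because a float `attn_mask` is added to the scores. A sketch with made-up weights w (note SDPA also applies the usual 1/sqrt(d) scaling, which the text's q * C_k^T notation omits):

```python
import torch

torch.manual_seed(0)
d, t = 8, 4
q   = torch.randn(1, 1, 1, d)      # [batch, heads, queries, head_dim]
C_k = torch.randn(1, 1, t, d)      # compacted keys
C_v = torch.randn(1, 1, t, d)      # compacted values
w    = torch.tensor([3.0, 1.0, 5.0, 2.0])  # mass each compact key represents (made up)
beta = torch.log(w).view(1, 1, 1, t)       # one scalar bias per compacted key

# A float attn_mask is added to the scaled scores: q @ C_k^T / sqrt(d) + beta.
out = torch.nn.functional.scaled_dot_product_attention(q, C_k, C_v, attn_mask=beta)

# The same computation written out by hand:
scores = q @ C_k.transpose(-1, -2) / d ** 0.5 + beta
manual = torch.softmax(scores, dim=-1) @ C_v
```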
Fast compaction with closed‑form solutions
You don’t need end‑to‑end gradient descent to perform compaction. Decompose the optimization into three subproblems and solve them in sequence.
- Choose compacted keys C_k: either select important keys from the originals or greedily pick them using Orthogonal Matching Pursuit (OMP).
- Fit the bias β: once C_k is fixed, the mass‑matching condition becomes a non‑negative least‑squares (NNLS) problem with A_ij = exp(q_i * C_k_j^T) and target vector m being the original mass. Solve the constrained least squares with w_j ≥ 0 and set β_j = log(w_j); standard NNLS routines handle this quickly.
- Fit compacted values C_v: the attention‑output‑matching condition reduces to ordinary least squares. With the compacted attention weights fixed, find C_v so that the weighted sum reproduces the original output. This also has a closed‑form solution.
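The three steps can be sketched end to end on random data. Everything here is simplified for illustration: key selection just keeps the t highest-mass keys instead of OMP, a multiplicative-update loop stands in for a proper NNLS solver, and scores use the unscaled q * k^T form from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, t, n_q = 8, 64, 8, 32
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))
Q = rng.normal(size=(n_q, d))                # reference queries (random stand-ins)

E = np.exp(Q @ K.T)                          # [n_q, T] exp-scores on original keys

# Step 1 (simplified): keep the t keys with the highest total mass.
idx = np.argsort(E.sum(axis=0))[-t:]
C_k = K[idx]

# Step 2: fit w >= 0 so compacted mass matches original mass, beta = log(w).
A = np.exp(Q @ C_k.T)                        # [n_q, t]
m = E.sum(axis=1)                            # original mass per query
w = np.ones(t)
for _ in range(500):                         # multiplicative NNLS updates
    w *= (A.T @ m) / (A.T @ (A @ w) + 1e-12)
beta = np.log(np.maximum(w, 1e-12))

# Step 3: with the compacted attention weights fixed, fit C_v by ordinary
# least squares so the weighted sum reproduces the original output.
S = A * np.exp(beta)                         # biased, unnormalized weights
P = S / S.sum(axis=1, keepdims=True)         # [n_q, t] compacted attention
target = (E / E.sum(axis=1, keepdims=True)) @ V   # original attention output
C_v, *_ = np.linalg.lstsq(P, target, rcond=None)  # [t, d]
```

Steps 2 and 3 involve no backpropagation through the model, which is why compaction takes minutes rather than the GPU-hours of gradient-based methods like Cartridges.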
They also propose several ways to obtain reference queries Q_ref:
- Repeat‑prefill: re‑input the context together with a “repeat” instruction and extract the queries that arise during reconstruction.
- Self‑study: generate synthetic QA on the context and use those query vectors.
- On‑policy: compress layers sequentially and extract queries at each layer while previous layers are already compacted, reducing distribution shift.
Variants
The paper evaluates four variants with different speed/quality trade‑offs.
- AM‑OMP: on‑policy queries (self‑study + repeat‑prefill), OMP for key selection, NNLS for bias, and least squares for values. Highest quality, slowest.
- AM‑OMP‑fast: a faster AM‑OMP — pick 4 keys per OMP iteration; refit bias every other iteration.
- AM‑HighestAttnKeys: on‑policy queries, select keys by top attention scores, then NNLS + LS. Fast.
- AM‑HighestAttnKeys‑fast: queries from repeat‑prefill only. Fastest.
Results
On Qwen3‑4B, Llama3.1‑8B, and Gemma3‑12B, they evaluate on QuALITY (long‑form QA, 5–8K tokens) and LongHealth (clinical QA, 60K tokens).
For 50× compaction on Qwen3‑4B (from Figure 1):
- AM‑OMP matches Cartridges in accuracy while being orders of magnitude faster to compact (minutes vs hours).
- Token‑selection baselines like H2O, SnapKV, and PyramidKV degrade heavily at 50×.
- Summarization baselines degrade similarly.
- AM methods form the Pareto frontier.
Varying compression ratio from 20× to 100× (Figure 3) shows the advantage of Attention Matching especially in the high‑compression regime (≥20×). On dense datasets like LongHealth all methods degrade earlier, but the relative ranking stays the same.
Breakdown of compaction time (Gemma3‑12B, 60K tokens, H200 GPU, from Table 1):
- context‑prefill: 7 s
- repeat‑prefill: 8 s
- self‑study: 139 s
- OMP compaction: minutes
Self‑study is the most time‑consuming, but even without it (AM‑HighestAttnKeys‑fast) you get practical quality.
Gemma3‑12B uses a hybrid architecture with sliding‑window attention (local in five layers, global in one). By compacting only the global‑attention layers you can obtain the same improvements.
Limitations
- You still need to prefill the entire target context once in the normal way. Compaction itself is fast, but prefill cost remains.
- Assumes RoPE positional embeddings have already been applied to the compacted keys. The compacted cache retains logical length T, which preserves the correct RoPE phase for subsequent tokens.
- Compute compacted keys/values in FP32 and cast to BF16. Combination with quantization is untested.
- Currently designed for offline (batch) compaction. Streaming/online compaction is only briefly discussed in the appendix and is not a primary evaluation target.
Paper: Fast KV Compaction via Attention Matching / Adam Zweiger, Xinghong Fu, Han Guo, Yoon Kim / February 18, 2026 / https://arxiv.org/abs/2602.16284