Qwen3.6-35B-A3B pairs Gated DeltaNet with MoE and raises the bar on agentic coding
On April 14, 2026, Alibaba’s Qwen team released Qwen3.6-35B-A3B under the Apache 2.0 license as open weights.
It keeps the previous MoE recipe of 35B total parameters with 3B active during inference, but swaps the attention stack for a Gated DeltaNet / Gated Attention hybrid.
What changed in the architecture
Qwen3.5-35B-A3B, which I covered in the earlier llama-server benchmark post, was an SSM + Attention hybrid with only 10 Attention layers out of 40.
In Qwen3.6, the SSM portion is replaced by Gated DeltaNet, a variant of linear attention.
The 40-layer layout: ten four-layer blocks stacked in sequence, each consisting of three Gated DeltaNet layers plus one Gated Attention layer (standard transformer-style attention).
Component specs:
| Component | Parameters |
|---|---|
| Gated DeltaNet heads | Q/K: 16, V: 32 (head dim 128) |
| Gated Attention heads | Q: 16, KV: 2 (GQA, head dim 256) |
| MoE experts | 256 (8 routed + 1 shared) |
| Expert intermediate dim | 512 |
| Hidden dim | 2048 |
The expert count stays at 256, but now there is 1 shared expert always on top of the 8 routed experts.
Every token hits a common pool of knowledge, and then gets routed to 8 of the 256 experts.
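The routing can be sketched in a few lines. This is a scaled-down toy (hidden 64, 16 experts, top-4 instead of 2048 / 256 / top-8) to show the shape of shared-expert routing, not Qwen's actual router, which may score and normalize differently:

```python
import numpy as np

def moe_forward(x, gate_w, routed_experts, shared_expert, top_k=4):
    logits = gate_w @ x                          # one router score per routed expert
    top = np.argsort(logits)[-top_k:]            # indices of the best-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                 # softmax over the selected experts only
    out = shared_expert(x)                       # shared expert is always on
    for weight, idx in zip(w, top):
        out = out + weight * routed_experts[idx](x)
    return out

rng = np.random.default_rng(0)
hidden, n_experts = 64, 16
gate_w = rng.standard_normal((n_experts, hidden)) * 0.1
experts = [
    (lambda v, W=rng.standard_normal((hidden, hidden)) * 0.05: W @ v)
    for _ in range(n_experts)
]
shared = lambda v: 0.1 * v                       # stand-in for the always-on shared expert
y = moe_forward(rng.standard_normal(hidden), gate_w, experts, shared)
print(y.shape)  # (64,)
```

The shared expert's output is added unconditionally, before the weighted routed-expert sum, which is what "a common pool of knowledge" means operationally.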
Attention, briefly
Attention is the mechanism that dynamically decides which tokens in the context the current token should look at, and how much. Given a query derived from the current token and a key for each other token, it computes relevance scores and uses them to mix the value vectors into the next representation.
Standard transformers implement this with softmax attention. Each token looks at every other token precisely, which is accurate but blows up in compute and memory as the context grows.
Linear attention rewrites this as an approximation plus state updates, so long contexts no longer explode; it trades precision for efficiency at long context.
The catch is that a plain linear attention just accumulates information into a state. It has no real control over what to keep and what to forget. That is the gap DeltaNet-style methods fill.
What Gated DeltaNet actually is
Gated DeltaNet is linear attention plus a gating mechanism on top of DeltaNet. Where plain linear attention keeps adding information into the state, DeltaNet-style updates write “the difference between the current prediction and the new value.”
On top of that, Gated DeltaNet can tune how much of the past state to keep and how aggressively to write new information. When the context shifts, old information gets attenuated while still-relevant information stays.
But for cases where you really want precise attention across the whole context, Gated Attention steps in. Qwen3.6 mixes the two at a 4:1 ratio to balance efficiency at long context with local precision.
The KV cache savings we saw with the Mamba-style SSM in Qwen3.5 carry over to Gated DeltaNet.
With Qwen3.5, only 10 out of 40 layers used KV cache; going from ctx-size 4096 to 65536 only added 800 MB of VRAM.
Qwen3.6 also has Gated Attention on just 10 of 40 layers, so KV cache stays in the same ballpark.
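A back-of-envelope estimate using the spec-table values (10 Gated Attention layers, 2 KV heads of head dim 256, BF16) shows why KV growth stays modest. Exact numbers depend on the serving stack, and Qwen3.5's reported ~800 MB delta implies somewhat smaller per-layer KV there, so treat this as an order-of-magnitude sketch:

```python
def kv_cache_bytes(ctx_len, n_layers=10, n_kv_heads=2, head_dim=256, dtype_bytes=2):
    # K and V each store ctx_len * n_kv_heads * head_dim values per attention layer
    return ctx_len * n_layers * n_kv_heads * head_dim * 2 * dtype_bytes

delta = kv_cache_bytes(65536) - kv_cache_bytes(4096)
print(f"{delta / 2**20:.0f} MiB")  # → 1200 MiB
```

With all 40 layers using full attention, the same growth would be 4x larger, which is the saving the hybrid layout buys.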
Benchmarks
First, Qwen3.5-35B-A3B vs Qwen3.6-35B-A3B.
| Benchmark | Qwen3.5-35B-A3B | Qwen3.6-35B-A3B |
|---|---|---|
| SWE-bench Verified | 70.0 | 73.4 |
| SWE-bench Multilingual | 60.3 | 67.2 |
| SWE-bench Pro | 44.6 | 49.5 |
| Terminal-Bench 2.0 | 40.5 | 51.5 |
| QwenClawBench | 47.7 | 52.6 |
| NL2Repo | 20.5 | 29.4 |
| QwenWebBench | 978 | 1397 |
Terminal-Bench 2.0 is up 27% and QwenWebBench is up 43%. The agentic-coding side of the benchmark suite sees the biggest gains.
Against Gemma 4-31B, MCPMark (tool use) is 37.0 vs 18.1, more than double.
SWE-bench Verified is 73.4 vs 52.0. A 3B-active MoE is beating an always-on 31B dense model by a clear margin.
The MCP-first agent design shows up directly in the numbers.
Knowledge and reasoning scores, for reference:
| Benchmark | Score |
|---|---|
| MMLU-Pro | 85.2 |
| GPQA Diamond | 86.0 |
| AIME 2026 | 92.7 |
| MMMU | 81.7 |
It is also multimodal. Image, video, and document processing benchmarks are reported (OmniDocBench 1.5: 89.9, VideoMMU: 83.7).
Thinking Preservation
A new feature in Qwen3.6 is Thinking Preservation.
It lets the model carry thinking tokens across turns in a multi-turn conversation.
Typical reasoning models strip the previous turn's thinking from the chat template, so each turn starts from scratch. Qwen3.6 keeps the reasoning from the previous assistant turn in context and reuses it in the next turn.
For long coding sessions or repo-wide refactors, you can keep going without throwing away the intermediate reasoning.
```python
from openai import OpenAI

# Assumes an OpenAI-compatible endpoint (e.g. a local SGLang/vLLM server)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [
    {"role": "user", "content": "Refactor this module and keep your reasoning."},
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"preserve_thinking": True},
    },
)
```
Pass `preserve_thinking: True` and it's on. The same flag works through the agent framework (Qwen-Agent) when combined with MCP servers.
Serving
Recommended frameworks are SGLang (≥0.5.10) and vLLM (≥0.19.0). SGLang deployment example:
```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tp-size 8 \
  --mem-fraction-static 0.8 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder
```
`--tool-call-parser qwen3_coder` handles MCP tool call parsing.
Native context length is 262,144 tokens, and YaRN scaling can push it up to roughly 1M tokens.
Multi-Token Prediction (MTP) is also supported. SGLang speculative decoding example:
```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tp-size 8 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --speculative-algo NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```
Rough hardware requirements:
| Precision | VRAM |
|---|---|
| FP16/BF16 | ~70GB |
| INT8 quantization | ~35GB |
| INT4 GPTQ | ~18GB |
| GGUF (CPU inference) | ~20GB RAM |
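These figures are roughly parameter count times bytes per weight: all 35B parameters must be resident even though only 3B are active per token. A quick check (ignoring activations, KV cache, and runtime overhead):

```python
params = 35e9  # total parameters; MoE activates 3B but stores all 35B
for name, bits in [("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")
# FP16/BF16: ~70 GB
# INT8: ~35 GB
# INT4: ~18 GB
```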
Techniques like GGUF on AMD ROCm and SSD-streamed inference à la Flash-MoE should carry over to Qwen3.6 as well.
Gated DeltaNet kernels are new, though, so backend support needs to be checked case by case.
Qwen-Agent and MCP integration
There’s an official integration example with the Qwen-Agent library for agentic use cases.
Pass MCP servers into function_list and tool calling just works.
```python
import os

from qwen_agent.agents import Assistant

llm_cfg = {
    'model': 'Qwen3.6-35B-A3B',
    'model_type': 'qwenvl_oai',
    'model_server': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    'api_key': os.getenv('DASHSCOPE_API_KEY'),
    'generate_cfg': {
        'use_raw_api': True,
        'extra_body': {
            'enable_thinking': True,
            'preserve_thinking': True,
        },
    },
}

tools = [
    {'mcpServers': {
        "filesystem": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/workspace"]
        }
    }}
]

bot = Assistant(llm=llm_cfg, function_list=tools)
```
Passing both `enable_thinking: True` and `preserve_thinking: True` is the officially recommended setup.
Sampling parameters
Precise coding tasks and general tasks have different recommended values.
| Mode | temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (general) | 1.0 | 0.95 | 20 | 1.5 |
| Thinking (precise coding) | 0.6 | 0.95 | 20 | 0.0 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 1.5 |
For precise coding or competitive programming, `temperature=0.6` with `presence_penalty=0.0` keeps the output focused and repeatable.
The math behind linear attention
From here I’ll go through the math. It helps explain where Qwen3.6’s efficiency comes from.
The O(n²) problem in softmax attention
Standard transformer softmax attention uses $O(n^2)$ compute and memory for a context length $n$:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

Here $Q$ (query) is the side doing the lookup, $K$ (key) is what's looked up, and $V$ (value) is what gets pulled out. With $Q, K, V \in \mathbb{R}^{n \times d}$ as matrices, $QK^\top$ produces an $n \times n$ attention matrix internally. It's quadratic in context length, which has been the long-standing bottleneck.
Going O(n) with linear attention
Linear attention approximates softmax with a kernel feature map $\phi$ and swaps the order of matrix multiplications.

Original:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

Approximation:

$$\mathrm{Attention}(Q,K,V) \approx \phi(Q)\left(\phi(K)^\top V\right)$$

Compute $\phi(K)^\top V$ first and you get a $d \times d$ state matrix; then for each query, you just multiply by it. No $n \times n$ attention matrix is ever built, so compute drops to $O(n\,d^2)$, linear in $n$. The longer the context, the bigger the gap. The tradeoff is that dropping softmax normalization costs precision, but scaling stays linear in sequence length.
Rewrite the same equation as a recurrence over time, and the RNN-like structure of linear attention becomes visible:

$$S_t = S_{t-1} + v_t\,\phi(k_t)^\top, \qquad o_t = S_t\,\phi(q_t)$$

Per token, accumulate the outer product of $v_t$ and $\phi(k_t)$ into the state, then multiply by the query to get the output. Structurally, linear attention is equivalent to a linear RNN that writes rank-1 updates into its state. This is also why "no KV cache is needed" holds: the state matrix $S_t$ is fixed-size at $d \times d$, independent of sequence length.
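The recurrent and parallel views can be checked against each other in a few lines of NumPy. This is a toy sketch with an assumed ReLU-style feature map, not any production kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 8
phi = lambda x: np.maximum(x, 0.0) + 1e-3      # toy positive feature map (an assumption)
q = phi(rng.standard_normal((n, d)))           # queries/keys already feature-mapped
k = phi(rng.standard_normal((n, d)))
v = rng.standard_normal((n, d))

# Recurrent form: fixed d x d state, one rank-1 update per token -- O(n d^2) total
S = np.zeros((d, d))
out_rec = np.empty((n, d))
for t in range(n):
    S = S + np.outer(v[t], k[t])               # accumulate v_t k_t^T into the state
    out_rec[t] = S @ q[t]                      # read out with the query

# Parallel causal form: materializes the n x n score matrix -- O(n^2 d)
scores = np.tril(q @ k.T)                      # lower-triangular = causal mask
out_par = scores @ v

print(np.allclose(out_rec, out_par))           # True: same outputs, no n x n matrix built
```

The two paths compute identical outputs; only the recurrent one keeps memory flat as `n` grows.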
The overwrite mechanism DeltaNet added
Plain linear attention has a weakness. Since it only “adds” to the state matrix, there’s no way to forget.
As more tokens come in, the state fills up, and old memories get crushed by new ones.
DeltaNet replaced that addition with the delta rule, the same one used in perceptron learning:

$$S_t = S_{t-1} + \beta_t\left(v_t - S_{t-1} k_t\right) k_t^\top$$

It writes only the difference between the new value $v_t$ and the prediction $S_{t-1} k_t$ from the current state, so the state stays consistent while being updated.
DeltaNet’s contribution is that it keeps the linear-attention compute while acting like an associative memory.
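As a sanity check on the associative-memory claim, here is a toy NumPy run of the delta rule, with the write strength fixed to 1 and unit-norm keys. These are illustrative assumptions, not the trained model's parametrization:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
k = rng.standard_normal((4, d))
k /= np.linalg.norm(k, axis=1, keepdims=True)   # unit-norm keys
v = rng.standard_normal((4, d))

S = np.zeros((d, d))
for t in range(4):
    pred = S @ k[t]                              # what the memory currently returns
    S = S + np.outer(v[t] - pred, k[t])          # delta rule: write only the error

# Re-writing key 0 with a new value replaces the old association
# instead of piling on top of it, as pure addition would.
v_new = rng.standard_normal(d)
S = S + np.outer(v_new - S @ k[0], k[0])
print(np.allclose(S @ k[0], v_new))              # True: exact read-back after overwrite
```

With plain additive updates (`S += np.outer(v_new, k[0])`), the old and new values for the same key would be superimposed; the delta rule cancels the stale entry first.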
Gated DeltaNet: adding a forget gate
Gated DeltaNet extends DeltaNet with a forget gate $\alpha_t$. The update roughly looks like this:

$$S_t = \alpha_t\,S_{t-1}\left(I - \beta_t\,k_t k_t^\top\right) + \beta_t\,v_t k_t^\top$$

- Small $\alpha_t$ attenuates the past state; large $\alpha_t$ preserves it
- $\beta_t$ controls how aggressively new information is written
So at a per-token level, you can “forget the past when the context shifts” and “keep information that’s still useful.”
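A minimal NumPy sketch of that gated update, assuming scalar per-step gates and unit-norm keys (illustrative only, not the real kernel):

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update: S <- alpha * S (I - beta k k^T) + beta v k^T."""
    d = len(k)
    return alpha * S @ (np.eye(d) - beta * np.outer(k, k)) + beta * np.outer(v, k)

rng = np.random.default_rng(3)
d = 8
k1 = rng.standard_normal(d); k1 /= np.linalg.norm(k1)
v1 = rng.standard_normal(d)

# alpha=1, beta=1 reduces to plain DeltaNet: exact write, exact read-back
S = gated_delta_step(np.zeros((d, d)), k1, v1, alpha=1.0, beta=1.0)
print(np.allclose(S @ k1, v1))                   # True

# alpha=0.5 with nothing new written: the old association halves every step
for _ in range(10):
    S = gated_delta_step(S, np.zeros(d), np.zeros(d), alpha=0.5, beta=0.0)
print(np.linalg.norm(S @ k1) / np.linalg.norm(v1))  # 0.5 ** 10 ≈ 0.00098
```

The gate is what lets the state "forget the past when the context shifts" while still-relevant entries can be kept by holding $\alpha_t$ near 1.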
Qwen3.6 pairs this with Gated Attention at 4:1 because linear approximation tends to lose precise long-range dependencies, and softmax attention covers that.
For tasks that juggle both local syntactic references and wide repo-level context — coding is the obvious example — this mix earns its keep.
Difference from Mamba-style SSMs
The Mamba-style SSM used up to Qwen3.5 also updates a state-space via recurrence, which looks similar to Gated DeltaNet.
But SSMs are discretizations of continuous-time linear systems. Their state transitions are mostly linear transforms by a (typically diagonal) state matrix $A$.
Gated DeltaNet uses outer products via the delta rule to do rank-1 state updates, which makes associative memory writes and rewrites explicit.
Both can handle long context without a KV cache. The likely reason Qwen3.6 switched away from SSM is that the “correctly reference past instructions and intermediate results” capability that matters for agentic coding is easier to get out of delta-rule-based updates.