Qwen3.6-35B-A3B pairs Gated DeltaNet with MoE and raises the bar on agentic coding
On April 14, 2026, Alibaba’s Qwen team released Qwen3.6-35B-A3B under the Apache 2.0 license as open weights.
It keeps the previous MoE recipe of 35B total parameters with 3B active during inference, but swaps the attention stack for a Gated DeltaNet / Gated Attention hybrid.
What changed in the architecture
Qwen3.5-35B-A3B, which I covered in the earlier llama-server benchmark post, was an SSM + Attention hybrid with only 10 Attention layers out of 40.
In Qwen3.6, the SSM portion is replaced by Gated DeltaNet, a variant of linear attention.
The 40-layer layout: ten four-layer blocks stacked in sequence, each consisting of three Gated DeltaNet layers plus one Gated Attention layer (standard transformer-style attention).
Component specs:
| Component | Parameters |
|---|---|
| Gated DeltaNet heads | Q/K: 16, V: 32 (head dim 128) |
| Gated Attention heads | Q: 16, KV: 2 (GQA, head dim 256) |
| MoE experts | 256 (8 routed + 1 shared) |
| Expert intermediate dim | 512 |
| Hidden dim | 2048 |
The expert count stays at 256, but now there is 1 shared expert always on top of the 8 routed experts.
Every token hits a common pool of knowledge, and then gets routed to 8 of the 256 experts.
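The routing can be sketched in a few lines. This is a scaled-down toy (hidden 64, 16 experts, top-4 instead of 2048 / 256 / top-8) to show the shape of shared-expert routing, not Qwen's actual router, which may score and normalize differently:

```python
import numpy as np

def moe_forward(x, gate_w, routed_experts, shared_expert, top_k=4):
    logits = gate_w @ x                          # one router score per routed expert
    top = np.argsort(logits)[-top_k:]            # indices of the best-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                 # softmax over the selected experts only
    out = shared_expert(x)                       # shared expert is always on
    for weight, idx in zip(w, top):
        out = out + weight * routed_experts[idx](x)
    return out

rng = np.random.default_rng(0)
hidden, n_experts = 64, 16
gate_w = rng.standard_normal((n_experts, hidden)) * 0.1
experts = [
    (lambda v, W=rng.standard_normal((hidden, hidden)) * 0.05: W @ v)
    for _ in range(n_experts)
]
shared = lambda v: 0.1 * v                       # stand-in for the always-on shared expert
y = moe_forward(rng.standard_normal(hidden), gate_w, experts, shared)
print(y.shape)  # (64,)
```

The shared expert's output is added unconditionally, before the weighted routed-expert sum, which is what "a common pool of knowledge" means operationally.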
Attention, briefly
Attention is the mechanism that dynamically decides which tokens in the context the current token should look at, and how much. Given a query derived from the current token and a key for each other token, it computes relevance scores and uses them to mix the value vectors into the next representation.
Standard transformers implement this with softmax attention. Each token looks at every other token precisely, which is accurate but blows up in compute and memory as the context grows.
Linear attention rewrites this as an approximation plus state updates, so long contexts no longer explode; it trades precision for efficiency at long context.
The catch is that a plain linear attention just accumulates information into a state. It has no real control over what to keep and what to forget. That is the gap DeltaNet-style methods fill.
What Gated DeltaNet actually is
Gated DeltaNet is linear attention plus a gating mechanism on top of DeltaNet. Where plain linear attention keeps adding information into the state, DeltaNet-style updates write “the difference between the current prediction and the new value.”
On top of that, Gated DeltaNet can tune how much of the past state to keep and how aggressively to write new information. When the context shifts, old information gets attenuated while still-relevant information stays.
But for cases where you really want precise attention across the whole context, Gated Attention steps in. Qwen3.6 mixes the two at a 4:1 ratio to balance efficiency at long context with local precision.
The KV cache savings we saw with the Mamba-style SSM in Qwen3.5 carry over to Gated DeltaNet.
With Qwen3.5, only 10 out of 40 layers used KV cache; going from ctx-size 4096 to 65536 only added 800 MB of VRAM.
Qwen3.6 also has Gated Attention on just 10 of 40 layers, so KV cache stays in the same ballpark.
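A back-of-envelope estimate using the spec-table values (10 Gated Attention layers, 2 KV heads of head dim 256, BF16) shows why KV growth stays modest. Exact numbers depend on the serving stack, and Qwen3.5's reported ~800 MB delta implies somewhat smaller per-layer KV there, so treat this as an order-of-magnitude sketch:

```python
def kv_cache_bytes(ctx_len, n_layers=10, n_kv_heads=2, head_dim=256, dtype_bytes=2):
    # K and V each store ctx_len * n_kv_heads * head_dim values per attention layer
    return ctx_len * n_layers * n_kv_heads * head_dim * 2 * dtype_bytes

delta = kv_cache_bytes(65536) - kv_cache_bytes(4096)
print(f"{delta / 2**20:.0f} MiB")  # → 1200 MiB
```

With all 40 layers using full attention, the same growth would be 4x larger, which is the saving the hybrid layout buys.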
Benchmarks
First, Qwen3.5-35B-A3B vs Qwen3.6-35B-A3B.
| Benchmark | Qwen3.5-35B-A3B | Qwen3.6-35B-A3B |
|---|---|---|
| SWE-bench Verified | 70.0 | 73.4 |
| SWE-bench Multilingual | 60.3 | 67.2 |
| SWE-bench Pro | 44.6 | 49.5 |
| Terminal-Bench 2.0 | 40.5 | 51.5 |
| QwenClawBench | 47.7 | 52.6 |
| NL2Repo | 20.5 | 29.4 |
| QwenWebBench | 978 | 1397 |
Terminal-Bench 2.0 is up 27% and QwenWebBench is up 43%. The agentic-coding side of the benchmark suite sees the biggest gains.
Against Gemma 4-31B, MCPMark (tool use) is 37.0 vs 18.1, more than double.
SWE-bench Verified is 73.4 vs 52.0. A 3B-active MoE is beating an always-on 31B dense model by a clear margin.
The MCP-first agent design shows up directly in the numbers.
Knowledge and reasoning scores, for reference:
| Benchmark | Score |
|---|---|
| MMLU-Pro | 85.2 |
| GPQA Diamond | 86.0 |
| AIME 2026 | 92.7 |
| MMMU | 81.7 |
It is also multimodal. Image, video, and document processing benchmarks are reported (OmniDocBench 1.5: 89.9, VideoMMU: 83.7).
Thinking Preservation
A new feature in Qwen3.6 is Thinking Preservation.
It lets the model carry thinking tokens across turns in a multi-turn conversation.
Typical reasoning models strip the previous turn's thinking from the chat template, so each turn starts from scratch. Qwen3.6 keeps the reasoning from the previous assistant turn in context and reuses it in the next turn.
For long coding sessions or repo-wide refactors, you can keep going without throwing away the intermediate reasoning.
```python
from openai import OpenAI

# Assumes an OpenAI-compatible endpoint (e.g. a local SGLang/vLLM server)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

messages = [
    {"role": "user", "content": "Refactor this module and keep your reasoning."},
]

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3.6-35B-A3B",
    messages=messages,
    max_tokens=32768,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"preserve_thinking": True},
    },
)
```
Pass `preserve_thinking: True` and it's on. The same flag works through the agent framework (Qwen-Agent) when combined with MCP servers.
Serving
Recommended frameworks are SGLang (≥0.5.10) and vLLM (≥0.19.0). SGLang deployment example:
```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tp-size 8 \
  --mem-fraction-static 0.8 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder
```
`--tool-call-parser qwen3_coder` handles MCP tool call parsing.
Native context length is 262,144 tokens, and YaRN scaling can push it up to roughly 1M tokens.
Multi-Token Prediction (MTP) is also supported. SGLang speculative decoding example:
```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --port 8000 \
  --tp-size 8 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --speculative-algo NEXTN \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4
```
Rough hardware requirements:
| Precision | VRAM |
|---|---|
| FP16/BF16 | ~70GB |
| INT8 quantization | ~35GB |
| INT4 GPTQ | ~18GB |
| GGUF (CPU inference) | ~20GB RAM |
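These figures are roughly parameter count times bytes per weight: all 35B parameters must be resident even though only 3B are active per token. A quick check (ignoring activations, KV cache, and runtime overhead):

```python
params = 35e9  # total parameters; MoE activates 3B but stores all 35B
for name, bits in [("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")
# FP16/BF16: ~70 GB
# INT8: ~35 GB
# INT4: ~18 GB
```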
Techniques like GGUF on AMD ROCm and SSD-streamed inference à la Flash-MoE should carry over to Qwen3.6 as well.
Gated DeltaNet kernels are new, though, so backend support needs to be checked case by case.
Qwen-Agent and MCP integration
There’s an official integration example with the Qwen-Agent library for agentic use cases.
Pass MCP servers into function_list and tool calling just works.
```python
import os

from qwen_agent.agents import Assistant

llm_cfg = {
    'model': 'Qwen3.6-35B-A3B',
    'model_type': 'qwenvl_oai',
    'model_server': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
    'api_key': os.getenv('DASHSCOPE_API_KEY'),
    'generate_cfg': {
        'use_raw_api': True,
        'extra_body': {
            'enable_thinking': True,
            'preserve_thinking': True,
        },
    },
}

tools = [
    {'mcpServers': {
        "filesystem": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/workspace"]
        }
    }}
]

bot = Assistant(llm=llm_cfg, function_list=tools)
```
Passing both `enable_thinking: True` and `preserve_thinking: True` is the officially recommended setup.
Sampling parameters
Precise coding tasks and general tasks have different recommended values.
| Mode | temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (general) | 1.0 | 0.95 | 20 | 1.5 |
| Thinking (precise coding) | 0.6 | 0.95 | 20 | 0.0 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 1.5 |
For precise coding or competitive programming, `temperature=0.6` with `presence_penalty=0.0` keeps the output focused and repeatable.
The math behind linear attention
From here I’ll go through the math. It helps explain where Qwen3.6’s efficiency comes from.
The O(n²) problem in softmax attention
Standard transformer softmax attention uses $O(n^2)$ compute and memory for a context length $n$:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

Here $Q$ (query) is the side doing the lookup, $K$ (key) is what's looked up, and $V$ (value) is what gets pulled out. With $Q, K, V \in \mathbb{R}^{n \times d}$ as matrices, $QK^\top$ produces an $n \times n$ attention matrix internally. It's quadratic in context length, which has been the long-standing bottleneck.
Going O(n) with linear attention
Linear attention approximates softmax with a kernel feature map $\phi$ and swaps the order of matrix multiplications.

Original:

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$

Approximation:

$$\mathrm{Attention}(Q,K,V) \approx \phi(Q)\left(\phi(K)^\top V\right)$$

Compute $\phi(K)^\top V$ first and you get a $d \times d$ state matrix; then for each query, you just multiply by it. No $n \times n$ attention matrix is ever built, so compute drops to $O(n\,d^2)$, linear in $n$. The longer the context, the bigger the gap. The tradeoff is that dropping softmax normalization costs precision, but scaling stays linear in sequence length.
Rewrite the same equation as a recurrence over time, and the RNN-like structure of linear attention becomes visible:

$$S_t = S_{t-1} + v_t\,\phi(k_t)^\top, \qquad o_t = S_t\,\phi(q_t)$$

Per token, accumulate the outer product of $v_t$ and $\phi(k_t)$ into the state, then multiply by the query to get the output. Structurally, linear attention is equivalent to a linear RNN that writes rank-1 updates into its state. This is also why "no KV cache is needed" holds: the state matrix $S_t$ is fixed-size at $d \times d$, independent of sequence length.
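The recurrent and parallel views can be checked against each other in a few lines of NumPy. This is a toy sketch with an assumed ReLU-style feature map, not any production kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 8
phi = lambda x: np.maximum(x, 0.0) + 1e-3      # toy positive feature map (an assumption)
q = phi(rng.standard_normal((n, d)))           # queries/keys already feature-mapped
k = phi(rng.standard_normal((n, d)))
v = rng.standard_normal((n, d))

# Recurrent form: fixed d x d state, one rank-1 update per token -- O(n d^2) total
S = np.zeros((d, d))
out_rec = np.empty((n, d))
for t in range(n):
    S = S + np.outer(v[t], k[t])               # accumulate v_t k_t^T into the state
    out_rec[t] = S @ q[t]                      # read out with the query

# Parallel causal form: materializes the n x n score matrix -- O(n^2 d)
scores = np.tril(q @ k.T)                      # lower-triangular = causal mask
out_par = scores @ v

print(np.allclose(out_rec, out_par))           # True: same outputs, no n x n matrix built
```

The two paths compute identical outputs; only the recurrent one keeps memory flat as `n` grows.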
The overwrite mechanism DeltaNet added
Plain linear attention has a weakness. Since it only “adds” to the state matrix, there’s no way to forget.
As more tokens come in, the state fills up, and old memories get crushed by new ones.
DeltaNet replaced that addition with the delta rule, the same one used in perceptron learning:

$$S_t = S_{t-1} + \beta_t\left(v_t - S_{t-1} k_t\right) k_t^\top$$

It writes only the difference between the new value $v_t$ and the prediction $S_{t-1} k_t$ from the current state, so the state stays consistent while being updated.
DeltaNet’s contribution is that it keeps the linear-attention compute while acting like an associative memory.
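As a sanity check on the associative-memory claim, here is a toy NumPy run of the delta rule, with the write strength fixed to 1 and unit-norm keys. These are illustrative assumptions, not the trained model's parametrization:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
k = rng.standard_normal((4, d))
k /= np.linalg.norm(k, axis=1, keepdims=True)   # unit-norm keys
v = rng.standard_normal((4, d))

S = np.zeros((d, d))
for t in range(4):
    pred = S @ k[t]                              # what the memory currently returns
    S = S + np.outer(v[t] - pred, k[t])          # delta rule: write only the error

# Re-writing key 0 with a new value replaces the old association
# instead of piling on top of it, as pure addition would.
v_new = rng.standard_normal(d)
S = S + np.outer(v_new - S @ k[0], k[0])
print(np.allclose(S @ k[0], v_new))              # True: exact read-back after overwrite
```

With plain additive updates (`S += np.outer(v_new, k[0])`), the old and new values for the same key would be superimposed; the delta rule cancels the stale entry first.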
Gated DeltaNet: adding a forget gate
Gated DeltaNet extends DeltaNet with a forget gate $\alpha_t$. The update roughly looks like this:

$$S_t = \alpha_t\,S_{t-1}\left(I - \beta_t\,k_t k_t^\top\right) + \beta_t\,v_t k_t^\top$$

- Small $\alpha_t$ attenuates the past state; large $\alpha_t$ preserves it
- $\beta_t$ controls how aggressively new information is written
So at a per-token level, you can “forget the past when the context shifts” and “keep information that’s still useful.”
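A minimal NumPy sketch of that gated update, assuming scalar per-step gates and unit-norm keys (illustrative only, not the real kernel):

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update: S <- alpha * S (I - beta k k^T) + beta v k^T."""
    d = len(k)
    return alpha * S @ (np.eye(d) - beta * np.outer(k, k)) + beta * np.outer(v, k)

rng = np.random.default_rng(3)
d = 8
k1 = rng.standard_normal(d); k1 /= np.linalg.norm(k1)
v1 = rng.standard_normal(d)

# alpha=1, beta=1 reduces to plain DeltaNet: exact write, exact read-back
S = gated_delta_step(np.zeros((d, d)), k1, v1, alpha=1.0, beta=1.0)
print(np.allclose(S @ k1, v1))                   # True

# alpha=0.5 with nothing new written: the old association halves every step
for _ in range(10):
    S = gated_delta_step(S, np.zeros(d), np.zeros(d), alpha=0.5, beta=0.0)
print(np.linalg.norm(S @ k1) / np.linalg.norm(v1))  # 0.5 ** 10 ≈ 0.00098
```

The gate is what lets the state "forget the past when the context shifts" while still-relevant entries can be kept by holding $\alpha_t$ near 1.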
Qwen3.6 pairs this with Gated Attention at 4:1 because linear approximation tends to lose precise long-range dependencies, and softmax attention covers that.
For tasks that juggle both local syntactic references and wide repo-level context — coding is the obvious example — this mix earns its keep.
Difference from Mamba-style SSMs
The Mamba-style SSM used up to Qwen3.5 also updates a state-space via recurrence, which looks similar to Gated DeltaNet.
But SSMs are discretizations of continuous-time linear systems. Their state transitions are mostly linear transforms by a (typically diagonal) state matrix $A$.
Gated DeltaNet uses outer products via the delta rule to do rank-1 state updates, which makes associative memory writes and rewrites explicit.
Both can handle long context without a KV cache. The likely reason Qwen3.6 switched away from SSM is that the “correctly reference past instructions and intermediate results” capability that matters for agentic coding is easier to get out of delta-rule-based updates.