Qwen3-Coder-Next: A Local Coding Agent with 3B Active Parameters
Qwen3-Coder-Next, released by Alibaba’s Qwen team in February 2026, is intriguing. It’s an 80B-parameter model, yet only 3B are actually activated. Despite scoring over 70% on SWE-Bench Verified, it runs locally on a single RTX 4090 or even a MacBook Pro.
I’ve been looking for a coding agent that runs locally, so I’ve summarized the technical highlights.
Architecture
It’s based on a Mixture of Experts (MoE); the key specs are as follows.
| Item | Value |
|---|---|
| Total parameters | 80B |
| Active parameters | 3B |
| Non-embedding parameters | 79B |
| Layers | 48 (hybrid layout) |
| Experts | 512 |
| Active experts | 10 |
| Shared experts | 1 |
| Hidden size | 2048 |
| Attention heads | Q:16, KV:2 |
| Context length | 256K |
| License | Apache 2.0 |
What stands out is the ratio of active to total parameters. While Kimi K2.5 activates 32B out of 1T, Qwen3-Coder-Next activates just 3B out of 80B. The parameter efficiency is dramatically higher.
Gated DeltaNet
A key bottleneck for long-context processing is that standard attention scales with the square of the input length. Qwen3-Coder-Next combines Gated DeltaNet (a linear attention with O(n) complexity) and conventional attention in a hybrid design.
As a result, even with 256K tokens, it avoids the typical quadratic slowdown. For agent tasks—where conversation history and tool-call outputs accumulate—this design makes sense.
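Why linear attention is O(n) can be seen in a toy sketch. The snippet below drops the softmax and shows the associativity trick that DeltaNet-style layers build on: causal attention output can be computed either from the full n×n score matrix or from a small running state. This is only an illustration of the principle, not Gated DeltaNet itself (which adds decay gates and a delta-rule state update on top).

```python
import numpy as np

# Toy illustration: with a non-softmax kernel, causal attention can be
# rewritten as a running-state recurrence, turning O(n^2) into O(n).
rng = np.random.default_rng(0)
n, d = 6, 4  # sequence length, head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Quadratic form: build the full n x n causal score matrix.
mask = np.tril(np.ones((n, n)))
out_quadratic = ((Q @ K.T) * mask) @ V

# Linear form: keep a d x d state S = sum_{j<=i} k_j v_j^T and read it
# with each query. Cost per step is constant regardless of n.
S = np.zeros((d, d))
out_linear = np.zeros((n, d))
for i in range(n):
    S += np.outer(K[i], V[i])  # Gated DeltaNet gates/decays this update
    out_linear[i] = Q[i] @ S
```

Both paths produce identical outputs, but the recurrent one never materializes the n×n matrix, which is what makes 256K-token contexts tractable.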
Benchmarks
SWE-Bench Verified
| Model | Score | Active parameters |
|---|---|---|
| Claude Opus 4.5 | 80.9% | Not disclosed |
| GPT-5.2 | 80.0% | Not disclosed |
| Claude Sonnet 4.5 | 77.2% | Not disclosed |
| Kimi K2.5 | 76.8% | 32B |
| Qwen3-Coder-Next | 70.6% | 3B |
Breaking 70% doesn't put it on par with the very top models, but the point is that it does so with only 3B active parameters. It delivers a similar level of performance to models with 10–20× the parameters while using far less compute.
Real-world task evaluation
Results from 16x Engineer.
| Task | Score |
|---|---|
| Markdown formatting (medium) | 9.25/10 |
| Folder watcher fix (regular) | 8.75/10 |
| Next.js TODO feature (easy) | 8.0/10 |
| Benchmark visualization (hard) | 7.0/10 |
| TypeScript type narrowing (special) | 1.0/10 |
| Overall average | 6.8/10 |
It’s strong on standard mid-difficulty coding tasks, while niche patterns like TypeScript type narrowing are a weak spot. Among open-source models it outperforms DeepSeek V3, ranking just behind Kimi K2.
Running locally
Hardware requirements
With quantization, it runs on surprisingly practical hardware.
- RTX 4090 (24GB VRAM): Works with Q4 quantization
- MacBook Pro (sufficient RAM): Runs in GGUF format
- Recommended: 64GB+ system RAM
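A back-of-envelope calculation shows why these requirements look the way they do. The bits-per-parameter figures below are approximate (Q4 GGUF variants average roughly 4.5 bits per weight once quantization scales are included), so treat the numbers as rough sizes, not exact footprints.

```python
# Approximate weight sizes for an 80B-parameter model at common precisions.
PARAMS = 80e9

def weight_gb(bits_per_param: float) -> float:
    """Model weight size in GB at the given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("Q4 (approx)", 4.5)]:
    print(f"{name:12s} ~{weight_gb(bits):6.1f} GB")
```

Even at Q4 the weights come to roughly 45 GB, which is why a 24GB card alone isn't enough: the bulk of the (mostly inactive) expert weights has to sit in system RAM, hence the 64GB+ recommendation.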
Supported formats
| Format | Use |
|---|---|
| Safetensors (BF16) | Full model |
| FP8 | Quantized (reduced memory) |
| GGUF | For llama.cpp/Ollama |
Running with Ollama
```bash
ollama run qwen3-coder-next
```
Running with llama.cpp
```bash
./llama-server -m Qwen3-Coder-Next.gguf -c 32768 -ngl 99
```
If you run short on memory, keep the context length (the `-c` flag, 32,768 above) well below the full 256K.
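How much the context length matters can be estimated with a rough upper bound. The sketch below pretends all 48 layers use standard GQA attention with fp16 caches; in reality many layers are DeltaNet with a constant-size state, so the true footprint is smaller. The head dimension of 128 is an assumption inferred from hidden size 2048 / 16 query heads.

```python
# Rough UPPER BOUND on fp16 KV-cache size, assuming all 48 layers were
# standard attention (the hybrid layout makes the real number smaller).
LAYERS, KV_HEADS, HEAD_DIM = 48, 2, 128  # head_dim assumed = 2048 / 16
BYTES = 2                                # fp16/bf16 per value

def kv_cache_gb(context: int) -> float:
    # 2 tensors (K and V) per layer per token
    return context * LAYERS * 2 * KV_HEADS * HEAD_DIM * BYTES / 1e9

print(f"{kv_cache_gb(32_768):.1f} GB at 32K")    # ~1.6 GB
print(f"{kv_cache_gb(262_144):.1f} GB at 256K")  # ~12.9 GB
```

Cutting the context from 256K to 32K saves on the order of 10 GB of cache in this worst case, on top of the weights themselves.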
Agent integration
Supported platforms
Works with major coding assistants such as Claude Code, Qwen Code, Cline, Continue, and Cursor. It exposes an OpenAI-compatible API, so you can drop it into existing toolchains.
Start a server with vLLM
```bash
pip install 'vllm>=0.15.0'

vllm serve Qwen/Qwen3-Coder-Next \
  --port 8000 \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
Start a server with SGLang
```bash
pip install 'sglang[all]>=0.5.8'

python -m sglang.launch_server \
  --model Qwen/Qwen3-Coder-Next \
  --port 30000 \
  --tp-size 2 \
  --tool-call-parser qwen3_coder
```
Tool calling
Supports OpenAI-compatible function calling.
```python
from openai import OpenAI

# Point the client at the local vLLM server; no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read the contents of a file",
        "parameters": {
            "type": "object",
            "required": ["path"],
            "properties": {
                "path": {"type": "string", "description": "File path"}
            }
        }
    }
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",  # must match the name passed to `vllm serve`
    messages=[{"role": "user", "content": "Check the contents of main.py"}],
    tools=tools,
    max_tokens=65536,
)
```
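The response may come back with tool calls instead of an answer, and your code is responsible for executing them and feeding the results back. Below is a minimal dispatch sketch; in real use, `name` and `arguments` come from `response.choices[0].message.tool_calls[i].function`, but here one call is simulated so the snippet is self-contained.

```python
import json
import os
import tempfile

# Tool implementations, keyed by the function name the model emits.
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

TOOLS = {"read_file": read_file}

def dispatch(name: str, arguments: str) -> str:
    """Run one tool call; the model sends arguments as a JSON string."""
    args = json.loads(arguments)
    return TOOLS[name](**args)

# Simulate the model asking to read a file.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("print('hello')\n")
result = dispatch("read_file", json.dumps({"path": tmp.name}))
os.unlink(tmp.name)
```

The returned string then goes back to the model as a message with role `"tool"` and the matching `tool_call_id`, and the loop repeats until the model answers without further tool calls.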
Recommended sampling parameters
- Temperature: 1.0
- Top P: 0.95
- Top K: 40
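What Top K and Top P actually do can be illustrated on a toy next-token distribution (the server applies these filters internally; this is purely for intuition, with made-up probabilities):

```python
import numpy as np

# Toy next-token distribution over 6 candidate tokens.
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

def top_k_mask(p, k):
    """Keep only the k most probable tokens, then renormalize."""
    keep = np.argsort(p)[::-1][:k]
    out = np.zeros_like(p)
    out[keep] = p[keep]
    return out / out.sum()

def top_p_mask(p, top_p):
    """Keep the smallest set of tokens whose cumulative mass reaches top_p."""
    order = np.argsort(p)[::-1]
    csum = np.cumsum(p[order])
    cutoff = np.searchsorted(csum, top_p) + 1
    out = np.zeros_like(p)
    out[order[:cutoff]] = p[order[:cutoff]]
    return out / out.sum()

print(top_k_mask(probs, 3))     # only the 3 most probable tokens survive
print(top_p_mask(probs, 0.95))  # the long tail past 95% mass is cut off
```

Note that `top_k` is not a standard OpenAI API parameter; when talking to vLLM through the OpenAI client, it can be passed via `extra_body={"top_k": 40}`.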
Notes
- No “thinking” mode: `<think></think>` blocks are not generated. If you want to see the reasoning process, consider a different model.
- Struggles with unusual patterns: accuracy drops on less common code patterns such as TypeScript type narrowing.
- UI generation is shaky: formatting issues have been reported on visualization tasks.
Impressions
There’s still a gap versus paid models like the Claude 4 line and GPT-5 series, but as a local coding agent it’s among the strongest right now. Clearing 70% on SWE-Bench with just 3B active parameters is remarkable in terms of efficiency.
Not having to worry about API costs is a big plus. It’s a solid option if you don’t want to send private codebases out of your environment or need to develop offline.
That said, I’m currently using top-tier paid models—Claude Opus 4.5, GPT-5.2, and Gemini 2.5 Pro—so I’ll likely reach for this less often. With Claude Sonnet 5 on the way, I’d still like local LLMs to be around Sonnet 4.5 in capability. I may try some lighter tasks and compare.