Qwen3-Coder-Next: A Local Coding Agent with 3B Active Parameters
Qwen3-Coder-Next, released by Alibaba’s Qwen team in February 2026, is intriguing. It’s an 80B-parameter model, yet only 3B are actually activated. Despite scoring over 70% on SWE-Bench Verified, it runs locally on a single RTX 4090 or even a MacBook Pro.
I’ve been looking for a coding agent that runs locally, so I’ve summarized the technical highlights.
Architecture
It’s based on a Mixture of Experts (MoE); the key specs are as follows.
| Item | Value |
|---|---|
| Total parameters | 80B |
| Active parameters | 3B |
| Non-embedding parameters | 79B |
| Layers | 48 (hybrid layout) |
| Experts | 512 |
| Active experts | 10 |
| Shared experts | 1 |
| Hidden size | 2048 |
| Attention heads | Q:16, KV:2 |
| Context length | 256K |
| License | Apache 2.0 |
What stands out is the ratio of active to total parameters. While Kimi K2.5 activates 32B out of 1T, Qwen3-Coder-Next activates just 3B out of 80B. The parameter efficiency is dramatically higher.
Gated DeltaNet
A key bottleneck for long-context processing is that standard attention scales with the square of the input length. Qwen3-Coder-Next combines Gated DeltaNet (a linear attention with O(n) complexity) and conventional attention in a hybrid design.
As a result, even with 256K tokens, it avoids the typical quadratic slowdown. For agent tasks—where conversation history and tool-call outputs accumulate—this design makes sense.
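Why linear attention is O(n) can be seen in a toy sketch. The snippet below drops the softmax and shows the associativity trick that DeltaNet-style layers build on: causal attention output can be computed either from the full n×n score matrix or from a small running state. This is only an illustration of the principle, not Gated DeltaNet itself (which adds decay gates and a delta-rule state update on top).

```python
import numpy as np

# Toy illustration: with a non-softmax kernel, causal attention can be
# rewritten as a running-state recurrence, turning O(n^2) into O(n).
rng = np.random.default_rng(0)
n, d = 6, 4  # sequence length, head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Quadratic form: build the full n x n causal score matrix.
mask = np.tril(np.ones((n, n)))
out_quadratic = ((Q @ K.T) * mask) @ V

# Linear form: keep a d x d state S = sum_{j<=i} k_j v_j^T and read it
# with each query. Cost per step is constant regardless of n.
S = np.zeros((d, d))
out_linear = np.zeros((n, d))
for i in range(n):
    S += np.outer(K[i], V[i])  # Gated DeltaNet gates/decays this update
    out_linear[i] = Q[i] @ S
```

Both paths produce identical outputs, but the recurrent one never materializes the n×n matrix, which is what makes 256K-token contexts tractable.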
Benchmarks
SWE-Bench Verified
| Model | Score | Active parameters |
|---|---|---|
| Claude Opus 4.5 | 80.9% | Not disclosed |
| GPT-5.2 | 80.0% | Not disclosed |
| Claude Sonnet 4.5 | 77.2% | Not disclosed |
| Kimi K2.5 | 76.8% | 32B |
| Qwen3-Coder-Next | 70.6% | 3B |
Breaking 70% doesn't put it on par with the very top models, but the point is that it does so with only 3B active parameters. It delivers a similar level of performance to models with 10–20× the parameters while using far less compute.
Real-world task evaluation
Results from 16x Engineer.
| Task | Score |
|---|---|
| Markdown formatting (medium) | 9.25/10 |
| Folder watcher fix (regular) | 8.75/10 |
| Next.js TODO feature (easy) | 8.0/10 |
| Benchmark visualization (hard) | 7.0/10 |
| TypeScript type narrowing (special) | 1.0/10 |
| Overall average | 6.8/10 |
It’s strong on standard mid-difficulty coding tasks, while niche patterns like TypeScript type narrowing are a weak spot. Among open-source models it outperforms DeepSeek V3, ranking just behind Kimi K2.
Running locally
Hardware requirements
With quantization, it runs on surprisingly practical hardware.
- RTX 4090 (24GB VRAM): Works with Q4 quantization
- MacBook Pro (sufficient RAM): Runs in GGUF format
- Recommended: 64GB+ system RAM
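A back-of-envelope calculation shows why these requirements look the way they do. The bits-per-parameter figures below are approximate (Q4 GGUF variants average roughly 4.5 bits per weight once quantization scales are included), so treat the numbers as rough sizes, not exact footprints.

```python
# Approximate weight sizes for an 80B-parameter model at common precisions.
PARAMS = 80e9

def weight_gb(bits_per_param: float) -> float:
    """Model weight size in GB at the given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("Q4 (approx)", 4.5)]:
    print(f"{name:12s} ~{weight_gb(bits):6.1f} GB")
```

Even at Q4 the weights come to roughly 45 GB, which is why a 24GB card alone isn't enough: the bulk of the (mostly inactive) expert weights has to sit in system RAM, hence the 64GB+ recommendation.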
Supported formats
| Format | Use |
|---|---|
| Safetensors (BF16) | Full model |
| FP8 | Quantized (reduced memory) |
| GGUF | For llama.cpp/Ollama |
Running with Ollama
```bash
ollama run qwen3-coder-next
```
Running with llama.cpp
```bash
./llama-server -m Qwen3-Coder-Next.gguf -c 32768 -ngl 99
```
If you run short on memory, keep the context length (the `-c` flag, 32,768 above) well below the full 256K.
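How much the context length matters can be estimated with a rough upper bound. The sketch below pretends all 48 layers use standard GQA attention with fp16 caches; in reality many layers are DeltaNet with a constant-size state, so the true footprint is smaller. The head dimension of 128 is an assumption inferred from hidden size 2048 / 16 query heads.

```python
# Rough UPPER BOUND on fp16 KV-cache size, assuming all 48 layers were
# standard attention (the hybrid layout makes the real number smaller).
LAYERS, KV_HEADS, HEAD_DIM = 48, 2, 128  # head_dim assumed = 2048 / 16
BYTES = 2                                # fp16/bf16 per value

def kv_cache_gb(context: int) -> float:
    # 2 tensors (K and V) per layer per token
    return context * LAYERS * 2 * KV_HEADS * HEAD_DIM * BYTES / 1e9

print(f"{kv_cache_gb(32_768):.1f} GB at 32K")    # ~1.6 GB
print(f"{kv_cache_gb(262_144):.1f} GB at 256K")  # ~12.9 GB
```

Cutting the context from 256K to 32K saves on the order of 10 GB of cache in this worst case, on top of the weights themselves.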
Agent integration
Supported platforms
Works with major coding assistants such as Claude Code, Qwen Code, Cline, Continue, and Cursor. It exposes an OpenAI-compatible API, so you can drop it into existing toolchains.
Start a server with vLLM
```bash
pip install 'vllm>=0.15.0'

vllm serve Qwen/Qwen3-Coder-Next \
  --port 8000 \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
Start a server with SGLang
```bash
pip install 'sglang[all]>=0.5.8'

python -m sglang.launch_server \
  --model Qwen/Qwen3-Coder-Next \
  --port 30000 \
  --tp-size 2 \
  --tool-call-parser qwen3_coder
```
Tool calling
Supports OpenAI-compatible function calling.
```python
from openai import OpenAI

# Point the client at the local vLLM server; no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read the contents of a file",
        "parameters": {
            "type": "object",
            "required": ["path"],
            "properties": {
                "path": {"type": "string", "description": "File path"}
            }
        }
    }
}]

response = client.chat.completions.create(
    model="Qwen/Qwen3-Coder-Next",  # must match the name passed to `vllm serve`
    messages=[{"role": "user", "content": "Check the contents of main.py"}],
    tools=tools,
    max_tokens=65536,
)
```
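The response may come back with tool calls instead of an answer, and your code is responsible for executing them and feeding the results back. Below is a minimal dispatch sketch; in real use, `name` and `arguments` come from `response.choices[0].message.tool_calls[i].function`, but here one call is simulated so the snippet is self-contained.

```python
import json
import os
import tempfile

# Tool implementations, keyed by the function name the model emits.
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

TOOLS = {"read_file": read_file}

def dispatch(name: str, arguments: str) -> str:
    """Run one tool call; the model sends arguments as a JSON string."""
    args = json.loads(arguments)
    return TOOLS[name](**args)

# Simulate the model asking to read a file.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("print('hello')\n")
result = dispatch("read_file", json.dumps({"path": tmp.name}))
os.unlink(tmp.name)
```

The returned string then goes back to the model as a message with role `"tool"` and the matching `tool_call_id`, and the loop repeats until the model answers without further tool calls.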
Recommended sampling parameters
- Temperature: 1.0
- Top P: 0.95
- Top K: 40
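What Top K and Top P actually do can be illustrated on a toy next-token distribution (the server applies these filters internally; this is purely for intuition, with made-up probabilities):

```python
import numpy as np

# Toy next-token distribution over 6 candidate tokens.
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

def top_k_mask(p, k):
    """Keep only the k most probable tokens, then renormalize."""
    keep = np.argsort(p)[::-1][:k]
    out = np.zeros_like(p)
    out[keep] = p[keep]
    return out / out.sum()

def top_p_mask(p, top_p):
    """Keep the smallest set of tokens whose cumulative mass reaches top_p."""
    order = np.argsort(p)[::-1]
    csum = np.cumsum(p[order])
    cutoff = np.searchsorted(csum, top_p) + 1
    out = np.zeros_like(p)
    out[order[:cutoff]] = p[order[:cutoff]]
    return out / out.sum()

print(top_k_mask(probs, 3))     # only the 3 most probable tokens survive
print(top_p_mask(probs, 0.95))  # the long tail past 95% mass is cut off
```

Note that `top_k` is not a standard OpenAI API parameter; when talking to vLLM through the OpenAI client, it can be passed via `extra_body={"top_k": 40}`.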
Notes
- No “thinking” mode: `<think></think>` blocks are not generated. If you want to see the reasoning process, consider a different model.
- Struggles with unusual patterns: accuracy drops on less common code patterns such as TypeScript type narrowing.
- UI generation is shaky: formatting issues have been reported on visualization tasks.
Impressions
There’s still a gap versus paid models like the Claude 4 line and GPT-5 series, but as a local coding agent it’s among the strongest right now. Clearing 70% on SWE-Bench with just 3B active parameters is remarkable in terms of efficiency.
Not having to worry about API costs is a big plus. It’s a solid option if you don’t want to send private codebases out of your environment or need to develop offline.
That said, I’m currently using top-tier paid models—Claude Opus 4.5, GPT-5.2, and Gemini 2.5 Pro—so I’ll likely reach for this less often. With Claude Sonnet 5 on the way, I’d still like local LLMs to be around Sonnet 4.5 in capability. I may try some lighter tasks and compare.