
Qwen3-Coder-Next: A Local Coding Agent with 3B Active Parameters

Qwen3-Coder-Next, released by Alibaba’s Qwen team in February 2026, is intriguing. It’s an 80B-parameter model, yet only 3B parameters are active per token. Despite scoring over 70% on SWE-Bench Verified, it runs locally on a single RTX 4090 or even a MacBook Pro.

I’ve been looking for a coding agent that runs locally, so I’ve summarized the technical highlights.

Architecture

It’s based on a Mixture of Experts (MoE); the key specs are as follows.

| Item | Value |
| --- | --- |
| Total parameters | 80B |
| Active parameters | 3B |
| Non-embedding parameters | 79B |
| Layers | 48 (hybrid layout) |
| Experts | 512 |
| Active experts | 10 |
| Shared experts | 1 |
| Hidden size | 2048 |
| Attention heads | Q: 16, KV: 2 |
| Context length | 256K |
| License | Apache 2.0 |

What stands out is the small active-parameter count. While Kimi K2.5 activates 32B out of 1T parameters, Qwen3-Coder-Next activates just 3B out of 80B, roughly a tenth of the active compute per token.

Gated DeltaNet

A key bottleneck for long-context processing is that standard attention scales with the square of the input length. Qwen3-Coder-Next combines Gated DeltaNet (a linear attention with O(n) complexity) and conventional attention in a hybrid design.

As a result, even with 256K tokens, it avoids the typical quadratic slowdown. For agent tasks—where conversation history and tool-call outputs accumulate—this design makes sense.
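The core idea can be sketched as a delta-rule recurrence over a fixed-size "fast weight" state. The following is an illustrative NumPy toy, not the model's actual kernel; the per-step gate `g`, write strength `beta`, and key normalization here are simplifying assumptions.

```python
import numpy as np

def gated_delta_net(q, k, v, g, beta):
    """Toy gated delta-rule linear attention: O(n) in sequence length.

    q, k: (n, d_k), v: (n, d_v); g, beta: (n,) per-step scalars.
    S is a d_k x d_v fast-weight matrix carried across tokens, so the
    per-step state is constant-size no matter how long the context is.
    """
    n, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((n, d_v))
    for t in range(n):
        kt = k[t] / (np.linalg.norm(k[t]) + 1e-8)       # unit-norm key
        S = g[t] * S                                    # gate: decay old memory
        v_old = S.T @ kt                                # value currently stored under kt
        S = S + np.outer(kt, beta[t] * (v[t] - v_old))  # delta rule: overwrite, not just add
        out[t] = S.T @ q[t]                             # read out with the query
    return out
```

Each step costs the same regardless of position, which is why a 256K-token context avoids the quadratic blowup; the hybrid layout then interleaves these layers with standard attention layers to retain precise recall.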

Benchmarks

SWE-Bench Verified

| Model | Score | Active parameters |
| --- | --- | --- |
| Claude Opus 4.5 | 80.9% | Not disclosed |
| GPT-5.2 | 80.0% | Not disclosed |
| Claude Sonnet 4.5 | 77.2% | Not disclosed |
| Kimi K2.5 | 76.8% | 32B |
| Qwen3-Coder-Next | 70.6% | 3B |

Breaking 70% still leaves a gap to the very top models, but the point is that it gets there with only 3B active parameters, delivering comparable performance to models that activate 10–20× as many while using far less compute.

Real-world task evaluation

Results from 16x Engineer.

| Task | Score |
| --- | --- |
| Markdown formatting (medium) | 9.25/10 |
| Folder watcher fix (regular) | 8.75/10 |
| Next.js TODO feature (easy) | 8.0/10 |
| Benchmark visualization (hard) | 7.0/10 |
| TypeScript type narrowing (special) | 1.0/10 |
| Overall average | 6.8/10 |

It’s strong on standard mid-difficulty coding tasks, while niche patterns like TypeScript type narrowing are a weak spot. Among open-source models it outperforms DeepSeek V3, ranking just behind Kimi K2.

Running locally

Hardware requirements

With quantization, it runs on surprisingly practical hardware.

  • RTX 4090 (24GB VRAM): Works with Q4 quantization
  • MacBook Pro (sufficient RAM): Runs in GGUF format
  • Recommended: 64GB+ system RAM
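A back-of-the-envelope check explains these numbers (my own estimate, not official figures, assuming roughly 4.5 bits per weight for Q4-class quantization): the full quantized weights exceed a 4090's 24 GB, so runtimes keep the attention layers and active experts on the GPU while offloading the remaining expert weights to system RAM, hence the 64GB+ recommendation.

```python
# Rough memory estimate (assumption: ~4.5 bits/weight for Q4-class quants)
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

total = quantized_size_gb(80e9, 4.5)   # all 80B weights on disk / in RAM
active = quantized_size_gb(3e9, 4.5)   # only ~3B weights touched per token
print(f"full Q4 weights: ~{total:.0f} GB, active per token: ~{active:.1f} GB")
# → full Q4 weights: ~45 GB, active per token: ~1.7 GB
```

Only a small slice of the weights is read per token, which is what makes CPU offload of the inactive experts tolerable in practice.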

Supported formats

| Format | Use |
| --- | --- |
| Safetensors (BF16) | Full model |
| FP8 | Quantized (reduced memory) |
| GGUF | For llama.cpp/Ollama |

Running with Ollama

```shell
ollama run qwen3-coder-next
```

Running with llama.cpp

```shell
./llama-server -m Qwen3-Coder-Next.gguf -c 32768 -ngl 99
```

If you run short on memory, reduce the context length; the -c 32768 flag in the example above already caps it at 32K tokens instead of the full 256K.
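The context length matters because the KV cache grows linearly with it. A rough estimate from the specs table, assuming for simplicity that all 48 layers were standard attention with 2 KV heads and a head dim of 128 (2048 hidden / 16 Q heads) at fp16; the hybrid layout makes the real figure smaller, since DeltaNet layers keep only constant-size state:

```python
# KV-cache estimate from the specs above (head_dim = 2048/16 = 128 is an assumption)
def kv_cache_gib(ctx: int, layers: int = 48, kv_heads: int = 2,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # factor of 2 for storing both K and V per layer
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx / 2**30

print(f"32K context:  ~{kv_cache_gib(32_768):.1f} GiB")   # → ~1.5 GiB
print(f"256K context: ~{kv_cache_gib(262_144):.1f} GiB")  # → ~12.0 GiB
```

So dropping from 256K to 32K frees on the order of 10 GiB even before the hybrid layout's savings.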

Agent integration

Supported platforms

Works with major coding assistants such as Claude Code, Qwen Code, Cline, Continue, and Cursor. It exposes an OpenAI-compatible API, so you can drop it into existing toolchains.

Start a server with vLLM

```shell
pip install 'vllm>=0.15.0'
vllm serve Qwen/Qwen3-Coder-Next \
  --port 8000 \
  --tensor-parallel-size 2 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```

Start a server with SGLang

```shell
pip install 'sglang[all]>=0.5.8'
python -m sglang.launch_server \
  --model Qwen/Qwen3-Coder-Next \
  --port 30000 \
  --tp-size 2 \
  --tool-call-parser qwen3_coder
```

Tool calling

Supports OpenAI-compatible function calling.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read the contents of a file",
        "parameters": {
            "type": "object",
            "required": ["path"],
            "properties": {
                "path": {"type": "string", "description": "File path"}
            }
        }
    }
}]

response = client.chat.completions.create(
    model="Qwen3-Coder-Next",
    messages=[{"role": "user", "content": "Check the contents of main.py"}],
    tools=tools,
    max_tokens=65536
)
```

Recommended sampling settings:

  • Temperature: 1.0
  • Top P: 0.95
  • Top K: 40
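To close the agent loop, you execute the requested tool and send the result back as a `tool` message. Here is a minimal dispatch sketch; the tool-call dicts are mocked so the snippet runs without a server (the real SDK returns objects with attribute access rather than dicts), and `read_file` is the hypothetical tool from the example above.

```python
import json

def read_file(path: str) -> str:
    # Hypothetical tool implementation for the example
    with open(path, encoding="utf-8") as f:
        return f.read()

TOOLS = {"read_file": read_file}

def handle_tool_calls(tool_calls, messages):
    """Execute each requested tool and append a 'tool' result message."""
    for call in tool_calls:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": fn(**args),
        })
    return messages
```

You would then pass the extended `messages` list back into `client.chat.completions.create` so the model can use the tool output in its next step.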

Notes

  • No “thinking” mode: <think></think> blocks are not generated. If you want to see the reasoning process, consider a different model.
  • Struggles with unusual patterns: Accuracy drops on less common code patterns such as TypeScript type narrowing.
  • UI generation is shaky: Formatting issues have been reported on visualization tasks.

Impressions

There’s still a gap versus paid models like the Claude 4 line and GPT-5 series, but as a local coding agent it’s among the strongest right now. Clearing 70% on SWE-Bench with just 3B active parameters is remarkable in terms of efficiency.

Not having to worry about API costs is a big plus. It’s a solid option if you don’t want to send private codebases out of your environment or need to develop offline.

That said, I’m currently using top-tier paid models—Claude Opus 4.5, GPT-5.2, and Gemini 2.5 Pro—so I’ll likely reach for this less often. With Claude Sonnet 5 on the way, I’d still like local LLMs to be around Sonnet 4.5 in capability. I may try some lighter tasks and compare.
