
Zhipu AI's GLM-5.1: An Agent Model That Doesn't Degrade After 600+ Iterations


China’s Zhipu AI announced its flagship model GLM-5.1 on April 7 via its official blog. The release drew significant attention on Hacker News, pulling in 497 points and 201 comments.

GLM-5.1’s headline feature is “Long-Horizon Task” capability—the ability to sustain autonomous task execution over hours and hundreds of iterations without performance degradation. Previous LLMs have tended to lose output quality or fall into loops during extended work sessions. GLM-5.1 tackles this problem head-on.

Model Specs

GLM-5.1 builds on its predecessor GLM-5 through post-training enhancements, with a Mixture-of-Experts (MoE) base architecture.

| Spec | GLM-5.1 |
| --- | --- |
| Total Parameters | 744B |
| Active Parameters/Token | 40B |
| Context Window | 200K tokens |
| Max Output Length | 131,072 tokens (reasoning tasks) |
| Pre-training Data | 28.5T tokens |
| License | MIT |
| Inference Frameworks | vLLM, SGLang, KTransformers, Transformers |

The previous GLM-5 shared the same 744B/40B parameter split, but the jump from GLM-4.5 is substantial. GLM-4.5 had 355B total parameters (32B active) with 23T training tokens, so GLM-5.1 roughly doubles the parameter count and adds 24% more training data. The architecture incorporates DeepSeek Sparse Attention (DSA) to reduce the cost of long-context processing.

MoE is the same architecture used by Google Gemma 4 and LLM-jp-4. Only a subset of parameters activates for each token—40B out of 744B—making inference dramatically cheaper compared to a dense model of equivalent size.
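To make the active-parameter idea concrete, here is an illustrative pure-Python sketch of top-k expert routing; GLM-5.1's actual router is not public, and the logits, expert count, and k value below are invented for the example.

```python
# Illustrative top-k MoE routing: for each token, only k of the E experts
# run; the rest stay idle, which is why a 744B model can cost ~40B per token.
import math

def route(expert_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Pick the top-k experts for one token and renormalize their weights."""
    # Softmax over all expert logits (numerically stable form)
    m = max(expert_logits)
    exps = [math.exp(x - m) for x in expert_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k highest-probability experts
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token activates 2 of 8 experts; the other 6 do no work for it.
chosen = route([0.1, 2.0, -1.0, 0.5, 1.7, 0.0, -0.3, 0.9], k=2)
```

The renormalized weights of the chosen experts are then used to mix their outputs; compute scales with k, not with the total expert count.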

Benchmark Results

GLM-5.1 scored 58.4% on SWE-Bench Pro, edging out GPT-5.4 (57.7%) and Claude Opus 4.6 (57.3%) for SOTA. But looking across the full benchmark suite, the model’s strengths and weaknesses are clearly delineated.

Software Engineering

| Benchmark | GLM-5.1 | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| SWE-Bench Pro | 58.4 | 57.7 | 57.3 | 54.2 |
| NL2Repo | 42.7 | 41.3 | 49.8 | 33.4 |
| Terminal-Bench 2.0 | 69.0 | 75.1 | 65.4 | 68.5 |
| CyberGym | 68.7 | 66.6 | n/a | n/a |

Top marks on SWE-Bench Pro and CyberGym (security-related tasks). But on NL2Repo (generating entire repositories from natural language), Claude Opus 4.6 leads by 7 points at 49.8%. Terminal-Bench 2.0 also falls short of GPT-5.4’s 75.1%.

Reasoning

| Benchmark | GLM-5.1 | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| HLE (Tool Use) | 52.3 | 52.1 | 53.1 | 51.4 |
| AIME 2026 | 95.3 | 98.7 | 95.6 | 98.2 |
| GPQA-Diamond | 86.2 | 92.0 | 91.3 | 94.3 |

Reasoning benchmarks tell a different story. GLM-5.1 stays close on HLE with tool use (52.3% vs. Claude Opus 4.6’s 53.1%) but clearly trails on AIME 2026 (math competition) at 95.3% and GPQA-Diamond (expert knowledge) at 86.2%, finishing 3.4 and 8.1 points behind the respective leaders. Zhipu AI itself positions GLM-5.1 as “not a reasoning-first model,” and these numbers confirm that the team prioritized agent task endurance over general reasoning prowess.

Agent Tasks

| Benchmark | GLM-5.1 | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| BrowseComp (w/ context) | 79.3 | 82.7 | 84.0 | 85.9 |
| MCP-Atlas (Public) | 71.8 | 67.2 | 73.8 | 69.2 |
| Tool-Decathlon | 40.7 | 54.6 | 47.2 | 48.8 |

MCP-Atlas puts GLM-5.1 near the top at 71.8%, close to Claude Opus 4.6. But Tool-Decathlon (multi-tool composition) shows GPT-5.4 pulling far ahead at 54.6%. BrowseComp also sees Gemini 3.1 Pro leading with 85.9% against GLM-5.1’s 79.3%.

Long-Horizon Performance

If you only look at the benchmark numbers, GLM-5.1 reads as “SOTA in some coding tasks, on par or behind elsewhere.” But that misses the point.

VectorDBBench

On SIFT-1M, a vector search benchmark with a 95% recall threshold, GLM-5.1 was given 600+ iterations. After 6,000+ tool calls, it reached 21,500 QPS. Claude Opus 4.6’s best single-session result was 3,547 QPS—roughly 6x lower.

The optimization trajectory showed a staircase pattern: six structural pivots in total, the first around iteration 90 and the second around iteration 240, with the rest following later. At each plateau, GLM-5.1 analyzed its own benchmark logs and autonomously decided that the current approach had hit its ceiling and a different structure was needed.
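The pivot decision can be caricatured as plateau detection over the metric history. Here is a hypothetical sketch of that check; the window size and gain threshold are invented, not from Zhipu AI.

```python
# Hypothetical plateau-then-pivot check: keep iterating on one approach
# until the best QPS stops improving, then switch structure.
def should_pivot(qps_history: list[float], window: int = 20,
                 min_gain: float = 0.01) -> bool:
    """Pivot when the best QPS hasn't improved by min_gain (relative)
    over the last `window` iterations."""
    if len(qps_history) <= window:
        return False  # not enough evidence yet
    best_now = max(qps_history)
    best_before = max(qps_history[:-window])
    return best_now < best_before * (1 + min_gain)
```

Run on a flat history this returns True (time to pivot); on a steadily improving one it returns False, which is roughly the staircase behavior described above.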

graph TD
    A[Start: Basic Implementation] --> B[~90 iterations<br/>1st Pivot]
    B --> C[~240 iterations<br/>2nd Pivot]
    C --> D[Pivots 3-6<br/>Structural Redesign]
    D --> E[600+ iterations<br/>21,500 QPS]
    
    F[Claude Opus 4.6<br/>Single Session<br/>3,547 QPS] -.->|~6x gap| E
    
    style E fill:#2d5,stroke:#333,color:#000
    style F fill:#aaa,stroke:#333,color:#000

GPU Optimization

On CUDA kernel benchmarks, GLM-5.1 achieved 3.6x speedup over a PyTorch reference implementation after 1,000+ iterations. Claude Opus 4.6 actually beat it here with 4.2x, so GLM-5.1 doesn’t win everything. Meanwhile, GLM-5 plateaued much earlier, showing that GLM-5.1’s endurance is a clear generational improvement.

How to Use It

API

API access is available through bigmodel.cn. Pricing is quota-based, with peak hours (UTC+8 14:00-18:00) consuming 3x. A promotional rate applies through April 2026.
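A back-of-the-envelope estimator for the peak-hour multiplier might look like this; the base rate of one quota unit per 1K tokens is a placeholder, and whether 18:00 itself counts as peak is an assumption, so check bigmodel.cn for the real rules.

```python
# Toy quota estimator for the 3x peak window (UTC+8 14:00-18:00).
# The 1-unit-per-1K-tokens base rate is made up for illustration.
def quota_cost(tokens: int, hour_utc8: int, peak_multiplier: float = 3.0) -> float:
    """Quota units consumed by one call, given the hour in UTC+8."""
    base = tokens / 1000
    if 14 <= hour_utc8 < 18:  # assumed half-open peak window
        return base * peak_multiplier
    return base
```

The same 10K-token call costs 3x more at 15:00 UTC+8 than at 09:00, which is what makes long-running jobs hard to budget.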

Z.AI Coding Plan

Starting at $10/month, the subscription integrates with coding tools like Claude Code, Cline, Kilo Code, Roo Code, and OpenCode. Just change the model name in your config. With the rise of AI coding editors like Cursor 3, this plug-in-as-backend design makes sense.

Open Source

Weights are available on HuggingFace at zai-org/GLM-5 under MIT license—commercial use, fine-tuning, and redistribution all permitted. An FP8 quantized version (GLM-5.1-FP8) is also available. Here’s a vLLM deployment example:

vllm serve zai-org/GLM-5 \
     --tensor-parallel-size 8 \
     --gpu-memory-utilization 0.85 \
     --speculative-config.method mtp \
     --speculative-config.num_speculative_tokens 3 \
     --tool-call-parser glm47 \
     --reasoning-parser glm45 \
     --enable-auto-tool-choice \
     --served-model-name glm-5
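vLLM exposes an OpenAI-compatible HTTP API, so once the server above is up, a client call might look like the following sketch; the host, port, and path are vLLM's defaults, and the prompt and temperature are arbitrary.

```python
# Build (not send) a request against vLLM's OpenAI-compatible chat endpoint.
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "glm-5") -> urllib.request.Request:
    """Assemble a chat-completions request for the locally served model."""
    payload = {
        "model": model,  # matches --served-model-name above
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
    }
    return urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Explain this stack trace.")
# urllib.request.urlopen(req) would send it once the server is running.
```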

Tensor parallelism across 8 GPUs is the baseline, so running this locally requires serious hardware. At 744B parameters, GLM-5.1 is among the largest open-weight models out there: the FP8 weights alone come to roughly 744GB, and even 4-bit quantization needs 370GB+ of VRAM. A different league entirely from local AI server solutions like AMD Lemonade.
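As a quick sanity check on those memory figures, weight footprint is just parameter count times bytes per parameter; KV cache and activations come on top, so treat these as lower bounds.

```python
# Weight-only memory arithmetic: 1B params at 1 byte/param is ~1 GB.
def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB, ignoring KV cache and activations."""
    return params_billions * bytes_per_param

fp8 = weight_gb(744, 1.0)     # ~744 GB for FP8 weights
int4 = weight_gb(744, 0.5)    # ~372 GB at 4-bit quantization
per_gpu_fp8 = fp8 / 8         # ~93 GB per GPU under 8-way tensor parallelism
```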

Zhipu AI’s Position

Zhipu AI spun out of Tsinghua University’s THUDM group and is one of China’s major LLM developers alongside Baidu, Alibaba (Qwen series), and DeepSeek. The GLM series has been in development since 2021, with specialized models like GLM-OCR also in their lineup.

The differentiation among Chinese LLMs is clear: DeepSeek pursues inference efficiency, Qwen goes wide on multimodality, and Zhipu AI has carved out agent endurance as its niche. The “8 hours, 6,000+ tool calls” number is unmatched right now.

Whether this endurance translates to production environments is an open question, though. Benchmarks like VectorDBBench and GPU kernel optimization feature clearly defined goals with numerically measurable progress. Real-world software development—with ambiguous requirements that shift mid-project—is a different beast, and that’s where users will have to run their own tests.

The Conductor Use Case

GLM-5.1’s benchmarks show it losing to Claude Opus 4.6 and GPT-5.4 in one-shot reasoning and code generation, but no other model can run 600+ iterations without degradation. To leverage this asymmetry, the play isn’t to have GLM-5.1 solve everything itself. Instead, put it in a conductor role: decompose tasks, dispatch them to other models, review results, and issue the next instructions.

Picture a software development scenario. For a large-scale refactoring effort, individual file modifications go to Claude Opus 4.6 or Claude Code sub-agents, while GLM-5.1 maintains awareness of the overall project state, managing hundreds of sub-task dispatches, result verification, and pivot decisions. The “autonomous structural pivoting” demonstrated in VectorDBBench is exactly the capability you want in an orchestration layer.

graph TD
    G[GLM-5.1<br/>Conductor] -->|Task decomposition| C1[Claude Opus 4.6<br/>Code modification]
    G -->|Task decomposition| C2[GPT-5.4<br/>Test generation]
    G -->|Task decomposition| C3[Specialized models<br/>Security review, etc.]
    C1 -->|Results| G
    C2 -->|Results| G
    C3 -->|Results| G
    G -->|Progress evaluation & pivots| G

This pattern resembles Claude Code’s parallel sub-agent architecture or OpenAI’s Swarm multi-agent framework. The difference: the conductor itself can run for hours across thousands of steps without degradation. Existing orchestration layers are either stateless scripts or limited to maintaining state within a single context window, whereas GLM-5.1 combines a 200K context with long-term consistency, keeping project-wide context alive throughout its decision-making.
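A minimal sketch of such a conductor loop, with plain callables standing in for the worker-model APIs; every name, the retry budget, and the escalation convention here are invented for illustration, not from any real SDK.

```python
# Toy conductor loop: dispatch, review, retry, escalate.
from typing import Callable

def conduct(tasks: list[dict], workers: dict[str, Callable[[str], str]],
            review: Callable[[str], bool], max_retries: int = 2) -> list[str]:
    """Send each task to its worker; re-dispatch on failed review,
    escalate after the retry budget runs out."""
    results = []
    for task in tasks:
        worker = workers[task["kind"]]
        for _ in range(max_retries + 1):
            output = worker(task["spec"])
            if review(output):  # accept/reject gate run by the conductor
                results.append(output)
                break
        else:
            # All retries failed: pivot or hand off to a human
            results.append(f"ESCALATE: {task['spec']}")
    return results
```

In the real pattern, `workers` would wrap calls to Claude Opus 4.6, GPT-5.4, and so on, and `review` would itself be a GLM-5.1 call; the loop structure is the part its endurance is supposed to keep stable over thousands of steps.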

Challenges exist, of course. Multi-model API calls mean ballooning latency and cost, and GLM-5.1’s own API charges 3x during peak hours, making long-running job costs hard to predict. And for the conductor role to work, GLM-5.1 needs to accurately evaluate sub-task outputs; given its weaker reasoning benchmarks, whether it can properly review code written by GPT-5.4 or Claude Opus 4.6 requires real-world validation.

As a Review Gate

Zooming into the conductor role, a “review gate” pattern emerges. Consider a large codebase refactoring where hundreds of file modifications are dispatched to Claude Code or Cline sub-agents in parallel. Each agent handles individual modifications well; the problem is cross-cutting consistency. If you change type definitions in one module, do the dependent modules’ tests still pass? If you modify an API interface, does the caller’s error handling follow suit? These cross-cutting checks require maintaining project-wide context across hundreds of iterations.

Traditional CI/CD pipelines cover some of this through automated testing, but verifying design-intent consistency and implicit contracts (“this module assumes X”) is where LLMs have an edge. GLM-5.1’s endurance means it can review an entire multi-hour refactoring session end-to-end. Delegate individual modifications to Claude Opus 4.6, have GLM-5.1 continuously validate each output and decide whether to accept or reject: this division of labor could actually work.

The reasoning benchmark gap matters here, though. A reviewer weaker in reasoning than the reviewee is a dynamic that occurs in human code review too, but LLMs carry the risk of being “confidently wrong.” Whether GLM-5.1 can properly evaluate sophisticated optimization code written by GPT-5.4 or Claude Opus 4.6 can only be settled through production use. One mitigation: scope the review to “design policy compliance” and “interface contract adherence” rather than “logical correctness.” Separating the bird’s-eye-view perspective from fine-grained reasoning judgments puts GLM-5.1 in a position where its strengths actually matter.
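One mechanical version of that scoped review: check contract properties of a diff (here, that frozen interface signatures are never deleted) instead of asking the reviewer to re-derive the logic. The function, rule, and diff format below are illustrative, not part of any real tooling.

```python
# Toy contract gate over a unified diff: reject any diff that deletes
# a line containing a frozen interface signature.
import re

def contract_gate(diff: str, frozen_signatures: list[str]) -> tuple[bool, list[str]]:
    """Return (accepted, violations) for a unified-diff string."""
    violations = [
        sig for sig in frozen_signatures
        # A removed line starts with '-'; flag it if a frozen signature appears
        if re.search(rf"^-.*{re.escape(sig)}", diff, flags=re.MULTILINE)
    ]
    return (not violations, violations)
```

A body-only change passes; a diff that rewrites a frozen signature is rejected and kicked back to the sub-agent. Checks like this stay cheap and objective, which is exactly the part of review that doesn't depend on the reviewer out-reasoning the reviewee.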