
Claude Fable 5 suspended: Kimi K2.7 Code & Qwen3.7 Max as Claude Code backends

Ikesan

Claude Fable 5 stopped on June 12, and the way I read Chinese agent models shifted a little.
In April it was a release race (“another 1T MoE,” “another benchmark bump”), but now the front-and-center questions are whether a model can be swapped in as the backend for Claude Code or Cline, how many tool calls it can make across a long-running task, and how far it keeps output tokens down.

I covered the suspension itself in the post on Claude Fable 5 and Mythos 5 being fully suspended.
When I wrote that on June 13, it was a short-term-evacuation story: drop Fable from your assumptions and fall back to Opus 4.8 or Sonnet 4.6.
Add Kimi K2.7 Code (released June 12) to the picture, then line up Qwen3.7 Max, DeepSeek V4, and GLM-5.1, and the question stops being “what replaces Claude” and becomes “where do you base your long-running agents.”

Kimi K2.7 Code leads with Claude Code compatibility

Moonshot AI released Kimi K2.7 Code on June 12.
The Kimi API model list added kimi-k2.7-code and kimi-k2.7-code-highspeed, both with 256K context.
K2.7 Code is described as Kimi’s “strongest coding model,” with long-context instruction following, long-horizon coding, and agent ability all treated as improvements over K2.6.
The weights are also up on Hugging Face under a Modified MIT license. It’s not API-only.

What’s notable is that the official docs spell out drop-in instructions for Claude Code, Cline, Roo Code, and OpenCode, tool by tool.
For Claude Code, they spell out setting ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic and switching ANTHROPIC_MODEL to kimi-k2.7-code. In practice you also rewrite the Opus/Sonnet/Haiku model assignments and the auto-compaction window together.
Read three days after Fable’s suspension, it looks like a pretty blatant land grab.
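
As a rough sketch of that setup in Python: the base URL and the model name are the two values quoted above. Everything else is my assumption and should be checked against the Kimi and Claude Code docs: ANTHROPIC_AUTH_TOKEN as the variable that carries the API key, and the three ANTHROPIC_DEFAULT_*_MODEL names as the way to reassign the Opus/Sonnet/Haiku slots. The auto-compaction window is left out because I don’t want to guess its variable name.

```python
import os


def kimi_claude_code_env(api_key: str, model: str = "kimi-k2.7-code") -> dict[str, str]:
    """Build an environment for launching Claude Code against Kimi's endpoint."""
    env = dict(os.environ)  # start from the current environment
    env.update({
        # These two values are the ones quoted from the official docs.
        "ANTHROPIC_BASE_URL": "https://api.moonshot.ai/anthropic",
        "ANTHROPIC_MODEL": model,
        # Assumption: the variable that carries the API key.
        "ANTHROPIC_AUTH_TOKEN": api_key,
        # Assumption: per-tier overrides so every slot resolves to the same model.
        "ANTHROPIC_DEFAULT_OPUS_MODEL": model,
        "ANTHROPIC_DEFAULT_SONNET_MODEL": model,
        "ANTHROPIC_DEFAULT_HAIKU_MODEL": model,
    })
    return env


# Usage (not executed here): pass the dict when spawning the CLI, e.g.
#   subprocess.run(["claude"], env=kimi_claude_code_env(my_key))
```

Building the dict in one place makes it easy to switch to kimi-k2.7-code-highspeed, or back to Anthropic, without touching your shell profile.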

That said, K2.7 Code isn’t a compatibility model you can hit however you like.
Per the official docs, disabling thinking throws an error, temperature is fixed at 1.0, top_p is fixed at 0.95, and tool_choice can only be auto or none.
In multi-step tool calls, you get an error unless you keep the current turn’s reasoning_content in context.
It has the OpenAI-compatible and Anthropic-compatible shapes, but the harness side also has to run within Kimi’s constraints.
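
To make those constraints concrete, here is a minimal pre-flight check a harness could run before sending a request. It is a sketch, not Kimi’s actual validation: the payload shape follows the OpenAI-style chat format, the `thinking: {"type": "disabled"}` toggle is a hypothetical field name, and I’m treating “the current turn” as every message after the last user message.

```python
def kimi_request_problems(payload: dict) -> list[str]:
    """Return the constraint violations found in a chat request payload."""
    problems = []

    # Assumption: a disabled-thinking toggle would look like this.
    thinking = payload.get("thinking")
    if isinstance(thinking, dict) and thinking.get("type") == "disabled":
        problems.append("thinking cannot be disabled")

    # Sampling parameters are fixed, so any other explicit value is an error.
    if payload.get("temperature", 1.0) != 1.0:
        problems.append("temperature is fixed at 1.0")
    if payload.get("top_p", 0.95) != 0.95:
        problems.append("top_p is fixed at 0.95")

    if payload.get("tool_choice", "auto") not in ("auto", "none"):
        problems.append("tool_choice must be 'auto' or 'none'")

    # Current turn = everything after the last user message (assumption).
    messages = payload.get("messages", [])
    last_user = max(
        (i for i, m in enumerate(messages) if m.get("role") == "user"),
        default=-1,
    )
    for message in messages[last_user + 1:]:
        is_tool_step = message.get("role") == "assistant" and message.get("tool_calls")
        if is_tool_step and not message.get("reasoning_content"):
            problems.append("assistant tool-call message is missing reasoning_content")

    return problems
```

A harness that strips reasoning from history to save tokens would trip the last check, which is exactly the kind of difference that only shows up mid-session.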

Back at K2.6, in the Qwen3.6-Max-Preview vs Kimi K2.6 comparison, I was already looking at concurrent multi-agent execution and long-horizon coding.
With K2.7 Code, Kimi narrowed the pitch to “coding-only,” “Claude Code compatible,” and “a high-speed variant.”
More than the model-card numbers, the differentiator this time is that they built the on-ramp into existing agent UIs first.

Qwen3.7 Max headlines 35-hour autonomous runs

On the Qwen side, Qwen3.7 Max shipped in May, and Qwen3.7 Plus followed in June.
Max is the text-centric flagship; the official blog put agent-harness compatibility and long-horizon autonomous planning and execution up front.
Together AI’s model page lists a 1M context, SWE-bench Verified 80.4%, SWE-bench Pro 60.6%, Terminal-Bench 2.0-Terminus 69.7, and roughly 35 hours of autonomous execution.

In April’s Qwen3.6-Max-Preview vs Kimi K2.6, the contrast was Qwen as an API-only flagship and Kimi as a large open-weight model.
With Qwen3.7 Max, that API-only line got stronger: it keeps the top tier on the cloud side while extending how long the agent runs.
Qwen3.7 Plus is the multimodal-agent branch that combines vision and language, leaning toward tasks with GUI operation and image/video input.

Looking at this trend, Qwen has started pushing “Qwen you run for a long time on Alibaba Cloud” over “Qwen you can run locally.”
If you want to touch it locally, the previous-generation Qwen3.6 line (27B Dense, 35B-A3B) remains, but the 3.7 open weights have only been announced as of this writing, and flagship-grade agent performance sits on the cloud API side.
As a stand-in for when a closed, high-performance model like Fable stops, it’s a close match, but its cloud dependency is strong too.

DeepSeek V4 ships a 1M context as open weights

DeepSeek V4 Preview was released on April 24, so it isn’t post-suspension news.
Even so, lining things up again now, its presence is pretty large.
V4-Pro is 1.6T total parameters, 49B active; V4-Flash is 284B total, 13B active.
Both are 1M context, and both are up on Hugging Face under an MIT license.

In the DeepSeek V4 post, I focused on the CSA/HCA hybrid attention, mHC, and Muon.
The claim is that at 1M context, V4-Pro cuts FLOPs to 27% of DeepSeek-V3.2’s and the KV cache to 10%. That pulls long context back from a catalog figure to an execution-cost story.
If you only use it via API like Claude or Qwen, you don’t have to think about this internal structure, but for self-hosting or on-prem it maps directly onto inference-cost and memory-layout estimates.
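
For estimating, the two quoted ratios can be applied to whatever DeepSeek-V3.2 baseline you have measured. The sketch below uses a made-up baseline (1.0 unit of compute, 100 GB of KV cache per 1M-context session) purely to show the arithmetic. The session multiplier assumes KV cache is the only thing limiting concurrency, which is rarely the whole story.

```python
# Ratios quoted for V4-Pro at 1M context, relative to DeepSeek-V3.2.
FLOPS_RATIO = 0.27
KV_CACHE_RATIO = 0.10


def v4_pro_estimate(v32_flops: float, v32_kv_gb: float) -> dict[str, float]:
    """Scale a DeepSeek-V3.2 baseline by the quoted ratios."""
    return {
        "flops": v32_flops * FLOPS_RATIO,
        "kv_gb": v32_kv_gb * KV_CACHE_RATIO,
        # With a fixed KV-cache budget, this many times more sessions fit
        # (assuming KV cache is the only bottleneck).
        "sessions_multiplier": 1 / KV_CACHE_RATIO,
    }


# Hypothetical baseline, not a measured figure.
estimate = v4_pro_estimate(v32_flops=1.0, v32_kv_gb=100.0)
```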

The problem is size.
V4-Pro is basically impossible on a personal setup, and even V4-Flash is about 160GB with the official FP4+FP8 weights, or over 170GB with the full context. Only once you quantize down to INT4 does it just barely fit into 140-160GB-class memory.
“It’s open weights, so I can immediately use it as a Claude Code replacement” doesn’t follow.
Still, shipping both frontier-grade and lightweight options entirely under MIT means a lot.
When an access-restriction event like Fable’s suspension hits, a model whose weights you can keep on hand is less an evacuation site than insurance.
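
A back-of-the-envelope fit check makes the size problem visible. The footprints are the figures above; the 80GB-per-GPU cards and the 10% headroom for activations and runtime overhead are my assumptions, so swap in your own hardware.

```python
def fits(footprint_gb: float, gpu_gb: float, n_gpus: int, headroom: float = 0.10) -> bool:
    """True if a model footprint fits in pooled GPU memory after reserving headroom."""
    usable_gb = gpu_gb * n_gpus * (1 - headroom)
    return footprint_gb <= usable_gb


# V4-Flash figures from the text; the 80 GB cards are hypothetical.
weights_only_on_two = fits(160, gpu_gb=80, n_gpus=2)    # 144 GB usable
full_context_on_two = fits(170, gpu_gb=80, n_gpus=2)    # 144 GB usable
full_context_on_three = fits(170, gpu_gb=80, n_gpus=3)  # 216 GB usable
```

Under these assumptions two 80GB cards are not enough even for the weights alone, and the full-context case needs a third.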

GLM-5.1 sells staying power on long-horizon tasks

Zhipu AI’s GLM-5.1 was also fairly tuned toward agent operation as of April.
744B total parameters, 40B active, 200K context. Officially it’s labeled 754B including the shared experts.
The official docs emphasize up to 8 hours of autonomous execution on a single task, and the plan / execute / test / fix / deliver loop.

In the GLM-5.1 post, I looked at a case where, on a vector-search optimization task, it ran 600+ iterations and 6,000+ tool calls without performance dropping off. The official general phrasing is “hundreds of rounds and thousands of tool calls.”
It isn’t the type that’s always top on single-shot inference or knowledge QA; what it sells is not breaking down in the middle of long work.
That puts it on the same field as Qwen3.7 Max’s 35-hour autonomous runs and Kimi K2.7 Code’s long-horizon coding.

GLM-5.1’s sweet spot looks less like a Claude Code replacement and more like an upper-level loop that supervises multiple agent jobs.
You throw individual code edits to Claude, Kimi, or Qwen, and GLM-5.1 manages progress and send-backs.
In that setup, what matters is long-context retention and tolerance for iteration, more than single-shot reasoning scores.
For the record, the successor GLM-5.2 dropped this month on June 13 — same 744B MoE, context widened to 1M, and Claude Code compatibility carried over. The MIT open weights are slated to follow, and no benchmarks were public at launch. 5.1 is the version one step before it, released in April.
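
The supervisor arrangement described above can be sketched as a plain loop. This is a shape, not GLM-5.1’s actual API: `worker` stands in for whichever model does the code edit, `check` stands in for the supervisor’s test-and-review step, and both are hypothetical callables you would wire to real backends.

```python
from typing import Callable, Optional


def supervise(
    jobs: list[str],
    worker: Callable[[str, str], str],       # (job, feedback) -> output
    check: Callable[[str], Optional[str]],   # output -> problem, or None if accepted
    max_sendbacks: int = 3,
) -> dict[str, str]:
    """Run each job, sending it back with feedback until it passes or the budget runs out."""
    results: dict[str, str] = {}
    for job in jobs:
        feedback = ""
        for _ in range(max_sendbacks + 1):
            output = worker(job, feedback)
            problem = check(output)
            if problem is None:
                results[job] = output
                break
            feedback = problem  # send-back: the next attempt sees what failed
        else:
            # Send-back budget exhausted without an accepted result.
            results[job] = "FAILED: " + feedback
    return results
```

What the supervisor needs here is the ability to hold many of these loops in context at once and stay coherent across send-backs, which is the staying power GLM-5.1 is selling.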

From April’s release barrage to June’s swap-in race

In April, Qwen3.6-Max-Preview, Kimi K2.6, Xiaomi MiMo-V2.5, Tencent Hy3-preview, Ant Ling-2.6-flash, DeepSeek V4, and GLM-5.1 all landed in roughly the same month.
Back then, you could mostly read them by putting total parameters, active parameters, context length, and the SWE-bench Pro score side by side.

Coming into June, how you read the numbers has changed.
Kimi even publishes the Claude-Code-compatible environment variables.
Qwen shows off 35-hour autonomous execution.
DeepSeek puts a 1M context out as open weights.
GLM pushes 8-hour, thousands-of-tool-call staying power.
All of them lean toward a design that runs long inside a harness rather than being smart in chat.

Putting the comparison axes down roughly, it comes out like this.

| Model | Headline pitch right now | Delivery form | Realism of self-hosting |
| --- | --- | --- | --- |
| Kimi K2.7 Code | Drop-in path into Claude Code, Cline, Roo Code, OpenCode | API (high-speed variant) + Modified MIT weights on Hugging Face | 1T-class MoE is heavy on hand; API-leaning in practice |
| Qwen3.7 Max | 1M context and long-horizon autonomous agent | API-only | Flagship assumes the cloud |
| DeepSeek V4 | 1M-context open weights, two-tier V4-Pro / V4-Flash | Hugging Face, MIT, API | Even Flash is ~160GB class with quantization |
| GLM-5.1 | 8-hour-class long-horizon tasks and iteration tolerance | API, MIT weights | 744B-class is rough for a personal setup |

Just from this table, if you want to try something immediately as an individual, it’s the Kimi K2.7 Code or Qwen3.7 Max API.
If you’re thinking about self-hosting or in-house governance, DeepSeek V4 and GLM-5.1 remain, but the required GPU memory gets heavy fast.
It’s not a “swap it in casually on a local PC” story.

Don’t take the numbers at face value

That said, you can’t take the table’s numbers at face value.
Most of the published benchmarks are vendor self-reported, and third-party reruns are still thin. Calling a model frontier-grade just because SWE-bench ticked up a few points is premature.
Even if they’re distilling from a closed Claude-class model behind the scenes, there’s no way to copy the whole capability set in that short a span. All you can pull from an API is output text, so soft distillation, which trains on logits (the per-token scores behind the output probability distribution), is off the table. At best it’s hard distillation, i.e. imitating generated data, and the volume is limited too. Even if some benchmarked tasks catch up, it remains very possible that an important capability the benchmarks don’t measure is quietly dead.
Jumping in with “this is amazing” off one specific number is just being an AI hype merchant. I’d rather run it on real tasks for a long time and see where it breaks before judging.
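
The soft-versus-hard distinction is easy to see with toy numbers. Below is a minimal sketch with a three-token vocabulary and made-up logits: the soft loss is the KL divergence from the teacher’s full distribution, while the hard loss only sees the single token the API returned. None of this reflects any vendor’s real training setup.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_loss(teacher_logits: list[float], student_logits: list[float]) -> float:
    """KL(teacher || student): needs the teacher's logits."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def hard_loss(teacher_token: int, student_logits: list[float]) -> float:
    """Negative log-likelihood of the one token the teacher emitted."""
    return -math.log(softmax(student_logits)[teacher_token])


# Made-up example: the teacher is nearly undecided between tokens 0 and 1,
# but the API only ever shows that it emitted token 0.
teacher = [2.0, 1.9, -3.0]
overconfident_student = [5.0, 0.0, -3.0]

hard = hard_loss(0, overconfident_student)        # small: looks like a good imitation
soft = soft_loss(teacher, overconfident_student)  # large: the near-tie was lost
```

The overconfident student scores well on the hard loss while having thrown away the teacher’s uncertainty, which is the kind of gap a benchmark number can hide.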

Constraints that remain when using a Claude-compatible harness

Being able to slot into Claude Code or Cline sounds convenient on paper, but it doesn’t mean you get the same behavior as Claude out of the box.
The constraints noted earlier — Kimi’s fixed sampling, required thinking — don’t go away under API compatibility.
In a long session, those differences surface as failures in resume handling or tool calls.

Problems on the Claude Code side remain too.
Even before Fable’s suspension, on Opus 4.8 there were reports of a symptom where court shows up and tool calls hang.
Swapping only the model to Kimi or Qwen doesn’t guarantee the harness compaction, tool-result formatting, MCP-server failure recovery, and billing-cap handling stay the same.

So switching is more realistic to consider per task than swapping everything by model name.
For a short code edit, try Kimi K2.7 Code via Claude Code compatibility.
For long planning and execution, look at Qwen3.7 Max or GLM-5.1’s long-run execution in a separate slot.
If you can’t send data outside, first check whether you have the compute to host DeepSeek V4 Flash or GLM’s open weights.
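
Written down as a routing rule, the three cases above look like this. The `Task` fields and the returned labels are just illustrative names, and the priority order (data residency first, then task length) is my own reading of the trade-off.

```python
from dataclasses import dataclass


@dataclass
class Task:
    long_running: bool             # long planning and execution vs. a short edit
    data_must_stay_internal: bool  # true if nothing can be sent to an outside API


def pick_backend(task: Task) -> str:
    """Choose a backend slot per task instead of swapping everything by model name."""
    if task.data_must_stay_internal:
        # Only viable if the compute to host the weights actually exists.
        return "self-hosted open weights (DeepSeek V4 Flash or GLM)"
    if task.long_running:
        return "Qwen3.7 Max or GLM-5.1 long-run execution"
    return "Kimi K2.7 Code via Claude Code compatibility"
```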

Fable’s suspension exposed supply routes, not model performance

Fable 5, released on June 9, stopped just three days later on June 12 under a US government export-control directive.
What the directive prohibited was use by foreign nationals, but Anthropic can’t separate inside-vs-outside the US in real time, so it stopped access for all customers. Other models like Opus 4.8 kept running as usual.
The trigger is said to be the discovery of a method to bypass Fable’s safeguards; Anthropic counters that it’s “narrow and minor, a misunderstanding.” It wasn’t a question of scores — it was a case of a supply route slamming shut all at once.
Right after, Kimi led with Claude Code compatibility, Qwen extended long-horizon agents API-only, and DeepSeek put open-weight 1M context out there.

The point that’s left here isn’t which model is smarter than Claude.
It’s whether your work stops because of API dependence, breaks because of harness dependence, or stalls on compute even when you hold the weights.
The Fable episode surfaced that question before any performance gap.

References