
Tencent Hy3-preview and Ant Ling-2.6-flash Drop the Same Week, Splitting Chinese Open MoEs at 295B and 104B

Ikesan

In the fourth week of April 2026, two open-weight Chinese MoE LLMs landed back-to-back.
On April 22, Ant Group’s Ant Ling dropped the lightweight, efficiency-focused Ling-2.6-flash (104B/7.4B active). The next day, Tencent’s Hunyuan team released the frontier-class Hy3-preview (295B/21B active) as open weights.

Chinese model releases have been rolling all month: Zhipu AI’s GLM-5.1, Qwen3.6-Max-Preview and Kimi K2.6, and Xiaomi MiMo-V2.5/V2.5-Pro. This week’s pair fills both ends of the spectrum—heavyweight benchmark chaser on one side, lightweight token-efficiency play on the other—and the design intent for each is very different.

Update (2026-04-25, running it locally): Ling-2.6-flash weights aren’t public yet, but the previous-generation Ling-flash-2.0 (100B / 6.1B active, bailing_moe architecture, MIT) already has an MLX MXFP4 build (54.7GB). I ran it through SwiftLM on an M1 Max 64GB — full write-up at Running a Non-Qwen MoE on SwiftLM: Ling-flash-2.0 MXFP4 on M1 Max 64GB. Useful as a baseline for predicting how 2.6-flash will behave locally once those weights ship.

Below, each model on its own.

Tencent Hy3-preview (295B/21B frontier-class open MoE)

On April 23, 2026, Tencent’s Hunyuan team (the X account is still @TencentHunyuan, but the model name and branding have shifted toward “Tencent Hy”) released a new model called “Hy3 preview” as open weights.
The framing is “the first open-source release after the Hunyuan rebuild,” and numerically it corresponds to the Hunyuan 3.0 generation.
The X announcement reads the team name as Tencent Hy /haɪ/—a play on “Hi”—and the model is Hy3-preview. The HuggingFace path uses the same name: tencent/Hy3-preview.

Hy3 preview is the card that says “Tencent itself is putting out a 3.0-generation frontier LLM as open weights too,” in the middle of this run of releases.

Hy3 preview architecture and scale

Cross-referencing the model card and the GitHub repo, the configuration is:

  • Total parameters: 295B
  • Active parameters: 21B (Fine-grained MoE, 192 experts with Top-8 activation)
  • MTP (Multi-Token Prediction) layer: 3.8B (separate from the main body)
  • Transformer layers: 80 (+1 MTP layer)
  • Attention: GQA, 64 heads / 8 KV heads, head dim 128
  • Hidden dim: 4096, FFN intermediate dim: 13312
  • Vocab size: 120,832
  • Context length: up to 256K
  • Precision: BF16 / F32
  • License: custom “Tencent Hy Community License Agreement”

The “295B total / 21B active” balance sits a step below DeepSeek-V3 (671B / 37B active) and Zhipu GLM-5.1 (744B / 40B active), aiming for “frontier-class” performance while keeping inference cost in check.

Where the published benchmarks land

The README compares against DeepSeek-V3 Base and GLM-4.5 Base. The numbers Tencent leads with are aggressive.

  • GSM8K: Hy3 preview 95.37% / DeepSeek-V3 series and GLM-4.5 series in the 88–90% range
  • MATH: 76.28% / DeepSeek-V3 series at 59.37%
  • CRUXEval-I: 71.19% (top of the three)
  • LiveCodeBench-v6: 34.86% (others 27–30%)
  • MMLU-Pro: 65.76% (close to peers)

Tencent also reports strong results on hard STEM benchmarks (FrontierScience-Olympiad, IMOAnswerBench), the Tsinghua Qiuzhen Honors College math PhD prelim (Spring ‘26), and CHSBO 2025 (the Chinese High School Biology Olympiad).
These are Base-vs-Base comparisons, so reading them as final Instruct quality is risky—but at least on academic and competition-style tasks, the trend clearly puts Hy3 preview ahead of the DeepSeek-V3 line.

Agent and reasoning modes

Hy3 preview is billed as a “Reasoning & Agent Model,” and the OpenAI-compatible API exposes a switch:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="hy3-preview",
    messages=[{"role": "user", "content": "..."}],
    temperature=0.9,
    top_p=1.0,
    extra_body={"chat_template_kwargs": {"reasoning_effort": "high"}},
)

  • reasoning_effort: "high" → for hard reasoning tasks (with thinking traces)
  • reasoning_effort: "no_think" → direct response
  • Recommended sampling: temperature=0.9, top_p=1.0

This “thinking on/off as an API arg” pattern lines up with Qwen3.6-35B-A3B’s Hybrid Thinking and the GLM-5.1 series, all clearly aimed at agent workloads. Shipping a single MTP layer with the model, on the assumption of speculative decoding, is also a deliberate fit for vLLM’s spec-decode path.
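The MTP layer's role in speculative decoding is the draft-and-verify loop: a cheap head proposes a few tokens ahead, the main model checks them all in one forward pass, and only the agreeing prefix survives. A toy greedy-case sketch, where both "models" are stand-in deterministic functions rather than anything from Hy3 or vLLM:

```python
# Toy sketch of draft-and-verify speculative decoding (greedy case).
# The "models" here are stand-in deterministic functions, not Hy3 itself.

def draft_next(token: int) -> int:
    # Cheap draft head (MTP-style): guesses the next token.
    return (token * 3 + 1) % 7

def target_next(token: int) -> int:
    # Expensive main model: the ground truth the draft must match.
    # Here it agrees with the draft except when token == 5.
    return 0 if token == 5 else (token * 3 + 1) % 7

def speculative_step(last: int, k: int = 4) -> list[int]:
    """Draft k tokens, then keep the verified prefix plus one
    corrected token from the target model."""
    # 1. Draft k tokens autoregressively with the cheap head.
    drafts, t = [], last
    for _ in range(k):
        t = draft_next(t)
        drafts.append(t)
    # 2. Verify: the target model scores all k positions in one pass.
    accepted, t = [], last
    for d in drafts:
        true_next = target_next(t)
        if d != true_next:
            accepted.append(true_next)  # keep the correction, stop here
            break
        accepted.append(d)
        t = d
    return accepted

print(speculative_step(1, k=4))  # 4 tokens from a single verify pass
```

When the draft head is usually right, each expensive forward pass yields several tokens instead of one, which is exactly the economics the bundled MTP layer is selling.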

Can you run it locally? Basically no

If you take the 295B BF16 weights as-is, the rough sizes are:

Precision          Approx. weight size
BF16               ~590GB
INT8 equivalent    ~295GB
4-bit quantized    ~150–170GB
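Those rough sizes follow directly from bytes-per-parameter; a quick back-of-the-envelope calculator (real quantized checkpoints carry scale/zero-point metadata, which is why the published 4-bit range sits above the naive number):

```python
# Back-of-the-envelope weight sizes for a 295B-parameter model.
# Real quantized checkpoints add scale/zero-point metadata on top,
# so published 4-bit sizes land above the naive estimate.

def weight_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate checkpoint size in GB (1 GB = 1e9 bytes)."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

for label, bits in [("BF16", 16), ("INT8", 8), ("INT4 (naive)", 4)]:
    print(f"{label:14s} ~{weight_gb(295, bits):.0f} GB")
```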

The official README recommends “8 GPUs (H20-3e or higher-memory class),” well outside any consumer setup.

  • 4090 × multiple cards (24GB each): VRAM doesn’t reach the baseline
  • RTX 6000 Ada / Pro class (48–96GB): even 8 cards struggle at BF16; INT4 might fit on paper but the bandwidth/PCIe story without a dedicated fabric is rough in practice
  • H100 80GB × 8: the official target class. Floor for serious deployment
  • Apple Silicon (M3 Ultra 192GB / 256GB): a single machine might handle INT4 once 4-bit weights are properly packaged, but Tencent isn’t shipping prebuilt 4-bit weights yet—you’d have to push it through AngelSlim yourself
  • CPU + lots of RAM: technically possible, but even at 21B active the per-token data movement is too heavy for interactive use

So “touching Hy3 preview itself at home” isn’t a realistic option, at least for now.
The honest paths are vLLM / SGLang on H20-3e or H100 cloud instances, or Tencent Cloud’s hosted inference endpoint.

What local-first users actually look at: the smaller Hunyuan models

Tencent Hunyuan has been shipping smaller open-weight models in the same family for a while. If you’re after local or embedded use, this is the side to look at.

  • Hunyuan-A13B (Tencent-Hunyuan/Hunyuan-A13B)
    80B total / 13B active fine-grained MoE. 256K context. Fast/Slow Thinking switch and strong on agent benchmarks (BFCL-v3, τ-Bench, C3-Bench). A GPTQ-based W4A16 quantization recipe is provided, so you can get a feel for the Hy3-preview-style agent orientation in a more manageable size.
  • Hunyuan-7B Instruct (tencent/Hunyuan-7B-Instruct)
    Dense 7B. Comfortably runs on a 24GB-class consumer GPU or a single Apple Silicon machine.
  • Hunyuan dense small series (0.5B / 1.8B / 4B / 7B)
    Officially announced for automotive, smart home, smartphone, and PC use. They inherit the A13B training recipe, so this is the natural starting point if fine-tuning is on the roadmap.

A simple size sketch:

flowchart LR
  A[Hunyuan 0.5B / 1.8B / 4B Dense] --> B[Hunyuan-7B Instruct]
  B --> C[Hunyuan-A13B<br/>80B total / 13B active MoE]
  C --> D[Hy3 preview<br/>295B total / 21B active MoE]
  A -. local-friendly .-> A
  B -. local-friendly .-> B
  C -. workstation class .-> C
  D -. multi-GPU server / cloud .-> D

It helps to keep Hy3 preview in the same ring as DeepSeek-V3 / GLM-4.5, and Hunyuan-A13B and the 7B dense models in the same ring as DeepSeek-V2-Lite or the smaller Qwen3 locals.

Where it sits in the recent run of Chinese open releases

Lining up the recent open / semi-open releases makes Hy3 preview’s position clearer.

Model                  Total / Active           Context          Release form
DeepSeek-V3 series     671B / 37B               128K             Open weights
Zhipu GLM-5.1          744B / 40B               200K+            Open weights / API
Qwen3.6-Max-Preview    undisclosed (flagship)   long             API first
Kimi K2.6              large (undisclosed)      long             API + partial open
Tencent Hy3 preview    295B / 21B               256K             Open weights
Xiaomi MiMo-V2.5-Pro   undisclosed              1M (omni side)   API only
Qwen3.6-35B-A3B        35B / 3B                 128K+            Open weights (local viable)
  • The largest fully open model is still GLM-5.1; Hy3 preview slots in as the “manageable upper bound” one tier below.
  • The “size-for-performance” pitch lines up with beating DeepSeek-V3 on GSM8K and MATH while staying under half the total parameter count.
  • Against the API-only flagships (Xiaomi MiMo, Kimi K2.6’s top tier), going fully open-weight is the clear differentiator.

The broader Chinese-LLM mood, as covered in the China AI distillation war piece, is “catch up to the top tier through distillation and agent optimization”—Hy3 preview lands cleanly in that pattern.

Ant Ling-2.6-flash (104B/7.4B lightweight MoE pitching “7x token efficiency”)

Ant Group’s AI team Ant Ling released Ling-2.6-flash on April 22, 2026.
The X post (@AntLingAGI) leans on phrases like “1 trillion parameter flagship” and “Fast-Thinking,” and some translations have circulated as “Ling-2.6-1T”—but the actual release is a sparse MoE with 104B total / 7.4B active, branded Ling-2.6-flash. It’s not the flagship; it’s the new lightweight, high-efficiency model in the line.
The 1T-class flagship slot still belongs to Ling-2.5-1T from February. Today’s release is best read as a smaller sibling in the same family.

Ant Group’s wider arc has been: open-source the LingBot-World world model in February 2026, while incrementally pushing the text LLM line from 2.5-1T to 2.6-flash.
With Hy3-preview, Qwen3.6-35B-A3B, and Zhipu GLM-5.1 all leaning into “Fine-grained MoE + agent specialization,” Ling-2.6-flash sits at the smallest caliber of that group.

Ling-2.6-flash architecture

Cross-referencing the public details from Ant Ling, Novita AI, and OpenRouter:

Item                Value
Total parameters    104B
Active parameters   7.4B
Structure           Sparse MoE (256 experts)
Attention           hybrid 1:7 MLA + Lightning Linear
Context length      256K
Vocab size          ~157K
Precision           BF16 / FP8 / INT4 (open-source planned)
Training            Agentic RL (built around agent use)

At 7.4B active, inference sits between Qwen3.6-35B-A3B (35B/3.3B active) and GLM-4.5-Air.
The fine-grained MoE splits into 256 experts, and attention interleaves MLA (Multi-head Latent Attention) with Lightning Linear Attention at a 1:7 ratio, a hybrid design aimed at running 256K context cheaply.

Hybrid linear attention: standard Transformer attention scales quadratically with sequence length, so long-context work blows up in memory and time. Making linear attention the workhorse and sprinkling in conventional attention (here, MLA) keeps cost roughly linear at long context without losing accuracy—that’s the design intent.
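To see why the hybrid matters at 256K, compare the dominant FLOP terms: softmax attention grows with L², linear attention with L (times d²). A rough per-layer estimate under simplified assumptions, ignoring MLA's latent compression and all constant factors; d is a stand-in model dim, not either model's exact config:

```python
# Rough per-layer attention FLOP scaling, constants ignored.
# Softmax attention: scores + weighted sum ~ 2 * L^2 * d
# Linear attention:  running (k x v) state  ~ 2 * L * d^2
# d is an illustrative dim, not Hy3/Ling's actual configuration.

def softmax_attn_flops(L: int, d: int) -> int:
    return 2 * L * L * d

def linear_attn_flops(L: int, d: int) -> int:
    return 2 * L * d * d

d = 4096
for L in (4_096, 65_536, 262_144):  # 4K, 64K, 256K context
    ratio = softmax_attn_flops(L, d) / linear_attn_flops(L, d)
    print(f"L={L:>7}: softmax/linear FLOP ratio ~ {ratio:.0f}x")
```

The ratio is simply L/d: at 256K context the quadratic path costs ~64x the linear one per layer, which is why keeping only 1 layer in 8 as conventional MLA attention pays off.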

What “Fast-Thinking” actually means

“Fast-Thinking” is Ant Ling’s headline marketing term, and in practice it means “get to the answer without unrolling a long thinking trace.”
Reasoning models (the o1 line, Ring-1T, etc.) emit a lot of internal thought before answering, which inflates output token counts and pushes both API spend and latency up. Ling-2.6-flash takes the opposite bet, indexing on intelligence per token.

Ant Ling’s own example: the output tokens consumed to complete the full Artificial Analysis Intelligence Index evaluation come out roughly as follows.

Model              Output tokens for the eval    Relative
Ling-2.6-flash     ~15M                          1.0x
Nemotron-3-Super   ~110M+                        7.3x or more

The Intelligence Index sits at 26, matching or slightly exceeding peers, at roughly one-seventh the token cost, per Ant Ling's measurements.
Cloud LLM pricing is metered on input/output tokens, so this difference translates directly to “same task, same quality, ~1/7 the bill”—a serious gap for agents and high-frequency automation.

Benchmarks lean entirely on agent work

The benchmark numbers the README and various roundups highlight cluster around agent and function-calling evals.

Benchmark                     Ling-2.6-flash   Reference
BFCL-V4 (function calling)    67.04            Nemotron-3-Super 35.12
PinchBench                    81.10            Nemotron-3-Super 73.10
IFBench                       58.10            —
Multi-IF Turn-3               74.85            —
LongBench-v2                  54.80            —
CCAlignBench (Chinese)        7.44             top of class for the size tier
Intelligence Index (AA)       26               +10 over Ling-flash-2.0

A 30+ point lead over Nemotron-3-Super on the Berkeley Function Calling Leaderboard V4 is the cleanest signal: the eval mix is clearly tilted toward tool calling, multi-turn instruction following, and long-context reference work.
On the other hand, Ant Ling concedes that Nemotron-3-Super and Qwen3.5-122B-A10B are ahead on the math-olympiad benches (AIME 2025, MATH-500) and one-shot code benches like LiveCodeBench.
Better to read this as “a model that runs the agent loop fast and cheap” rather than “a model whose day job is hard reasoning.”

340 tok/s throughput on 4 GPUs

Inference performance, on NVIDIA H20 × 4 (tensor parallelism = 4):

  • Peak: ~340 tokens/sec
  • Steady output: 215 tokens/sec
  • Decode throughput at 65K context / 65K output: ~4.38x (normalized to GLM-4.5-Air = 1)
  • Prefill throughput at the same conditions: ~4.68x normalized (Nemotron-3-Super ~2.12x)

Crossing 300 tok/s on 4× H20 while still handling 256K context is genuinely fast for a fine-grained MoE plus linear-attention design, and the size profile favors high-throughput online inference for agent platforms over straight chat.
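What those figures mean at the prompt, using my own arithmetic on Ant Ling's published steady rate:

```python
# Turn the published sustained decode rate (215 tok/s on 4x H20)
# into wall-clock latency for typical output lengths.

steady_tps = 215

def decode_seconds(output_tokens: int, tps: float = steady_tps) -> float:
    return output_tokens / tps

for tokens in (500, 4_000, 65_536):
    print(f"{tokens:>6} output tokens -> ~{decode_seconds(tokens):.1f}s")
```

A 500-token agent step lands in a couple of seconds; even a full 65K output stays around five minutes, which is what makes the high-frequency agent-loop pitch credible.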

Pricing and channels

On the distribution side, both hosted API and open weights are on the roadmap.

  • OpenRouter: registered as inclusionAI/ling-2.6-flash. Both free tier (:free suffix) and paid tier are live
  • Novita AI: BYOK via OpenRouter, or directly through Novita’s own endpoint
  • Alipay Tbox (ling.tbox.cn): Ant Group’s official access point
  • LingDT: commercial brand via Ant Digital Technologies

Paid pricing: $0.10 input / $0.30 output per 1M tokens.
That’s a direct shot at GPT-5 Mini and Kimi K2.6. Multiplied through the token-efficiency claim, the effective cost per equivalent task drops considerably on Ant Ling’s own numbers.
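Putting the pricing and the token-efficiency numbers together: the 15M/110M token counts come from the eval table earlier, but the competitor's per-token price is not published here, so the comparison below assumes the same rate purely for illustration.

```python
# Cost of the full Artificial Analysis eval at Ling-2.6-flash pricing.
# $0.30 per 1M output tokens; ~15M output tokens for the whole run.
# The 110M-token comparison assumes the SAME per-token rate for the
# other model, an illustrative assumption, not its real price.

PRICE_PER_M_OUT = 0.30  # USD per 1M output tokens

def eval_cost(output_tokens_m: float, price: float = PRICE_PER_M_OUT) -> float:
    return output_tokens_m * price

flash = eval_cost(15)    # ~15M tokens
other = eval_cost(110)   # ~110M tokens at the same assumed rate
print(f"Ling-2.6-flash: ${flash:.2f}  vs  same-rate 110M-token run: ${other:.2f}")
print(f"ratio: {other / flash:.1f}x")
```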

A one-week free API window started alongside the launch—the OpenRouter :free endpoint works without any extra key setup.

Trying it via the OpenAI-compatible endpoint

OpenAI-compatible API, so swapping the base URL and key drops it into existing Python code.

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # OpenRouter API key
)

resp = client.chat.completions.create(
    model="inclusionai/ling-2.6-flash:free",  # free tier
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a Python function that returns the n-th Fibonacci number."},
    ],
    stream=True,
)

for chunk in resp:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)

Function calling (tools=[...]), streaming, and structured outputs are all supported, so existing Claude Code-style, LangChain-style, and OpenAI SDK clients work by just swapping the model name.
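A minimal sketch of the tool-calling side. The weather tool and its handler are hypothetical; only the OpenAI-compatible tools schema and SDK shapes are real, and the round-trip function is shown but not executed:

```python
import json

# Hypothetical weather tool: the schema shape is the standard
# OpenAI-compatible "tools" format; get_weather itself is made up.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def execute_tool_call(name: str, arguments: str) -> str:
    """Local dispatcher: run the tool the model asked for and return
    a JSON string to feed back as a 'tool' role message."""
    args = json.loads(arguments)
    if name == "get_weather":
        # Stub result; a real handler would hit a weather API.
        return json.dumps({"city": args["city"], "temp_c": 21})
    raise ValueError(f"unknown tool: {name}")

def run_tool_turn(client, model="inclusionai/ling-2.6-flash:free"):
    # Sketch of one tool round trip with any OpenAI-compatible
    # client (OpenRouter, Novita, ...); not executed here.
    messages = [{"role": "user", "content": "Weather in Hangzhou?"}]
    resp = client.chat.completions.create(
        model=model, messages=messages, tools=TOOLS)
    call = resp.choices[0].message.tool_calls[0]
    messages.append(resp.choices[0].message)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": execute_tool_call(call.function.name,
                                     call.function.arguments),
    })
    return client.chat.completions.create(model=model, messages=messages)
```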

Open-source plans

Weight release is announced but not yet shipped. Ant Ling lists:

  • BF16 full-precision weights
  • FP8 quantized weights
  • INT4 quantized weights
  • Linghe kernels (MoE inference kernels)

No firm date yet, but committing to ship FP8 + INT4 alongside the full weights, plus their own MoE kernels, is a confident move.
Once INT4 lands, the 7.4B-active × 256-expert layout still won’t fit fully on a single consumer GPU, but a 24GB card plus system-RAM offload becomes a reasonable test rig.
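A rough split for that test rig, under a naive 0.5-bytes-per-parameter assumption (real INT4 checkpoints carry scale metadata, so expect somewhat more):

```python
# Naive INT4 footprint for a 104B-parameter model, and how much
# would spill past a 24GB card into system RAM. Quantization scale
# metadata is ignored, so real numbers land a bit higher.

TOTAL_PARAMS_B = 104
VRAM_GB = 24

int4_gb = TOTAL_PARAMS_B * 0.5   # 4 bits = 0.5 bytes per parameter
offload_gb = max(0, int4_gb - VRAM_GB)
print(f"INT4 weights: ~{int4_gb:.0f} GB, spills ~{offload_gb:.0f} GB to system RAM")
```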

Where it sits in the Ling lineup

Mapping the current Ling family, the role split is clean.

Model            Total params   Active   Position
Ling-1T          1T             ≈50B     First-gen non-thinking flagship, October 2025
Ling-flash-2.0   100B           6.1B     Lightweight MoE released in 2025, MIT-licensed (bailing_moe)
Ling-2.5-1T      1T             63B      Current flagship, February 2026, hybrid linear attention
Ring-2.5-1T      1T             —        Thinking-model counterpart, same generation
Ling-2.6-flash   104B           7.4B     High-efficiency agent-focused small MoE, April 2026 (this release)

The flagship side (Ling-2.5-1T / Ring-2.5-1T) is the “think about the world at full scale” tier. Ling-2.6-flash is the “run a lot of cheap production tool calls” tier.
From the user side, this gives a clean split: 2.5-1T-class for one-shot hard problems, 2.6-flash for the per-step grind of agent loops.