
Xiaomi ships MiMo-V2.5 and MiMo-V2.5-Pro together — 1M omnimodal and 1,000-tool agent, API only

Ikesan

On April 22, Xiaomi released two MiMo-series models into public beta simultaneously: “MiMo-V2.5-Pro” and “MiMo-V2.5”.
Previously the MiMo lineup centered on V2-Flash (open-weight 309B MoE) and V2-Pro (API-only). With the V2.5 generation, Pro was pushed into frontier territory while the standard model gained native omnimodality and a 1M context window.

This lands one day after Qwen3.6-Max-Preview and Kimi K2.6 dropped together, which means three Chinese 1T-class flagships have all shipped in the fourth week of April.

How the two models are positioned

Within the same generation, the Pro and the standard model clearly target different layers.

```mermaid
flowchart LR
  M[MiMo-V2.5 series] --> M1[MiMo-V2.5-Pro<br/>flagship<br/>long-horizon agent focus]
  M --> M2[MiMo-V2.5<br/>native omnimodal<br/>1M context]
  M1 -.~2x cost.-> P[API usage-based<br/>2x multiplier]
  M2 -.~half cost.-> Q[API usage-based<br/>1x multiplier]
```

Pro’s pitch: “an agent model that matches Claude Opus 4.6 and GPT-5.4 on performance, using fewer tokens.”
The standard MiMo-V2.5 is positioned as “Pro-class general agent performance in a single model that also handles images, video, and audio.”
Compared with V2-Pro, this is a generational upgrade that delivers both a Pro successor (better performance, better token efficiency) and an omnimodal sibling at the standard tier.

Xiaomi’s own framing: Pro is “our strongest model ever,” and the standard tier is “Pro-level agent performance at roughly half the cost.”
Pricing is a token-credit system with a fixed multiplier: Pro = 2x, standard = 1x.
There is no tiered pricing by context length — the multiplier stays flat even if you fill up the full 1M.
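The flat multiplier is simple enough to sketch. A minimal illustration of the billing model described above, where `BASE_RATE` is a hypothetical credit price per 1K tokens (Xiaomi has not published the actual rate):

```python
# Flat-multiplier billing as described: no tiering by context length.
# BASE_RATE is illustrative, not a published figure.
BASE_RATE = 1.0  # credits per 1K tokens (hypothetical)

MULTIPLIER = {"mimo-v2.5-pro": 2.0, "mimo-v2.5": 1.0}

def credits(model: str, tokens: int) -> float:
    """Billed credits: tokens x base rate x the model's fixed multiplier."""
    return tokens / 1000 * BASE_RATE * MULTIPLIER[model]

# Filling the full 1M context costs exactly 2x the standard model, never more.
print(credits("mimo-v2.5-pro", 1_000_000))  # 2000.0
print(credits("mimo-v2.5", 1_000_000))      # 1000.0
```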

Core specs

Here is what’s currently public, pulling from official material and third-party reporting.
Xiaomi did not disclose much about parameter counts or architectural details, so some numbers had to be sourced via OpenRouter and MarkTechPost coverage.

| Item | MiMo-V2.5-Pro | MiMo-V2.5 |
| --- | --- | --- |
| Total parameters | 1T (per reports) | Undisclosed |
| Active parameters | 42B/token (per reports) | Undisclosed |
| Architecture | Sparse MoE (details undisclosed) | Sparse MoE + native omnimodal |
| Modalities | Text-focused | Text + image + video + audio (native) |
| Context length | Undisclosed (effectively long-context) | 1,000,000 (1M) tokens |
| License | Closed, API-only | Closed, API-only |
| Price multiplier | 2x | 1x |
| Released | 2026-04-22 public beta | 2026-04-22 public beta |

Pro’s 1T / 42B-active figure lands in the same range as same-day peers Kimi K2.6 (1T / 32B active) and Zhipu GLM-5.1 (744B / 40B active).
“1T MoE, 30–40B active” is becoming the standard shape for Chinese flagships as of spring 2026.

As for the standard MiMo-V2.5’s “native omnimodal” claim: Xiaomi says it did not bolt a vision encoder onto a text model, but instead trained on image, video, and audio from the start. The Video-MME 87.7 and CharXiv RQ 81.0 scores below should be read with that in mind.

Benchmarks

The main published numbers, side by side.

| Benchmark | MiMo-V2.5-Pro | MiMo-V2.5 | Ref: Claude Opus 4.6 | Ref: GPT-5.4 |
| --- | --- | --- | --- | --- |
| SWE-bench Pro | 57.2 | — | comparable | comparable |
| Claw-Eval | 63.8 | 62.3 | comparable | comparable |
| τ3-Bench | 72.9 | — | comparable | comparable |
| Video-MME | — | 87.7 | — | — |
| CharXiv RQ | — | 81.0 | — | — |
| MMMU-Pro | — | 77.9 | — | — |

Pro’s SWE-bench Pro 57.2 sits in the same band as the April 21 releases — Qwen3.6-Max-Preview (57.30) and Kimi K2.6 (58.6).
In a single week, Chinese 1T-class MoEs have all bunched up around SWE-bench Pro 57–59.

The interesting data point is the standard MiMo-V2.5’s Claw-Eval score of 62.3, only 1.5 points behind Pro’s 63.8.
A text-only Pro and a natively-omnimodal standard variant being this close on agent performance is unusual.
Many designs lose text performance when image/video/audio understanding is bolted on, so Xiaomi’s base training recipe is likely different here.

Matching scores on Claw-Eval with 40–60% fewer tokens

More important than the raw benchmark scores is the token-efficiency story.
Per Xiaomi’s own description, Pro “hits 64% Pass³ on Claw-Eval at roughly 70K tokens per trajectory.”

These are agent-evaluation-specific metrics.
Pass³ is the rate of solving the same task three times in a row — harder than Pass¹ (just solve it once).
Trajectory is the full “thinking + tool calls + observations” bundle for finishing one task; here, the whole thing stays under ~70K tokens per task.
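There is no official code for these metrics; as a rough illustration of the definition above (assuming independent attempts, with `solve` and `tasks` as hypothetical stand-ins for a real harness):

```python
def pass_k_consecutive(solve, task, k: int = 3) -> bool:
    """Pass^k as described above: the same task must be solved k times in a row."""
    return all(solve(task) for _ in range(k))

def pass3_rate(solve, tasks) -> float:
    """Fraction of tasks cleared three consecutive times (Pass^3)."""
    return sum(pass_k_consecutive(solve, t, 3) for t in tasks) / len(tasks)

# With per-attempt success rate p, Pass^1 = p but Pass^3 = p^3:
# p = 0.8 gives Pass^3 = 0.512, which is why Pass^3 is the harder bar.
```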

Versus comparable-score peers like Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4, Pro claims it reaches the same outcome with 40–60% fewer tokens per trajectory.

In real-world agent use, “tokens per task” maps directly onto the API bill.
Solving SWE-bench or Claw-Eval with the same score but at roughly half the tokens means the effective unit cost drops to about half.
Pro’s main commercial weapon is not the listed per-token price — it’s total cost per task.
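The arithmetic behind “half the effective unit cost” is worth spelling out. A back-of-envelope sketch, where `PRICE` is an arbitrary per-1K-token rate applied to both models to isolate the efficiency effect (the 70K figure is Xiaomi’s claimed trajectory size; the peer’s 2x token count corresponds to the midpoint of the claimed 40–60% savings):

```python
PRICE = 1.0  # arbitrary unit per 1K tokens (hypothetical)

def cost_per_task(tokens_per_trajectory: float, price_multiplier: float = 1.0) -> float:
    """Effective unit cost: tokens spent on one task x the per-token rate."""
    return tokens_per_trajectory / 1000 * PRICE * price_multiplier

pro = cost_per_task(70_000)         # ~70K tokens per Claw-Eval trajectory (claimed)
peer = cost_per_task(140_000)       # a peer needing 2x the tokens for the same score

print(pro / peer)  # 0.5 — same outcome, half the effective cost per task
```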

Long-horizon tasks with 1,000+ tool calls

Pro’s other headline claim is long-horizon task handling: “can autonomously complete professional tasks involving 1,000+ tool calls.”
The target use case is a task that runs for hours to days — something a human specialist would spend several days on, completed in a single run.
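A 1,000-plus-call run mostly constrains the shape of the client-side loop. A minimal sketch of what that loop looks like, where `call_model` and `run_tool` are hypothetical placeholders rather than Xiaomi’s API: nothing bounds the run except the model declaring itself done or a safety budget:

```python
def run_agent(task: str, call_model, run_tool, max_tool_calls: int = 1500):
    """Drive one long-horizon run: loop until the model stops calling tools."""
    messages = [{"role": "user", "content": task}]
    for n in range(max_tool_calls):
        reply = call_model(messages)           # one model turn (placeholder)
        if reply.get("tool_call") is None:     # model declares the task done
            return reply["content"], n
        result = run_tool(reply["tool_call"])  # execute, then feed back the observation
        messages.append({"role": "assistant", "content": reply["content"]})
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("tool-call budget exhausted before the task finished")
```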

This overlaps heavily with Zhipu GLM-5.1’s pitch (no degradation over 8 hours / 6,000 tool calls) from last month.
Where GLM-5.1 emphasizes “doesn’t degrade over long runs,” MiMo-V2.5-Pro frames it as “long-horizon task × token efficiency” — the same long-horizon agent direction pushed further toward commercial deployment.

Put the two specs next to each other:

  • GLM-5.1: 744B / 40B active, SWE-bench Pro 58.4, no degradation across 6,000+ tool calls
  • MiMo-V2.5-Pro: 1T / 42B active, SWE-bench Pro 57.2, autonomous completion of 1,000+ tool calls, 40–60% fewer tokens

SWE-bench Pro is almost identical between the two; Pro differentiates instead on token efficiency and on the broader agent benchmarks (τ3-Bench, Claw-Eval).

Omnimodality in the standard MiMo-V2.5

The standard model’s pitch: “Pro-equivalent agent performance at half the cost, plus native omnimodality.”
Very few omnimodal models handle text, image, video, and audio in a single model while also shipping a native 1M context.

| Model | Modalities | Context length | License |
| --- | --- | --- | --- |
| MiMo-V2.5 | Text + image + video + audio | 1,000,000 | Closed, API-only |
| Qwen3-Omni (30B / 3.3B active) | Text + image + video + audio | 262,144 | Open-weight |
| Gemma 4 26B A4B | Text + image + audio (audio on E4B) | 262,144 | Apache 2.0 |

Qwen3-Omni is the open-weight entry in the same omnimodal track, capped at 256K context.
MiMo-V2.5 extends that to 1M while keeping omnimodality, which is what lets it fit “video + bulk logs + long-horizon agent” into a single context.
Conversely, Qwen3-Omni’s advantage is running locally. The two split cleanly along use case.

The CharXiv RQ 81.0, MMMU-Pro 77.9, and Video-MME 87.7 scores are all strong for a standard-tier model — solid on image reasoning, paper-table parsing, and video understanding.
Video-MME 87.7 is in the same range as Gemini 3 series and Qwen3-Omni, or slightly better.

You can’t run it locally right now

Bluntly: as of 2026-04-23, neither MiMo-V2.5-Pro nor MiMo-V2.5 can be run locally.
Both are offered as public beta through Xiaomi’s AI Studio and API Platform, and the weights have not been released.
The official page says “Coming Open Source — Stay tuned,” but there is no date.

The XiaomiMiMo Hugging Face org currently has no V2.5 repos. The previous generation, however, has been published under MIT:

  • MiMo-V2-Flash — 309B MoE / 15B active, FP8 inference, SGLang recommended, runs on consumer hardware with KTransformers CPU offloading
  • MiMo-V2-Flash-Base — the base model
  • MiMo-VL-7B-SFT-2508 / MiMo-VL-7B-RL-2508 — 8B-class image-text models
  • MiMo-Audio-7B-Base / -Instruct — 8B-class audio I/O models
  • MiMo-Embodied-7B — 8B-class agent-oriented image-text model

So the realistic option for running MiMo locally is still V2-Flash, not V2.5.

What running it at home would look like

Even if V2.5 were released open source, running a 1T / 42B-active model at home is rough.
Like Kimi K2.6, the VRAM budget narrows the options to:

| Setup | Target hardware | Feasibility |
| --- | --- | --- |
| Raw FP16 | 2TB+ VRAM-equivalent | Impossible for individuals |
| FP8 | 1TB+ VRAM-equivalent | Tough for individuals |
| INT4 / GGUF | 500GB+ VRAM-equivalent | Data-center scale |
| KTransformers CPU/GPU offloading | Lots of DDR + single GPU | Possible on consumer hardware, but slow |
| Cloud API | None | The only real option right now |
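The VRAM figures above follow from simple arithmetic: weight memory is parameter count times bytes per parameter, and that is before KV cache, activations, and framework overhead. A quick sanity check using the reported 1T total parameter count:

```python
# Weights-only memory estimate; KV cache and activations come on top.
PARAMS = 1e12  # 1T total parameters (reported figure)

def weight_gb(params: float, bytes_per_param: float) -> float:
    """Gigabytes needed just to hold the weights at a given precision."""
    return params * bytes_per_param / 1e9

print(weight_gb(PARAMS, 2.0))   # FP16 -> 2000.0 GB (~2TB)
print(weight_gb(PARAMS, 1.0))   # FP8  -> 1000.0 GB (~1TB)
print(weight_gb(PARAMS, 0.5))   # INT4 -> 500.0 GB
```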

For individual use, the pragmatic split is: “V2.5-Pro via cloud API for now, local on V2-Flash or the Qwen3.6-35B family.”
Running Qwen3.6-35B-A3B on an M1 Max 64GB, or the Qwen3.6-27B Dense vs 35B-A3B MoE comparison on MLX/Ollama, is about the realistic ceiling for running flagship-class agent models on M1/M2 Max-class personal hardware.

API availability

Both models are served via Xiaomi’s AI Studio (web UI) and API Platform (OpenAI-compatible endpoint).
MiMo-V2.5-Pro also routes through OpenRouter under the model ID xiaomi/mimo-v2.5-pro.
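Since the endpoint is OpenAI-compatible and the OpenRouter ID is published, a call can be sketched with nothing but the standard library. Only the model ID comes from the article; the endpoint URL and header shape are standard OpenRouter conventions, and `OPENROUTER_API_KEY` is your own key:

```python
import json
import os
import urllib.request

def build_request(prompt: str) -> dict:
    """OpenAI-compatible chat payload using the published OpenRouter model ID."""
    return {
        "model": "xiaomi/mimo-v2.5-pro",
        "messages": [{"role": "user", "content": prompt}],
    }

def call_openrouter(prompt: str) -> str:
    """POST to OpenRouter's chat completions endpoint and return the reply text."""
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(call_openrouter("Summarize this repo's failing tests."))
```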

Position vs. Claude / GPT

Xiaomi officially claims Pro “matches Claude Opus 4.6 and GPT-5.4 on major benchmarks.”
Anthropic has since moved Claude Opus 4.7 with x-high self-verify, so a generation above Opus 4.6 is already out — and the gap numbers are likely to shift again.

What MiMo-V2.5-Pro is actually going for is less “keep catching up to the frontier” and more “deliver the same performance tier using fewer tokens.” It is not trying to be the top of the leaderboard.
Rather than chasing GPT-5.4 or Claude Opus 4.7, Pro is positioned to deliberately court customers who want to run “SWE-bench Pro 57-tier agents for roughly half the API bill.”