Can Xiaomi MiMo-V2.5 actually run on a Mac or ROCm?

Ikesan

When Xiaomi MiMo-V2.5 series first shipped API-only, the weights weren’t public yet.
Since then, XiaomiMiMo/MiMo-V2.5 and XiaomiMiMo/MiMo-V2.5-Pro have appeared on Hugging Face.

The question I usually care about is “does it run on a Mac?” and “can I run it on ROCm?”
As of 2026-04-30, you can’t casually run it on a normal Mac or an EVO-X2-class ROCm box.
That said, there’s a MiMo V2.5 PR on llama.cpp’s side, and text-only GGUF inference might land before too long.

What was actually released

The Hugging Face collection has four entries.

| Model | Total params | Active | Context | Positioning |
| --- | --- | --- | --- | --- |
| MiMo-V2.5-Pro | 1.02T | 42B | 1M | Text, agent, coding-focused |
| MiMo-V2.5-Pro-Base | 1.02T | 42B | 256K | Pro’s base |
| MiMo-V2.5 | 310B | 15B | 1M | Text, image, video, audio omnimodal |
| MiMo-V2.5-Base | 310B | 15B | 256K | Standard base |

License is MIT.
That part is strong — commercial use and redistribution have very few constraints, so once a runtime catches up, this matters for the local crowd too.

The catch is size.
According to the Hugging Face API, MiMo-V2.5 is ~310.8B parameters and Pro is ~1.023T.
Both ship with FP8 weights, and even a 4-bit quant would land at roughly 150GB for the standard model and 500GB for Pro.
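To make that concrete, here is the back-of-the-envelope math behind those figures. It only counts weight bytes and ignores KV cache, activations, and quantization-format overhead.

```python
# Rough weight-size estimate: parameter count x bytes per parameter.
# Ignores KV cache, activations, runtime buffers, and quant-format overhead.
GIB = 1024**3

models = {
    "MiMo-V2.5": 310.8e9,       # total params per the Hugging Face API
    "MiMo-V2.5-Pro": 1.023e12,
}

for name, params in models.items():
    fp8_gib = params * 1.0 / GIB   # FP8: ~1 byte per parameter
    q4_gib = params * 0.5 / GIB    # 4-bit quant: ~0.5 bytes per parameter
    print(f"{name}: FP8 ~{fp8_gib:.0f} GiB, 4-bit ~{q4_gib:.0f} GiB")

# MiMo-V2.5:     FP8 ~289 GiB, 4-bit ~145 GiB
# MiMo-V2.5-Pro: FP8 ~953 GiB, 4-bit ~476 GiB
```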

The official deployment recipes are SGLang and vLLM

The deployment section of the model card assumes SGLang and vLLM.
The official example for the standard MiMo-V2.5 uses SGLang with --tp-size 8, --dp-size 2, FP8 quantization, and FlashAttention 3.
The vLLM recipe explicitly says “stable vLLM doesn’t support MiMo V2.5 yet, use the dedicated Docker image” and assumes 4x H200 with TP4.
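For what it’s worth, both SGLang and vLLM expose an OpenAI-compatible HTTP API, so once a server is up, the client side is the easy part. A minimal sketch, assuming a server is already running on localhost:30000 (SGLang’s default port; vLLM defaults to 8000) and serving the model under its repo name:

```python
# Minimal client against the OpenAI-compatible endpoint that SGLang/vLLM expose.
# The port and served model name are assumptions; match them to your launch command.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="XiaomiMiMo/MiMo-V2.5",
    messages=[{"role": "user", "content": "Summarize the MIT license in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```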

In other words, what’s officially supported is data-center GPUs, not personal hardware.
SGLang itself has AMD GPU support, but the official MiMo-V2.5 commands are CUDA / Hopper-leaning, and there’s no equivalent ROCm recipe yet.

So at this point, there’s no way to just run the official commands on my EVO-X2 or M1 Max.

On Mac, llama.cpp is the realistic entry point — not MLX

What has worked well on Mac recently is Qwen3.6-35B-A3B: it ran on an M1 Max 64GB via Ollama, and the same model on MLX was about 2x faster than Ollama.
But that’s 35B-A3B — 35B total with 3B active MoE — so it fits.

MiMo-V2.5 is 310B / 15B-active even for the standard model.
That’s roughly 9x the total parameters of Qwen3.6-35B-A3B, with 5x the active compute.
“It’s an MoE so it’s light” doesn’t carry the same meaning when the base scale is this far apart.

For MLX specifically, I couldn’t find any MLX-converted MiMo-V2.5 model or explicit mlx-lm support yet.
MiMo ships with custom_code, hybrid attention, FP8 weights, MTP, and the omnimodal-side encoders. It’s not the kind of model where someone like Unsloth pushes an MLX 4bit out the door the next day, the way it happens with Qwen releases.

On the other hand, llama.cpp has a MiMo V2.5 PR open.
The PR adds text-inference support for MiMo V2.5 and Pro, but the audio and image components for the standard model are out of scope.
FP8 safetensors → GGUF conversion, TP-aware sharding, and fused attention_qkv handling are also being addressed.

The PR was unmerged as of April 29.
From the comments, conversion and quantization for the standard model are progressing, but Pro is still failing during conversion.
A realistic Mac entry point comes once the PR lands and text-only GGUF quants stabilize.
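If that happens, the Mac side should look like any other GGUF model. A minimal sketch with llama-cpp-python; the file name is a placeholder, no such quant exists yet, and llama-cpp-python would also need a release that picks up the new architecture:

```python
# Hypothetical: loading a text-only MiMo-V2.5 GGUF once llama.cpp support lands.
# The model path is a placeholder; nothing like it is published at the time of writing.
from llama_cpp import Llama

llm = Llama(
    model_path="./mimo-v2.5-q4_k_m.gguf",  # placeholder file name
    n_ctx=8192,         # keep context modest; the advertised 1M is not realistic locally
    n_gpu_layers=-1,    # offload everything to Metal (or ROCm) where it fits
)

out = llm("Explain mixture-of-experts routing in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```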

On ROCm, llama.cpp is more realistic than the official path

For ROCm, the closest reference point is running LLM-jp-4-32B-A3B on ROCm + Strix Halo.
The EVO-X2’s Radeon 8060S handled a 32B-A3B-class model at ~60 tok/s on llama.cpp’s ROCm backend.
That model was Q5_K_M at 24.4GB, with a 65K context, fitting in roughly 25GB total.

MiMo-V2.5 — even the standard model — could land at ~150GB for 4-bit.
The EVO-X2’s 64GB of unified memory simply can’t load it in the first place.
Pro is even further out and falls outside the consumer-ROCm conversation entirely.

If you have multiple AMD Instincts the picture changes.
SGLang has AMD GPU support, and vLLM has a ROCm build.
But the Day 0 recipe for MiMo-V2.5 is H200-first; there’s no guarantee the same parallelism, FP8, attention kernel, and MoE communication translate cleanly to ROCm yet.

For consumer ROCm, waiting for llama.cpp’s GGUF support is more productive than the official SGLang/vLLM path.
Even then, the target is text inference on the standard MiMo-V2.5 — Pro and the omnimodal side are still off the table.

Running it on RunPod

If local won’t fly, that means cloud GPU.
RunPod has comparatively cheap GPUs and a vLLM template.

The vLLM recipe for MiMo-V2.5 wants 4x H200 (TP4) and FP8.
RunPod’s H200 has 141GB VRAM at $3.59/hr per GPU.
Four of them come out to $14.36/hr: about $29 for a 2-hour test, or roughly $115 for an 8-hour working day.

If a 4-GPU H200 Pod isn’t available, here are the alternatives.

| Setup | Total VRAM | Hourly | Notes |
| --- | --- | --- | --- |
| 4x H200 | 564GB | ~$14/hr | Matches the official recipe |
| 4x H100 NVL | 376GB | ~$10/hr | FP8 weights take 310GB, leaves no KV cache headroom |
| 8x H100 SXM | 640GB | ~$22/hr | Supports SGLang’s tp-size 8, but pricey |

4x H100 NVL totals 376GB. Once you load the 310GB FP8 weights, only 66GB remains for KV cache.
A 1M context is out of the question, but it might fit short prompts for sanity checks.
Still, the recipe was validated on H200 — there’s no guarantee FP8 kernels and MoE routing pass cleanly on H100 NVL.
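How much context that 66GB actually buys depends on architecture details I haven’t verified (layer count, KV heads, how the hybrid attention handles the cache), but the standard KV-cache formula shows the shape of the problem. The numbers below are placeholders, not MiMo-V2.5’s real config:

```python
# KV-cache cost per token: 2 (K and V) x n_layers x n_kv_heads x head_dim x bytes.
# PLACEHOLDER architecture values; read the real ones from the model's config.json.
def kv_gb_per_token(n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_value=1):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value / 1e9

headroom_gb = 376 - 310                  # 4x H100 NVL total minus FP8 weights
tokens = headroom_gb / kv_gb_per_token()
print(f"~{kv_gb_per_token() * 1e6:.0f} KB of KV per token")   # ~123 KB with these values
print(f"~{tokens / 1e3:.0f}K tokens of cache fit in {headroom_gb}GB")
```

Even with these optimistic placeholder values the cache falls well short of 1M tokens, and real serving needs activation and scheduler memory on top, so “short prompts for sanity checks” is the right expectation.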

Pro is 1T parameters, ~1TB even at FP8.
Even 8x H200 (1128GB total) is borderline, and it’s not something an individual would casually try on RunPod.

One caveat: stable vLLM still doesn’t support MiMo-V2.5.
The official recipe uses the dedicated Docker image vllm/vllm-openai:mimov25-cu129.
RunPod Pods accept custom Docker images, so you can plug it in directly.

The 310B FP8 weights take tens of minutes to download from Hugging Face.
Inference doesn’t start the moment the Pod boots — that initial download time is also on the meter.
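One way to take the download off the GPU clock is to pre-fetch the weights onto a persistent volume from a cheap CPU Pod first, then attach it to the GPU Pod. A minimal sketch with huggingface_hub; the target path is just an example:

```python
# Pre-download the FP8 weights to persistent storage so the GPU Pod isn't
# billing $14/hr while ~310GB of safetensors trickles in from Hugging Face.
# The local_dir is an example; point it at your network volume mount.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="XiaomiMiMo/MiMo-V2.5",
    local_dir="/workspace/models/MiMo-V2.5",
    max_workers=8,   # parallel file downloads
)
```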

If your goal is “just touch it once” or “run it in my own environment instead of via API,” renting a 4x H200 for a few hours is the fastest path.

Is Google Cloud cheaper?

I quoted RunPod’s 4x H200 at $14.36/hr. How does that compare to Google Cloud (GCE) GPU instances?

GCE A3 instances start at 8 GPUs — there’s no 4-GPU option.
That said, on-demand 8x H100 (a3-highgpu-8g) is about $10.42/hr with 640GB total VRAM, so it’s both cheaper and bigger than RunPod’s 4x H200 ($14.36/hr, 564GB).

| Setup | Total VRAM | Hourly | Notes |
| --- | --- | --- | --- |
| RunPod 4x H200 | 564GB | ~$14/hr | Custom Docker, fast boot |
| GCE 8x H100 (a3-highgpu-8g) | 640GB | ~$10/hr | On-demand, cheapest US region |
| GCE 8x H100 Spot | 640GB | ~$1.6/hr | Spot, can get preempted any time |
| GCE 8x H200 (a3-ultragpu-8g) | 1128GB | ~$14/hr | On-demand |
| GCE 8x H200 Spot | 1128GB | ~$3.2/hr | Spot |

Spot is dramatically cheaper.
8x H100 Spot is $1.62/hr, so 2 hours costs about $3.2.
RunPod’s 4x H200 for the same 2 hours is $29, so Spot lands at roughly a tenth of the cost.
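Put differently, here is what the same 2-hour sanity check costs at each of the rates above (rough numbers; storage, egress, and the weight download discussed earlier are extra):

```python
# Cost of a 2-hour test at the hourly rates quoted above.
# Ignores storage, egress, and the weight-download time, which are also on the meter.
hourly_rates = {
    "RunPod 4x H200": 14.36,
    "GCE 8x H100 on-demand (a3-highgpu-8g)": 10.42,
    "GCE 8x H100 Spot": 1.62,
    "GCE 8x H200 on-demand (a3-ultragpu-8g)": 14.00,
    "GCE 8x H200 Spot": 3.20,
}

HOURS = 2
for setup, rate in sorted(hourly_rates.items(), key=lambda kv: kv[1]):
    print(f"{setup}: ${rate * HOURS:.2f} for {HOURS}h")
```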

But Spot can be preempted at any moment.
That’s fine for workloads that don’t care if they get killed mid-run — benchmarks, short prompt tests — but not for long-horizon agent runs or production serving.

The other wall is A3 instance availability.
GCE H100/H200 instances need a quota request; you can’t just sign up and rent one immediately.
RunPod, by contrast, lets you spin up a Pod the moment you sign up and load credit, assuming the SKU is in stock.
That “no application required, just go” gap matters more than it sounds.

If pure cost matters, GCE Spot wins easily.
If you want to “just touch it once right now,” RunPod is faster.
If you’re going to be experimenting repeatedly across multiple days, getting GCE quota approved is worth it.

Local-friendly alternatives

You can rent cloud GPUs to actually run MiMo-V2.5.
But if your goal is just “play with omnimodality or MoE locally,” other models running on your own machine are more pragmatic.

| Goal | Current pick |
| --- | --- |
| Local LLM on M1 Max 64GB | Qwen3.6-35B-A3B |
| Speed-focused on Mac | Qwen3.6-35B-A3B MLX 4bit |
| ROCm + Strix Halo MoE | LLM-jp-4-32B-A3B |
| AMD mini-PC chat | EVO-X2 + LM Studio |

To follow MiMo-V2.5 itself, the relevant places are roughly these.

| Where to check | What you’ll find |
| --- | --- |
| Hugging Face collection | Official models, update dates, license, file layout |
| vLLM MiMo-V2.5 recipe | Stable support status, GPU requirements, dedicated Docker |
| llama.cpp PR #22493 | GGUF conversion, the Mac/ROCm entry point, Pro support progress |