Kimi K2.5: A 1-trillion-parameter MoE native multimodal agent model
Kimi K2.5, released by Moonshot AI on January 27, 2026, is intriguing. It’s a 1-trillion-parameter MoE model with native multimodal support and even parallel agent execution via Agent Swarm. It is open source under the MIT license, and the weights are available on Hugging Face.
Here are the technical points that stood out to me.
Architecture
It’s Transformer–MoE-based; the key specs are as follows.
| Item | Value |
|---|---|
| Total parameters | 1T (one trillion) |
| Active parameters | 32B |
| Layers | 61 (including 1 dense layer) |
| Experts | 384 + 1 shared |
| Top-K | 8 |
| Attention hidden dim | 7168 |
| MoE hidden dim (per expert) | 2048 |
| Attention heads | 64 |
| Vocabulary size | 160K |
| Attention | Multi-head Latent Attention (MLA) |
| Activation | SwiGLU |
| Context length | 256K |
The standout is its Top-8 routing. Whereas many MoE models like Mixtral use Top-2, K2.5 activates eight experts simultaneously. This design favors representational richness over raw throughput, which is said to help with creative generation and nuance detection.
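Top-K routing itself is straightforward: a gating network scores all experts per token, keeps the K highest-scoring ones, and renormalizes their weights. A minimal sketch (illustrative dimensions far smaller than K2.5's, and a simple softmax gate as an assumption):

```python
import numpy as np

def topk_route(x, gate_w, k=8):
    """Route a token to its top-k experts via a softmax gate.

    x: (hidden,) token activation; gate_w: (hidden, n_experts) gating weights.
    Returns the chosen expert indices and their renormalized weights.
    """
    logits = x @ gate_w                       # (n_experts,) gating scores
    idx = np.argsort(logits)[-k:][::-1]       # top-k expert indices, best first
    w = np.exp(logits[idx] - logits[idx].max())
    w /= w.sum()                              # renormalize over selected experts
    return idx, w

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
gate_w = rng.standard_normal((64, 384))       # 384 routed experts, as in K2.5
idx, w = topk_route(x, gate_w, k=8)
```

The token's output is then the weighted sum of the eight selected experts' outputs (plus the shared expert), which is why Top-8 buys richer mixtures at the cost of more active compute per token than Top-2.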
MoonViT
It ships with a custom 400M-parameter vision encoder, MoonViT. Unlike VLMs that bolt a vision tower onto an LLM through a connector after the fact, K2.5's 15-trillion-token training corpus mixes vision and text from the start.
Image features are compressed via spatial and temporal pooling and projected into the LLM’s embedding space. It natively handles images, video, and PDFs, and is strong at tasks like generating frontend code from UI design images or extracting workflows from videos.
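The pooling-and-projection step can be sketched as follows. This is a generic average-pooling version under assumed shapes; MoonViT's actual pooling operator and projection details are not spelled out here.

```python
import numpy as np

def pool_and_project(patches, proj, pool=2):
    """Compress vision features by spatial pooling, then project into LLM space.

    patches: (H, W, D) patch features (H and W must be divisible by pool).
    proj: (D, d_model) projection matrix into the LLM's embedding space.
    """
    H, W, D = patches.shape
    # Average each pool x pool neighborhood: (H/pool, W/pool, D)
    pooled = patches.reshape(H // pool, pool, W // pool, pool, D).mean(axis=(1, 3))
    tokens = pooled.reshape(-1, D)            # flatten to a token sequence
    return tokens @ proj                      # visual tokens the LLM can attend to

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 8, 32))       # toy 8x8 grid of 32-dim patches
proj = rng.standard_normal((32, 128))
out = pool_and_project(feats, proj)           # (16, 128): 4x fewer tokens
```

Pooling before projection is what keeps long videos and multi-page PDFs within the 256K context budget: a 2×2 pool alone cuts the visual token count by 4×.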
Agent Swarm (PARL)
Agent Swarm is K2.5’s marquee feature. It is trained under a framework called Parallel-Agent Reinforcement Learning (PARL).
Mechanically, an orchestrator agent decomposes a task into parallelizable subtasks, dynamically spawns up to 100 sub-agents, and coordinates execution for up to 1,500 steps. There is no need to hand-craft roles or workflows in advance.
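Moonshot has not published the orchestrator's internals, but the fan-out pattern it describes can be sketched with ordinary concurrency primitives. Everything below (the `decompose` callable, the sub-agent stub) is hypothetical and purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def run_subtask(task: str) -> str:
    # Stand-in for a sub-agent working one subtask to completion.
    return f"done: {task}"

def orchestrate(task: str, decompose, max_agents: int = 100) -> list[str]:
    """Decompose a task and fan the subtasks out to parallel sub-agents.

    decompose: callable mapping a task to a list of independent subtasks.
    max_agents caps fan-out, mirroring K2.5's reported 100-sub-agent limit.
    """
    subtasks = decompose(task)[:max_agents]
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(run_subtask, subtasks))  # order-preserving

results = orchestrate("survey", lambda t: [f"{t}-{i}" for i in range(4)])
```

The interesting part in PARL is not the fan-out itself but that the decomposition policy is learned, so no roles or workflows need to be authored up front.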
Serial Collapse problem
A tricky issue in training parallel orchestrators is “Serial Collapse”: despite having the ability to run in parallel, the orchestrator degenerates into single-agent sequential execution. PARL counters this with staged reward shaping and an annealing coefficient (λaux: 0.1 → 0.0).
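The reported λaux schedule (0.1 → 0.0) suggests reward shaping of roughly the following form. The linear annealing and the parallelism bonus term here are assumptions for illustration; the paper-level details of PARL's shaping function are not reproduced here.

```python
def parl_reward(r_task: float, n_parallel: int,
                step: int, total_steps: int, lam0: float = 0.1) -> float:
    """Staged reward shaping with an annealed auxiliary parallelism bonus.

    Early in training, lam ~ lam0 rewards spawning parallel sub-agents,
    pushing the policy away from Serial Collapse. lam decays linearly to
    0.0, so the final policy is optimized on the task reward alone.
    """
    lam = lam0 * max(0.0, 1.0 - step / total_steps)
    return r_task + lam * n_parallel

# Early training: parallelism is rewarded. Late training: only the task counts.
early = parl_reward(1.0, n_parallel=4, step=0, total_steps=100)    # 1.4
late = parl_reward(1.0, n_parallel=4, step=100, total_steps=100)   # 1.0
```

Annealing to exactly zero matters: a permanent bonus would bias the converged policy toward spawning agents even when a task is inherently sequential.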
Metric: Critical Steps
For Agent Swarm evaluation, they use a latency-oriented metric called Critical Steps rather than total step count.
CriticalSteps = Σ_t ( S_main(t) + max_i S_sub,i(t) )
Because it counts only the slowest subtask among those executed in parallel, it accurately reflects the effect of parallelization.
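Computing the metric from per-turn step counts is a one-liner; this sketch assumes a simple (main steps, sub-agent steps) record per turn:

```python
def critical_steps(turns) -> int:
    """Critical Steps: per turn, main-agent steps plus the slowest sub-agent.

    turns: list of (s_main, [s_sub_1, ..., s_sub_n]) step counts.
    Sub-agents running in parallel cost only their max, not their sum.
    """
    return sum(s_main + (max(subs) if subs else 0) for s_main, subs in turns)

# Turn 1: 2 main steps, three parallel sub-agents (5, 7, 3 steps);
# only the slowest (7) counts. Turn 2: 1 main step, no sub-agents.
total = critical_steps([(2, [5, 7, 3]), (1, [])])   # 2 + 7 + 1 = 10
```

Under a naive total-step count the same trace would score 2+5+7+3+1 = 18, so the metric rewards exactly what parallelization buys: wall-clock depth, not work volume.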
Performance
- Reduces end-to-end runtime by 80%
- 3–4.5× faster than single-agent execution
Benchmarks
Scores on major benchmarks.
| Benchmark | Score |
|---|---|
| HLE Full (text, with tools) | 51.8% |
| HLE Full (image, with tools) | 39.8% |
| BrowseComp (autonomous web interaction) | 60.2% |
| MMMU-Pro | 78.5% |
| AIME 2025 | 96.1% |
| SWE-Bench Verified | 76.8% |
| LiveCodeBench v6 | 85.0% |
On BrowseComp it sets a new world record at 60.2%, surpassing GPT-5 (54.9%). On HLE Full (with tools) it also exceeds GPT-5.2 and Claude 4.5 Opus.
Quantization and local execution
The MoE components use native INT4 quantization via Quantization-Aware Training (QAT). The key point is that 4-bit precision is incorporated during training rather than applied post hoc.
The INT4-quantized model weighs about 595 GB. There are reports of running it by connecting two Mac Studio M3 Ultra machines (512 GB RAM) with MLX’s mx.distributed. Support is also landing in LM Studio and Ollama.
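For intuition, here is what the INT4 quantizer itself does; QAT differs only in that this rounding is simulated in the forward pass during training so the weights adapt to it. The per-tensor symmetric scheme below is an illustrative assumption (production deployments typically use finer per-group scales):

```python
import numpy as np

def int4_quantize(w: np.ndarray):
    """Symmetric INT4 quantization: map floats to integers in [-8, 7] + a scale."""
    scale = np.abs(w).max() / 7.0             # map the largest magnitude to ±7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def int4_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = int4_quantize(w)
recon_err = float(np.abs(int4_dequantize(q, s) - w).max())
```

Because each value collapses to 4 bits (here stored in int8 for simplicity), MoE expert weights shrink roughly 4× versus FP16, which is how a 1T-parameter model fits in about 595 GB.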
API
An OpenAI/Anthropic-compatible API is provided.
| Item | Value |
|---|---|
| Input | $0.60 per million tokens |
| Output | $3.00 per million tokens |
| License | MIT |
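As a quick sanity check on the pricing table, a small cost estimator (a sketch assuming the published per-million-token prices, exclusive of any caching discounts):

```python
def kimi_cost_usd(input_tokens: int, output_tokens: int,
                  in_price: float = 0.60, out_price: float = 3.00) -> float:
    """Estimate API cost in USD from per-million-token input/output prices."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A long agentic session: 100K input + 10K output tokens costs about $0.09.
cost = kimi_cost_usd(100_000, 10_000)
```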
On kimi.com, Instant/Thinking modes are available for free. It is also available via third-party platforms such as Fireworks AI, OpenRouter, and Together AI.