
Kimi K2.5: A 1-trillion-parameter MoE native multimodal agent model

Kimi K2.5, released by Moonshot AI on January 27, 2026, is intriguing. It’s a 1-trillion-parameter MoE model with native multimodal support and even parallel agent execution via Agent Swarm. It is open source under the MIT license, and the weights are available on Hugging Face.

Here are the technical points that stood out to me.

Architecture

It’s Transformer–MoE-based; the key specs are as follows.

Total parameters: 1T (one trillion)
Active parameters: 32B
Layers: 61 (including one dense layer)
Experts: 384 + 1 shared
Top-K: 8
Attention hidden dim: 7168
MoE hidden dim (per expert): 2048
Attention heads: 64
Vocabulary size: 160K
Attention: Multi-head Latent Attention (MLA)
Activation: SwiGLU
Context length: 256K

The standout is its Top-8 routing. Whereas many MoE models like Mixtral use Top-2, K2.5 activates eight experts simultaneously. This design favors representational richness over raw throughput, which is said to help with creative generation and nuance detection.
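To make the routing concrete, here is a minimal sketch of Top-k expert selection for a single token. This is illustrative only, not K2.5's actual router: real MoE routers operate on batches and add load-balancing losses, and the shared expert is always active on top of the routed eight.

```python
import numpy as np

def top_k_route(logits, k=8):
    """Pick the k highest-scoring experts for one token and softmax their
    gate weights over just that subset. Toy version of Top-8 routing."""
    idx = np.argsort(logits)[::-1][:k]            # indices of the top-k experts
    gates = np.exp(logits[idx] - logits[idx].max())
    gates /= gates.sum()                          # renormalize over the selected experts
    return idx, gates

# toy router scores for 384 routed experts
rng = np.random.default_rng(0)
logits = rng.normal(size=384)
experts, gates = top_k_route(logits, k=8)
```

With k=8 instead of the more common k=2, each token blends eight expert outputs, which is the representational-richness trade-off described above.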

MoonViT

It ships with a custom 400M-parameter vision encoder, MoonViT. Unlike VLMs that bolt a vision tower onto a pretrained LLM via a connector after the fact, K2.5's 15-trillion-token training corpus mixes vision and text from the start.

Image features are compressed via spatial and temporal pooling and projected into the LLM’s embedding space. It natively handles images, video, and PDFs, and is strong at tasks like generating frontend code from UI design images or extracting workflows from videos.
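The pooling-and-projection step can be sketched as follows. Shapes and the pooling factor are assumptions for illustration; the actual MoonViT pipeline is not public at this level of detail. Only the 7168-dim target (the model's attention hidden dim) comes from the spec table above.

```python
import numpy as np

def pool_and_project(patches, pool, proj):
    """Spatially average-pool an (H, W, D) patch-feature grid, then project
    each pooled token into the LLM embedding space. Hypothetical shapes."""
    H, W, D = patches.shape
    h, w = H // pool, W // pool
    pooled = patches[:h * pool, :w * pool].reshape(h, pool, w, pool, D).mean(axis=(1, 3))
    tokens = pooled.reshape(-1, D)                # flatten the grid into a token sequence
    return tokens @ proj                          # shape (h * w, llm_dim)

rng = np.random.default_rng(1)
feats = rng.normal(size=(16, 16, 1024))           # toy ViT patch grid
W_proj = rng.normal(size=(1024, 7168)) * 0.01     # projection into the 7168-dim model space
toks = pool_and_project(feats, pool=2, proj=W_proj)
```

The point of the pooling stage is token economy: 2×2 pooling cuts 256 patch features down to 64 LLM tokens before they enter the 256K context budget.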

Agent Swarm (PARL)

Agent Swarm is K2.5’s marquee feature. It is trained under a framework called Parallel-Agent Reinforcement Learning (PARL).

Mechanically, an orchestrator agent decomposes a task into parallelizable subtasks, dynamically spawns up to 100 sub-agents, and coordinates execution for up to 1,500 steps. There is no need to hand-craft roles or workflows in advance.
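As a rough mental model of that fan-out, here is a toy orchestrator loop. Everything here is a placeholder: the real Agent Swarm scheduler decomposes tasks with the model itself, spawns sub-agents dynamically, and tracks a step budget, none of which this sketch attempts.

```python
from concurrent.futures import ThreadPoolExecutor

def orchestrate(task, decompose, run_subagent, max_agents=100):
    """Toy orchestrator: split a task into independent subtasks and fan
    them out to sub-agents in parallel. Illustrative only."""
    subtasks = list(decompose(task))[:max_agents]
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        results = list(pool.map(run_subagent, subtasks))  # preserves subtask order
    return results

# hypothetical task: summarize three documents in parallel
docs = ["doc_a", "doc_b", "doc_c"]
out = orchestrate(docs,
                  decompose=lambda t: t,                      # each doc is its own subtask
                  run_subagent=lambda d: f"summary of {d}")   # stand-in for a sub-agent run
```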

Serial Collapse problem

A tricky issue in training parallel orchestrators is “Serial Collapse”: despite having the ability to run in parallel, the orchestrator degenerates into single-agent sequential execution. PARL counters this with staged reward shaping and an annealing coefficient (λaux: 0.1 → 0.0).
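The annealing idea can be written down in a few lines. Only the λaux endpoints (0.1 → 0.0) come from the source; the linear schedule and the exact form of the auxiliary parallelism bonus are assumptions for illustration.

```python
def parl_reward(task_reward, parallelism_bonus, step, total_steps, lam0=0.1):
    """Staged reward shaping sketch: an auxiliary bonus for parallel
    execution whose weight anneals from lam0 to 0 over training, so the
    orchestrator learns to parallelize early but is graded purely on task
    success late. The linear schedule is an assumption, not PARL's exact one."""
    lam = lam0 * max(0.0, 1.0 - step / total_steps)
    return task_reward + lam * parallelism_bonus
```

Early in training the bonus makes parallel rollouts strictly more rewarding than equivalent sequential ones, which is what counteracts Serial Collapse; by the end the shaping term vanishes and cannot distort the final policy.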

Metric: Critical Steps

For Agent Swarm evaluation, they use Critical Steps (latency-oriented) rather than total step count.

Critical Steps = Σ_t ( S_main(t) + max_i S_sub,i(t) )

Because it counts only the slowest subtask among those executed in parallel, it accurately reflects the effect of parallelization.
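A direct translation of the formula, assuming each orchestration round t records the orchestrator's own step count and a list of per-sub-agent step counts (the data layout is my assumption):

```python
def critical_steps(rounds):
    """Critical Steps metric: per round, count the orchestrator's own steps
    plus only the SLOWEST parallel sub-agent, then sum over rounds.
    `rounds` is a list of (main_steps, [sub_agent_steps, ...]) tuples."""
    return sum(main + (max(subs) if subs else 0) for main, subs in rounds)

# two rounds: 2 main steps with sub-agents taking (5, 7, 3), then 1 with (4, 4)
total = critical_steps([(2, [5, 7, 3]), (1, [4, 4])])
# total == 14: parallel work is charged at its longest branch, not the sum
```

Summing total steps instead would give 2+5+7+3+1+4+4 = 26 here, penalizing parallelism the metric is meant to reward.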

Performance

  • Reduces end-to-end runtime by 80%
  • 3–4.5× faster than single-agent execution

Benchmarks

Scores on major benchmarks.

HLE Full (text, with tools): 51.8%
HLE Full (image, with tools): 39.8%
BrowseComp (autonomous web interaction): 60.2%
MMMU-Pro: 78.5%
AIME 2025: 96.1%
SWE-Bench Verified: 76.8%
LiveCodeBench v6: 85.0%

On BrowseComp it sets a new state of the art at 60.2%, surpassing GPT-5 (54.9%). On HLE Full (with tools) it also exceeds GPT-5.2 and Claude 4.5 Opus.

Quantization and local execution

The MoE components use native INT4 quantization via Quantization-Aware Training (QAT). The key point is that 4-bit precision is incorporated during training rather than applied post hoc.
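The essence of QAT is "fake quantization" in the forward pass: weights are snapped to the 4-bit grid during training so the network learns to tolerate the rounding noise. The sketch below shows symmetric per-tensor INT4 fake quantization; K2.5's actual scheme (group sizes, scale handling, which layers are quantized) is not reproduced here.

```python
import numpy as np

def fake_quant_int4(w):
    """QAT-style fake quantization sketch: round weights to a symmetric
    INT4 grid (integer codes in [-8, 7]) and dequantize, so downstream
    compute sees exactly the values 4-bit storage can represent."""
    scale = np.abs(w).max() / 7.0                 # map the weight range onto ±7
    q = np.clip(np.round(w / scale), -8, 7)       # integer codes, at most 16 levels
    return q * scale                              # dequantized weights used in the forward pass

rng = np.random.default_rng(2)
w = rng.normal(size=(4, 4))
wq = fake_quant_int4(w)
```

Because training itself runs through this rounding, the released INT4 weights lose far less quality than post-hoc 4-bit conversion of a model trained in high precision.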

The INT4-quantized model weighs about 595 GB. There are reports of running it by connecting two Mac Studio M3 Ultra machines (512 GB RAM) with MLX’s mx.distributed. Support is also landing in LM Studio and Ollama.

API

An OpenAI/Anthropic-compatible API is provided.

Input: $0.60 per million tokens
Output: $3.00 per million tokens
License: MIT
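With per-million-token pricing, a per-request cost estimate is simple arithmetic. This ignores any caching discounts or tiering, which the source does not mention:

```python
def request_cost(input_tokens, output_tokens, in_price=0.60, out_price=3.00):
    """Estimate one API call's cost in USD from the published
    per-million-token prices ($0.60 input / $3.00 output)."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# a long-context prompt (200k tokens in) with a short answer (4k tokens out)
cost = request_cost(200_000, 4_000)   # 0.12 + 0.012 = $0.132
```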

On kimi.com, Instant/Thinking modes are available for free. It is also available via third-party platforms such as Fireworks AI, OpenRouter, and Together AI.
