
Kimi K2.5: A 1-trillion-parameter MoE native multimodal agent model

Kimi K2.5, released by Moonshot AI on January 27, 2026, is intriguing. It’s a 1-trillion-parameter MoE model with native multimodal support and even parallel agent execution via Agent Swarm. It is open source under the MIT license, and the weights are available on Hugging Face.

Here are the technical points that stood out to me.

Architecture

It’s Transformer–MoE-based; the key specs are as follows.

Total parameters: 1T (one trillion)
Active parameters: 32B
Layers: 61 (including one dense layer)
Experts: 384 + 1 shared
Top-K: 8
Attention hidden dim: 7168
MoE hidden dim (per expert): 2048
Attention heads: 64
Vocabulary size: 160K
Attention: Multi-head Latent Attention (MLA)
Activation: SwiGLU
Context length: 256K

The standout is its Top-8 routing. Whereas many MoE models like Mixtral use Top-2, K2.5 activates eight experts simultaneously. This design favors representational richness over raw throughput, which is said to help with creative generation and nuance detection.
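To make the routing concrete, here is a minimal sketch of Top-k expert selection for a single token. This is illustrative only, not K2.5's actual router: real MoE routers operate on batches and add load-balancing losses, and the shared expert is always active on top of the routed eight.

```python
import numpy as np

def top_k_route(logits, k=8):
    """Pick the k highest-scoring experts for one token and softmax their
    gate weights over just that subset. Toy version of Top-8 routing."""
    idx = np.argsort(logits)[::-1][:k]            # indices of the top-k experts
    gates = np.exp(logits[idx] - logits[idx].max())
    gates /= gates.sum()                          # renormalize over the selected experts
    return idx, gates

# toy router scores for 384 routed experts
rng = np.random.default_rng(0)
logits = rng.normal(size=384)
experts, gates = top_k_route(logits, k=8)
```

With k=8 instead of the more common k=2, each token blends eight expert outputs, which is the representational-richness trade-off described above.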

MoonViT

It ships with a custom 400M-parameter vision encoder, MoonViT. Unlike VLMs that bolt a vision tower onto a pretrained LLM via a connector after the fact, K2.5's 15-trillion-token training corpus mixes vision and text from the start.

Image features are compressed via spatial and temporal pooling and projected into the LLM’s embedding space. It natively handles images, video, and PDFs, and is strong at tasks like generating frontend code from UI design images or extracting workflows from videos.
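The pooling-and-projection step can be sketched as follows. Shapes and the pooling factor are assumptions for illustration; the actual MoonViT pipeline is not public at this level of detail. Only the 7168-dim target (the model's attention hidden dim) comes from the spec table above.

```python
import numpy as np

def pool_and_project(patches, pool, proj):
    """Spatially average-pool an (H, W, D) patch-feature grid, then project
    each pooled token into the LLM embedding space. Hypothetical shapes."""
    H, W, D = patches.shape
    h, w = H // pool, W // pool
    pooled = patches[:h * pool, :w * pool].reshape(h, pool, w, pool, D).mean(axis=(1, 3))
    tokens = pooled.reshape(-1, D)                # flatten the grid into a token sequence
    return tokens @ proj                          # shape (h * w, llm_dim)

rng = np.random.default_rng(1)
feats = rng.normal(size=(16, 16, 1024))           # toy ViT patch grid
W_proj = rng.normal(size=(1024, 7168)) * 0.01     # projection into the 7168-dim model space
toks = pool_and_project(feats, pool=2, proj=W_proj)
```

The point of the pooling stage is token economy: 2×2 pooling cuts 256 patch features down to 64 LLM tokens before they enter the 256K context budget.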

Agent Swarm (PARL)

Agent Swarm is K2.5’s marquee feature. It is trained under a framework called Parallel-Agent Reinforcement Learning (PARL).

Mechanically, an orchestrator agent decomposes a task into parallelizable subtasks, dynamically spawns up to 100 sub-agents, and coordinates execution for up to 1,500 steps. There is no need to hand-craft roles or workflows in advance.
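As a rough mental model of that fan-out, here is a toy orchestrator loop. Everything here is a placeholder: the real Agent Swarm scheduler decomposes tasks with the model itself, spawns sub-agents dynamically, and tracks a step budget, none of which this sketch attempts.

```python
from concurrent.futures import ThreadPoolExecutor

def orchestrate(task, decompose, run_subagent, max_agents=100):
    """Toy orchestrator: split a task into independent subtasks and fan
    them out to sub-agents in parallel. Illustrative only."""
    subtasks = list(decompose(task))[:max_agents]
    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        results = list(pool.map(run_subagent, subtasks))  # preserves subtask order
    return results

# hypothetical task: summarize three documents in parallel
docs = ["doc_a", "doc_b", "doc_c"]
out = orchestrate(docs,
                  decompose=lambda t: t,                      # each doc is its own subtask
                  run_subagent=lambda d: f"summary of {d}")   # stand-in for a sub-agent run
```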

Serial Collapse problem

A tricky issue in training parallel orchestrators is “Serial Collapse”: despite having the ability to run in parallel, the orchestrator degenerates into single-agent sequential execution. PARL counters this with staged reward shaping and an annealing coefficient (λaux: 0.1 → 0.0).
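The annealing idea can be written down in a few lines. Only the λaux endpoints (0.1 → 0.0) come from the source; the linear schedule and the exact form of the auxiliary parallelism bonus are assumptions for illustration.

```python
def parl_reward(task_reward, parallelism_bonus, step, total_steps, lam0=0.1):
    """Staged reward shaping sketch: an auxiliary bonus for parallel
    execution whose weight anneals from lam0 to 0 over training, so the
    orchestrator learns to parallelize early but is graded purely on task
    success late. The linear schedule is an assumption, not PARL's exact one."""
    lam = lam0 * max(0.0, 1.0 - step / total_steps)
    return task_reward + lam * parallelism_bonus
```

Early in training the bonus makes parallel rollouts strictly more rewarding than equivalent sequential ones, which is what counteracts Serial Collapse; by the end the shaping term vanishes and cannot distort the final policy.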

Metric: Critical Steps

For Agent Swarm evaluation, they use Critical Steps (latency-oriented) rather than total step count.

Critical Steps = Σ_t ( S_main(t) + max_i S_sub,i(t) )

Because it counts only the slowest subtask among those executed in parallel, it accurately reflects the effect of parallelization.
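A direct translation of the formula, assuming each orchestration round t records the orchestrator's own step count and a list of per-sub-agent step counts (the data layout is my assumption):

```python
def critical_steps(rounds):
    """Critical Steps metric: per round, count the orchestrator's own steps
    plus only the SLOWEST parallel sub-agent, then sum over rounds.
    `rounds` is a list of (main_steps, [sub_agent_steps, ...]) tuples."""
    return sum(main + (max(subs) if subs else 0) for main, subs in rounds)

# two rounds: 2 main steps with sub-agents taking (5, 7, 3), then 1 with (4, 4)
total = critical_steps([(2, [5, 7, 3]), (1, [4, 4])])
# total == 14: parallel work is charged at its longest branch, not the sum
```

Summing total steps instead would give 2+5+7+3+1+4+4 = 26 here, penalizing parallelism the metric is meant to reward.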

Performance

  • Reduces end-to-end runtime by 80%
  • 3–4.5× faster than single-agent execution

Benchmarks

Scores on major benchmarks.

HLE Full (text, with tools): 51.8%
HLE Full (image, with tools): 39.8%
BrowseComp (autonomous web interaction): 60.2%
MMMU-Pro: 78.5%
AIME 2025: 96.1%
SWE-Bench Verified: 76.8%
LiveCodeBench v6: 85.0%

On BrowseComp it sets a new state of the art at 60.2%, surpassing GPT-5 (54.9%). On HLE Full (with tools) it also exceeds GPT-5.2 and Claude 4.5 Opus.

Quantization and local execution

The MoE components use native INT4 quantization via Quantization-Aware Training (QAT). The key point is that 4-bit precision is incorporated during training rather than applied post hoc.
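The essence of QAT is "fake quantization" in the forward pass: weights are snapped to the 4-bit grid during training so the network learns to tolerate the rounding noise. The sketch below shows symmetric per-tensor INT4 fake quantization; K2.5's actual scheme (group sizes, scale handling, which layers are quantized) is not reproduced here.

```python
import numpy as np

def fake_quant_int4(w):
    """QAT-style fake quantization sketch: round weights to a symmetric
    INT4 grid (integer codes in [-8, 7]) and dequantize, so downstream
    compute sees exactly the values 4-bit storage can represent."""
    scale = np.abs(w).max() / 7.0                 # map the weight range onto ±7
    q = np.clip(np.round(w / scale), -8, 7)       # integer codes, at most 16 levels
    return q * scale                              # dequantized weights used in the forward pass

rng = np.random.default_rng(2)
w = rng.normal(size=(4, 4))
wq = fake_quant_int4(w)
```

Because training itself runs through this rounding, the released INT4 weights lose far less quality than post-hoc 4-bit conversion of a model trained in high precision.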

The INT4-quantized model weighs about 595 GB. There are reports of running it by connecting two Mac Studio M3 Ultra machines (512 GB RAM) with MLX’s mx.distributed. Support is also landing in LM Studio and Ollama.

API

An OpenAI/Anthropic-compatible API is provided.

Input: $0.60 per million tokens
Output: $3.00 per million tokens
License: MIT
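With per-million-token pricing, a per-request cost estimate is simple arithmetic. This ignores any caching discounts or tiering, which the source does not mention:

```python
def request_cost(input_tokens, output_tokens, in_price=0.60, out_price=3.00):
    """Estimate one API call's cost in USD from the published
    per-million-token prices ($0.60 input / $3.00 output)."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# a long-context prompt (200k tokens in) with a short answer (4k tokens out)
cost = request_cost(200_000, 4_000)   # 0.12 + 0.012 = $0.132
```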

On kimi.com, Instant/Thinking modes are available for free. It is also available via third-party platforms such as Fireworks AI, OpenRouter, and Together AI.
