Qwen3-Omni: An omni-modal MoE that unifies text, image, speech, and video with 3B active parameters
Following early February’s Qwen3-Coder-Next, the Qwen team has released another intriguing model: Qwen3-Omni-30B-A3B. It accepts text, images, audio, and video as inputs, and responds in text and real-time speech. Like Coder-Next, it uses an MoE architecture where only 3B of the 30B parameters are active.
It’s becoming clear that the Qwen3 generation is doubling down on MoE. If Coder-Next is specialized for coding, Omni is about multimodal integration. Here are the technical highlights.
Thinker–Talker Architecture
The signature feature of Qwen3-Omni is its two-module design: a Thinker for reasoning and a Talker for speech generation.
Thinker (reasoning engine)
An MoE-based LLM that integrates multimodal inputs and generates text.
| Item | Value |
|---|---|
| Total parameters | 30B |
| Active parameters | 3.3B |
| Layers | 48 |
| Experts | 128 |
| Active experts | 8 |
| Context length | 32K (extendable to 128K with YaRN) |
| Vocabulary size | 151,643 |
| Positional encoding | RoPE (with QK normalization in attention) |
Whereas Coder-Next is 80B/3B (10 active out of 512 experts), Omni is 30B/3.3B (8 active out of 128 experts). The active parameter budget is nearly the same, but the expert configuration differs: Coder-Next selects 10 from a wide pool of 512 optimized for coding tasks, while Omni selects 8 from 128 to accommodate modality diversity.
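The "8 active out of 128 experts" routing can be sketched generically. The actual router's normalization and load-balancing details aren't public, so this is a plain softmax top-k illustration, not Qwen's implementation:

```python
import numpy as np

def moe_route(hidden, gate_w, top_k=8):
    """Simplified top-k MoE routing: pick 8 of 128 experts per token.

    hidden: (tokens, d_model) activations
    gate_w: (d_model, num_experts) router weights
    Returns chosen expert indices and their renormalized weights.
    """
    logits = hidden @ gate_w                        # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)           # softmax over all experts
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    top_p /= top_p.sum(-1, keepdims=True)           # renormalize over the top-8
    return top_idx, top_p

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 64))                    # 4 tokens, toy d_model=64
w = rng.standard_normal((64, 128))                  # router over 128 experts
idx, p = moe_route(h, w)
print(idx.shape, p.shape)                           # (4, 8) (4, 8)
```

Only the 8 selected experts' FFNs run per token, which is why a 30B model can have the inference cost of a ~3B dense one.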
Talker (speech generation)
A module that generates speech in real time from Thinker outputs. This is also MoE: 0.3B active out of 3B total.
- Multi-Token Prediction (MTP) module (~80M) predicts residual codebooks
- Code2Wav (~200M) converts to waveforms
- Runs asynchronously with Thinker and supports streaming output
- First-packet latency: 234 ms (audio), 547 ms (video)
- Available voices: Ethan (male, bright), Chelsie (female, calm), Aiden (male, composed)
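The asynchronous Thinker/Talker split is essentially a producer-consumer pattern: speech synthesis starts from the first tokens instead of waiting for the full answer, which is what keeps first-packet latency low. A toy illustration (the queue, sentinel, and fake "audio packets" are stand-ins, not the actual implementation):

```python
import queue
import threading
import time

def thinker(out_q):
    # Stand-in for Thinker: emits text tokens incrementally.
    for tok in ["Hello", ",", " world", "!"]:
        out_q.put(tok)
        time.sleep(0.01)            # simulated per-token decode time
    out_q.put(None)                 # end-of-stream sentinel

def talker(in_q, packets):
    # Stand-in for Talker: converts each token to an "audio packet"
    # as soon as it arrives, rather than after the full response.
    while (tok := in_q.get()) is not None:
        packets.append(f"<audio:{tok}>")

q, packets = queue.Queue(), []
t1 = threading.Thread(target=thinker, args=(q,))
t2 = threading.Thread(target=talker, args=(q, packets))
t1.start(); t2.start(); t1.join(); t2.join()
print(packets[0])                   # first packet is ready after one token
```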
Modality encoders
| Component | Parameters | Role |
|---|---|---|
| Audio Encoder (AuT) | 650M | Speech recognition and understanding; trained on 20 million hours of supervised data |
| Vision Encoder (SigLIP2) | 543M | Image/video understanding; inherited from Qwen3-VL |
| Code2Wav | 200M | Generates speech waveforms from the codec |
| MTP Module | 80M | Predicts residual codebooks |
In total, the system comes to about 35B parameters including the encoders. Rather than bolting adapters onto a text model, it is trained multimodally from the pretraining stage, with a staged data mix that grows from text-only to fully multimodal.
TM-RoPE
For positional encoding it adopts Time-aligned Multimodal RoPE (TM-RoPE), which assigns rotary angles across three axes: time, height, and width (24/20/20). With unified timestamps for audio and video, it supports cross-modal reasoning such as "what is being said at this moment in the video".
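A minimal sketch of the idea, assuming the 24/20/20 split refers to rotary frequency channels assigned per axis (the exact channel layout and RoPE base here are assumptions for illustration):

```python
import numpy as np

SECTIONS = (24, 20, 20)   # channels for time / height / width (64 total)

def tm_rope_angles(t, h, w, base=10000.0):
    """Toy TM-RoPE: each frequency channel rotates by the coordinate
    of the axis it is assigned to (time, height, or width)."""
    dim = sum(SECTIONS)                             # 64 frequency channels
    inv_freq = 1.0 / base ** (np.arange(dim) / dim)
    coords = np.concatenate([
        np.full(SECTIONS[0], t),                    # first 24 channels: time
        np.full(SECTIONS[1], h),                    # next 20: height
        np.full(SECTIONS[2], w),                    # last 20: width
    ])
    return coords * inv_freq                        # per-channel rotary angle

a = tm_rope_angles(t=3, h=1, w=2)
print(a.shape)                                      # (64,)
```

Because audio frames and video frames share the same time coordinate `t`, a spoken word and the video frame it co-occurs with get aligned positions.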
Three Variants
| Variant | Output | Notes |
|---|---|---|
| Instruct | Text + real-time speech | Uses both Thinker and Talker; conversation-oriented |
| Thinking | Text only (with CoT reasoning) | Thinker only; accuracy-focused |
| Captioner | Text only | Focused on speech captioning; low hallucination |
The Thinking variant improves accuracy via Chain-of-Thought, but for perception tasks (ASR, music recognition) there are reports that “reasoning can induce hallucinations”. The lineup is meant to be used selectively depending on the task.
Benchmarks
Text reasoning
| Benchmark | Instruct | Thinking | GPT-4o |
|---|---|---|---|
| MMLU-Redux | 86.6 | 88.8 | - |
| GPQA | 69.6 | 73.1 | 66.9 |
| AIME25 | 65.0 | 73.7 | 26.7 |
| ZebraLogic | 76.0 | - | 52.6 |
On AIME25, 73.7 versus 26.7 against GPT-4o is striking. It’s impressive to see a model with only ~3B active parameters reach this level of mathematical reasoning.
Speech recognition (WER; lower is better)
| Benchmark | Qwen3-Omni | GPT-4o-Transcribe |
|---|---|---|
| Librispeech clean | 1.22 | 1.39 |
| Librispeech other | 2.48 | - |
| Multilingual average (19 languages) | 5.33 | - |
Audio input supports 19 languages (including Japanese). Unlike the Gemini Live API and OpenAI Realtime API compared in the earlier voice API survey, Qwen3-Omni processes audio natively rather than through an STT → LLM → TTS pipeline. Its direction is also close to the full-duplex voice interaction explored in PersonaPlex.
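For reference, WER is word-level edit distance divided by reference length, which is why values like 1.22 mean "1.22 errors per 100 reference words". A minimal implementation:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over reference word count."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                 # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                 # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(round(wer("the cat sat on the mat", "the cat sat on a mat"), 3))  # 0.167
```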
Visual understanding
| Benchmark | Instruct | Thinking | GPT-4o |
|---|---|---|---|
| MMStar | 68.5 | 74.9 | - |
| MathVista | 75.9 | 80.0 | - |
| MATH-Vision | 56.3 | - | 38.1 |
| Video-MME | 70.5 | - | - |
Speech generation (zero-shot)
| Benchmark | Qwen3-Omni | Seed-TTS-RL |
|---|---|---|
| SEED test-en WER | 1.39 | 1.94 |
| SEED test-zh WER | 1.07 | 1.00 |
Among open-source models, it achieves state-of-the-art results on 32 of 36 benchmarks.
VRAM Requirements
Under BF16 precision, approximate usage:
| Variant | 15s video | 60s video | 120s video |
|---|---|---|---|
| Instruct (Thinker + Talker) | 79GB | 108GB | 145GB |
| Thinking (Thinker only) | 69GB | 96GB | 132GB |
If you only process text plus short audio, you can stay around 60–70GB, but video length inflates memory quickly. Disabling Talker saves about 10GB.
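These figures can be sanity-checked with back-of-envelope arithmetic at 2 bytes per BF16 parameter; the table's numbers come out higher because they also include activations, KV cache, and video frame buffers:

```python
def bf16_weight_gb(params_billions: float) -> float:
    """GB needed for the weights alone at BF16 (2 bytes per parameter)."""
    return params_billions * 1e9 * 2 / 1e9

# Thinker 30B, Talker 3B, encoders ~1.2B (AuT 0.65B + SigLIP2 0.54B):
thinker_only = bf16_weight_gb(30)              # 60.0 GB
full_stack = bf16_weight_gb(30 + 3 + 1.2)      # ~68.4 GB
print(round(thinker_only), round(full_stack))  # 60 68
```

Weights alone already take ~60-68GB, which is why even the "text plus short audio" case hovers near the top of an 80GB card.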
In my article on running Qwen-Image-Edit-2511 locally, even the 20B model had steep VRAM needs; Omni, at 30B plus encoders, is heavier. A single RTX 4090 is tough; A100 80GB or H100 are more realistic choices. When I ran Qwen Image Edit on an M1 Max, 64GB RAM was barely enough, so Omni without quantization will likely be hard on Apple Silicon as well.
MoE Strategy in the Qwen3 Generation
Over the past few weeks, the overall shape of the Qwen3 family has come into focus.
| Model | Use case | Total params | Active | Experts |
|---|---|---|---|---|
| Qwen3-Coder-Next | Coding | 80B | 3B | 512 |
| Qwen3-Omni | Omni-modal | 30B | 3.3B | 128 |
| Qwen3-235B-A22B | General text | 235B | 22B | - |
Notably, Coder-Next and Omni both target roughly 3B active parameters. This unifies inference cost while changing specialization via expert layout and training data.
In contrast to Kimi K2.5, which pushes a brute-force 32B active out of 1T total parameters, Qwen pursues "how far can small active parameter budgets go". In fact, that a 3.3B-active model outperforms GPT-4o on several benchmarks suggests the MoE routing and training are paying off.
Training Process
Pretraining (3 stages)
- Encoder alignment: Freeze the LLM and train the audio/vision encoders on paired text data for each modality.
- General training: Roughly 2.3 trillion tokens in total (text 0.57T, audio 0.77T, image 0.82T, video 0.1T).
- Long context: Extend max sequence length from 8,192 to 32,768.
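The further 32K → 128K extension via YaRN (from the config table above) can be approximated with plain position interpolation: divide the rotary frequencies by the length ratio so new positions map into the trained range. This sketch uses a single uniform factor and assumes a RoPE base of 1e6 for illustration; YaRN proper scales each frequency band differently:

```python
import numpy as np

def interpolated_inv_freq(head_dim=128, base=1_000_000.0,
                          trained_len=32_768, target_len=131_072):
    """Crude context extension: compress all rotary frequencies by the
    length ratio so 128K positions fit the 32K-trained angle range."""
    factor = target_len / trained_len                       # 4.0
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    return inv_freq / factor, factor

freq, factor = interpolated_inv_freq()
print(factor, freq.shape)   # 4.0 (64,)
```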
Post-training (Thinker)
- SFT (Supervised Fine-Tuning)
- Strong-to-Weak distillation (using Qwen3-32B and Qwen3-235B-A22B as teacher models)
- GSPO optimization (rule-based rewards + LLM-as-a-Judge)
Qwen3-235B-A22B serves as a teacher model, establishing a knowledge distillation pipeline within the family.
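How the rule-based rewards and LLM-as-a-Judge scores are blended in GSPO isn't detailed, but a toy mixer makes the idea concrete; the exact-match rule, the 0-1 judge score, and the 50/50 weighting below are all invented for illustration:

```python
def combined_reward(answer: str, gold: str, judge_score: float,
                    w_rule: float = 0.5) -> float:
    """Toy reward: blend a binary rule-based check (exact match against a
    reference) with a continuous 0-1 score from an LLM judge."""
    rule = 1.0 if answer.strip() == gold.strip() else 0.0
    return w_rule * rule + (1 - w_rule) * judge_score

print(combined_reward("42", "42", judge_score=0.8))  # 0.9
```

Rule-based rewards suit verifiable tasks (math, ASR transcripts); the judge covers open-ended responses where no gold answer exists.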
Thoughts
Omni arrived right after I wrote about Coder-Next, making Qwen3’s MoE direction suddenly much clearer. Building text, coding, and multimodal specialization on the same MoE framework is a sensible approach.
Personally I’m most curious about the speech integration. In the earlier voice API survey the pipeline was STT/LLM/TTS stitched together, but native, integrated approaches like Omni shift the architecture itself. Lower latency and preserved cross-modal context are big advantages.
That said, 69GB+ VRAM isn’t easy to tinker with locally. Coder-Next ran on an RTX 4090, but Omni likely assumes cloud or multi-GPU. Renting an A100 on RunPod seems realistic. If quantized variants appear, the picture could change.