Qwen3-Omni: An omni-modal MoE that unifies text, image, speech, and video with 3B active parameters
Following early February’s Qwen3-Coder-Next, the Qwen team has released another intriguing model: Qwen3-Omni-30B-A3B. It accepts text, images, audio, and video as inputs, and responds in text and real-time speech. Like Coder-Next, it uses an MoE architecture where only 3B of the 30B parameters are active.
It’s becoming clear that the Qwen3 generation is doubling down on MoE. If Coder-Next is specialized for coding, Omni is about multimodal integration. Here are the technical highlights.
Thinker–Talker Architecture
The signature feature of Qwen3-Omni is its two-module design: a Thinker for reasoning and a Talker for speech generation.
Thinker (reasoning engine)
An MoE-based LLM that integrates multimodal inputs and generates text.
| Item | Value |
|---|---|
| Total parameters | 30B |
| Active parameters | 3.3B |
| Layers | 48 |
| Experts | 128 |
| Active experts | 8 |
| Context length | 32K (extendable to 128K with YaRN) |
| Vocabulary size | 151,643 |
| Positional encoding | RoPE (with QK normalization in attention) |
Whereas Coder-Next is 80B/3B (10 active out of 512 experts), Omni is 30B/3.3B (8 active out of 128 experts). The active parameter budget is nearly the same, but the expert configuration differs: Coder-Next selects 10 from a wide pool of 512 optimized for coding tasks, while Omni selects 8 from 128 to accommodate modality diversity.
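The "8 active out of 128 experts" routing can be sketched generically. The actual router's normalization and load-balancing details aren't public, so this is a plain softmax top-k illustration, not Qwen's implementation:

```python
import numpy as np

def moe_route(hidden, gate_w, top_k=8):
    """Simplified top-k MoE routing: pick 8 of 128 experts per token.

    hidden: (tokens, d_model) activations
    gate_w: (d_model, num_experts) router weights
    Returns chosen expert indices and their renormalized weights.
    """
    logits = hidden @ gate_w                        # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)           # softmax over all experts
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    top_p /= top_p.sum(-1, keepdims=True)           # renormalize over the top-8
    return top_idx, top_p

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 64))                    # 4 tokens, toy d_model=64
w = rng.standard_normal((64, 128))                  # router over 128 experts
idx, p = moe_route(h, w)
print(idx.shape, p.shape)                           # (4, 8) (4, 8)
```

Only the 8 selected experts' FFNs run per token, which is why a 30B model can have the inference cost of a ~3B dense one.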
Talker (speech generation)
A module that generates speech in real time from Thinker outputs. This is also MoE: 0.3B active out of 3B total.
- Multi-Token Prediction (MTP) module (~80M) predicts residual codebooks
- Code2Wav (~200M) converts to waveforms
- Runs asynchronously with Thinker and supports streaming output
- First-packet latency: 234 ms (audio), 547 ms (video)
- Available voices: Ethan (male, bright), Chelsie (female, calm), Aiden (male, composed)
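The asynchronous Thinker/Talker split is essentially a producer-consumer pattern: speech synthesis starts from the first tokens instead of waiting for the full answer, which is what keeps first-packet latency low. A toy illustration (the queue, sentinel, and fake "audio packets" are stand-ins, not the actual implementation):

```python
import queue
import threading
import time

def thinker(out_q):
    # Stand-in for Thinker: emits text tokens incrementally.
    for tok in ["Hello", ",", " world", "!"]:
        out_q.put(tok)
        time.sleep(0.01)            # simulated per-token decode time
    out_q.put(None)                 # end-of-stream sentinel

def talker(in_q, packets):
    # Stand-in for Talker: converts each token to an "audio packet"
    # as soon as it arrives, rather than after the full response.
    while (tok := in_q.get()) is not None:
        packets.append(f"<audio:{tok}>")

q, packets = queue.Queue(), []
t1 = threading.Thread(target=thinker, args=(q,))
t2 = threading.Thread(target=talker, args=(q, packets))
t1.start(); t2.start(); t1.join(); t2.join()
print(packets[0])                   # first packet is ready after one token
```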
Modality encoders
| Component | Parameters | Role |
|---|---|---|
| Audio Encoder (AuT) | 650M | Speech recognition and understanding; trained on 20 million hours of supervised data |
| Vision Encoder (SigLIP2) | 543M | Image/video understanding; inherited from Qwen3-VL |
| Code2Wav | 200M | Generates speech waveforms from the codec |
| MTP Module | 80M | Predicts residual codebooks |
In total, the system comes to about 35B parameters including the encoders. Rather than bolting adapters onto a text model, it is trained multimodally from the pretraining stage, with a staged data mix that grows from text-only to fully multimodal.
TM-RoPE
For positional encoding it adopts Time-aligned Multimodal RoPE (TM-RoPE), which assigns rotary angles across three axes: time, height, and width (24/20/20). With unified timestamps for audio and video, it supports cross-modal reasoning such as "what is being said at this moment in the video".
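A minimal sketch of the idea, assuming the 24/20/20 split refers to rotary frequency channels assigned per axis (the exact channel layout and RoPE base here are assumptions for illustration):

```python
import numpy as np

SECTIONS = (24, 20, 20)   # channels for time / height / width (64 total)

def tm_rope_angles(t, h, w, base=10000.0):
    """Toy TM-RoPE: each frequency channel rotates by the coordinate
    of the axis it is assigned to (time, height, or width)."""
    dim = sum(SECTIONS)                             # 64 frequency channels
    inv_freq = 1.0 / base ** (np.arange(dim) / dim)
    coords = np.concatenate([
        np.full(SECTIONS[0], t),                    # first 24 channels: time
        np.full(SECTIONS[1], h),                    # next 20: height
        np.full(SECTIONS[2], w),                    # last 20: width
    ])
    return coords * inv_freq                        # per-channel rotary angle

a = tm_rope_angles(t=3, h=1, w=2)
print(a.shape)                                      # (64,)
```

Because audio frames and video frames share the same time coordinate `t`, a spoken word and the video frame it co-occurs with get aligned positions.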
Three Variants
| Variant | Output | Notes |
|---|---|---|
| Instruct | Text + real-time speech | Uses both Thinker and Talker; conversation-oriented |
| Thinking | Text only (with CoT reasoning) | Thinker only; accuracy-focused |
| Captioner | Text only | Focused on speech captioning; low hallucination |
The Thinking variant improves accuracy via Chain-of-Thought, but for perception tasks (ASR, music recognition) there are reports that “reasoning can induce hallucinations”. The lineup is meant to be used selectively depending on the task.
Benchmarks
Text reasoning
| Benchmark | Instruct | Thinking | GPT-4o |
|---|---|---|---|
| MMLU-Redux | 86.6 | 88.8 | - |
| GPQA | 69.6 | 73.1 | 66.9 |
| AIME25 | 65.0 | 73.7 | 26.7 |
| ZebraLogic | 76.0 | - | 52.6 |
On AIME25, 73.7 versus 26.7 against GPT-4o is striking. It’s impressive to see a model with only ~3B active parameters reach this level of mathematical reasoning.
Speech recognition (WER; lower is better)
| Benchmark | Qwen3-Omni | GPT-4o-Transcribe |
|---|---|---|
| Librispeech clean | 1.22 | 1.39 |
| Librispeech other | 2.48 | - |
| Multilingual average (19 languages) | 5.33 | - |
Audio input supports 19 languages (including Japanese). Unlike the Gemini Live API and OpenAI Realtime API compared in the earlier voice API survey, Qwen3-Omni processes audio natively rather than through an STT → LLM → TTS pipeline. Its direction is also close to the full-duplex voice interaction explored in PersonaPlex.
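For reference, WER is word-level edit distance divided by reference length, which is why values like 1.22 mean "1.22 errors per 100 reference words". A minimal implementation:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over reference word count."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                 # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j                 # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(round(wer("the cat sat on the mat", "the cat sat on a mat"), 3))  # 0.167
```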
Visual understanding
| Benchmark | Instruct | Thinking | GPT-4o |
|---|---|---|---|
| MMStar | 68.5 | 74.9 | - |
| MathVista | 75.9 | 80.0 | - |
| MATH-Vision | 56.3 | - | 38.1 |
| Video-MME | 70.5 | - | - |
Speech generation (zero-shot)
| Benchmark | Qwen3-Omni | Seed-TTS-RL |
|---|---|---|
| SEED test-en WER | 1.39 | 1.94 |
| SEED test-zh WER | 1.07 | 1.00 |
Among open-source models, it achieves state-of-the-art results on 32 of 36 benchmarks.
VRAM Requirements
Under BF16 precision, approximate usage:
| Variant | 15s video | 60s video | 120s video |
|---|---|---|---|
| Instruct (Thinker + Talker) | 79GB | 108GB | 145GB |
| Thinking (Thinker only) | 69GB | 96GB | 132GB |
If you only process text plus short audio, you can stay around 60–70GB, but video length inflates memory quickly. Disabling Talker saves about 10GB.
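These figures can be sanity-checked with back-of-envelope arithmetic at 2 bytes per BF16 parameter; the table's numbers come out higher because they also include activations, KV cache, and video frame buffers:

```python
def bf16_weight_gb(params_billions: float) -> float:
    """GB needed for the weights alone at BF16 (2 bytes per parameter)."""
    return params_billions * 1e9 * 2 / 1e9

# Thinker 30B, Talker 3B, encoders ~1.2B (AuT 0.65B + SigLIP2 0.54B):
thinker_only = bf16_weight_gb(30)              # 60.0 GB
full_stack = bf16_weight_gb(30 + 3 + 1.2)      # ~68.4 GB
print(round(thinker_only), round(full_stack))  # 60 68
```

Weights alone already take ~60-68GB, which is why even the "text plus short audio" case hovers near the top of an 80GB card.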
In my article on running Qwen-Image-Edit-2511 locally, even the 20B model had steep VRAM needs; Omni, at 30B plus encoders, is heavier. A single RTX 4090 is tough; A100 80GB or H100 are more realistic choices. When I ran Qwen Image Edit on an M1 Max, 64GB RAM was barely enough, so Omni without quantization will likely be hard on Apple Silicon as well.
MoE Strategy in the Qwen3 Generation
Over the past few weeks, the overall shape of the Qwen3 family has come into focus.
| Model | Use case | Total params | Active | Experts |
|---|---|---|---|---|
| Qwen3-Coder-Next | Coding | 80B | 3B | 512 |
| Qwen3-Omni | Omni-modal | 30B | 3.3B | 128 |
| Qwen3-235B-A22B | General text | 235B | 22B | - |
Notably, Coder-Next and Omni both target roughly 3B active parameters. This unifies inference cost while changing specialization via expert layout and training data.
In contrast to Kimi K2.5, which pushes a brute-force 32B active out of 1T total parameters, Qwen pursues "how far can small active parameter budgets go". In fact, that a 3.3B-active model outperforms GPT-4o on several benchmarks suggests the MoE routing and training are paying off.
Training Process
Pretraining (3 stages)
- Encoder alignment: Freeze the LLM and train the audio/vision encoders on paired text data for each modality.
- General training: Roughly 2.3 trillion tokens in total (text 0.57T, audio 0.77T, image 0.82T, video 0.1T).
- Long context: Extend max sequence length from 8,192 to 32,768.
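The further 32K → 128K extension via YaRN (from the config table above) can be approximated with plain position interpolation: divide the rotary frequencies by the length ratio so new positions map into the trained range. This sketch uses a single uniform factor and assumes a RoPE base of 1e6 for illustration; YaRN proper scales each frequency band differently:

```python
import numpy as np

def interpolated_inv_freq(head_dim=128, base=1_000_000.0,
                          trained_len=32_768, target_len=131_072):
    """Crude context extension: compress all rotary frequencies by the
    length ratio so 128K positions fit the 32K-trained angle range."""
    factor = target_len / trained_len                       # 4.0
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    return inv_freq / factor, factor

freq, factor = interpolated_inv_freq()
print(factor, freq.shape)   # 4.0 (64,)
```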
Post-training (Thinker)
- SFT (Supervised Fine-Tuning)
- Strong-to-Weak distillation (using Qwen3-32B and Qwen3-235B-A22B as teacher models)
- GSPO optimization (rule-based rewards + LLM-as-a-Judge)
Qwen3-235B-A22B serves as a teacher model, establishing a knowledge distillation pipeline within the family.
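How the rule-based rewards and LLM-as-a-Judge scores are blended in GSPO isn't detailed, but a toy mixer makes the idea concrete; the exact-match rule, the 0-1 judge score, and the 50/50 weighting below are all invented for illustration:

```python
def combined_reward(answer: str, gold: str, judge_score: float,
                    w_rule: float = 0.5) -> float:
    """Toy reward: blend a binary rule-based check (exact match against a
    reference) with a continuous 0-1 score from an LLM judge."""
    rule = 1.0 if answer.strip() == gold.strip() else 0.0
    return w_rule * rule + (1 - w_rule) * judge_score

print(combined_reward("42", "42", judge_score=0.8))  # 0.9
```

Rule-based rewards suit verifiable tasks (math, ASR transcripts); the judge covers open-ended responses where no gold answer exists.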
Thoughts
Omni arrived right after I wrote about Coder-Next, making Qwen3’s MoE direction suddenly much clearer. Building text, coding, and multimodal specialization on the same MoE framework is a sensible approach.
Personally I’m most curious about the speech integration. In the earlier voice API survey the pipeline was STT/LLM/TTS stitched together, but native, integrated approaches like Omni shift the architecture itself. Lower latency and preserved cross-modal context are big advantages.
That said, 69GB+ VRAM isn’t easy to tinker with locally. Coder-Next ran on an RTX 4090, but Omni likely assumes cloud or multi-GPU. Renting an A100 on RunPod seems realistic. If quantized variants appear, the picture could change.