# Qwen3-TTS — Open-source speech synthesis with a single pip install
Open‑source TTS options have exploded and are hard to keep up with, but Qwen3‑TTS stands out for both ease of setup and Japanese quality. You don’t need ComfyUI or custom workflows—audio comes out with just a few lines of Python.
- Repository: QwenLM/Qwen3-TTS
- HuggingFace: Qwen/Qwen3-TTS-12Hz-1.7B-Base
- Demo: Qwen/Qwen3-TTS
- Paper: arXiv:2601.15621
- License: Apache 2.0 (commercial use allowed)
- Release date: January 22, 2026
## Relationship to CosyVoice
There are two TTS project lines within Alibaba, so here’s a quick map.
| Project | Team | Architecture |
|---|---|---|
| CosyVoice (1/2/3) | Tongyi Lab Speech Team | LLM + Flow Matching (DiT) |
| Qwen3-TTS | Qwen Team | Discrete multi‑codebook LM (no DiT) |
CosyVoice has been under active development since 2024 and combines an LLM with DiT‑based flow matching. Qwen3‑TTS is built by a different team and adopts a discrete multi‑codebook LM architecture without DiT. The benchmarks explicitly compare against CosyVoice 3.
## Model Variants
| Model | Parameters | Capabilities | Instruction control |
|---|---|---|---|
| 1.7B-CustomVoice | 1.7B | 9 preset voices + style control | Yes |
| 1.7B-VoiceDesign | 1.7B | Natural‑language voice design | Yes |
| 1.7B-Base | 1.7B | 3‑second voice cloning + fine‑tuning | No |
| 0.6B-CustomVoice | 0.6B | Lightweight preset voices | No |
| 0.6B-Base | 0.6B | Lightweight voice cloning | No |
All models support streaming generation. They use a dedicated tokenizer, Qwen3‑TTS‑Tokenizer‑12Hz (16‑layer multi‑codebook, 12 Hz sampling).
## Three Modes
CustomVoice — Choose from nine high‑quality preset voices and synthesize speech. Also supports instructions for emotion and speaking style.
Preset voices:
- Chinese: Vivian, Serena, Uncle_Fu
- Dialects: Dylan (Beijing), Eric (Sichuan)
- English: Ryan, Aiden
- Japanese: Ono_Anna
- Korean: Sohee
VoiceDesign — Design a voice in natural language, e.g., “a calm, low male voice” or “a bright, young female speaking style.” A key strength is generating rights‑clear voices without using someone else’s voice.
Base (voice cloning) — Clone a voice from a 3‑second reference. Fine‑tuning is also supported.
## Architecture
Qwen3‑TTS adopts a discrete multi‑codebook LM architecture. Instead of the usual LLM + DiT (flow‑matching) cascade, it generates speech end‑to‑end.
### Qwen3-TTS-Tokenizer-12Hz
A dedicated audio tokenizer. It samples at 12 Hz and uses a 16‑layer multi‑codebook to convert audio into a token sequence.
- 12 Hz = 12 frames per second; much lower than typical TTS tokenizers (25–50 Hz)
- The 16‑layer codebook compensates for information at the lower frequency
- Shorter sequences speed up LM inference
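The sequence-length arithmetic behind those bullet points can be sketched in a few lines (frame rate and codebook count from above; the 25 Hz comparison is simply one of the typical rates mentioned earlier):

```python
# Sequence-length arithmetic for a 12 Hz, 16-codebook tokenizer.
def token_counts(seconds: float, frame_rate: int = 12, codebooks: int = 16):
    frames = int(seconds * frame_rate)   # LM sequence length (autoregressive steps)
    total = frames * codebooks           # tokens across all codebook layers
    return frames, total

# A 10-second clip: 120 LM steps, 1920 tokens in total.
print(token_counts(10))   # (120, 1920)
# The same clip through a 25 Hz tokenizer would take 250 steps,
# so the 12 Hz rate roughly halves the number of LM inference steps.
```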
### Dual-Track Streaming
Streaming generation uses a Dual‑Track hybrid architecture. It can start audio output after just one character of input, with an initial packet latency of 97 ms (0.6B model).
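As a quick sanity check on that number (my own arithmetic, not a documented figure): at 12 Hz, a single tokenizer frame covers about 83 ms of audio, so a 97 ms first packet is on the order of a single frame plus model overhead.

```python
# Audio covered by one tokenizer frame at a 12 Hz frame rate.
frame_ms = 1000 / 12
print(round(frame_ms, 1))   # 83.3 ms of audio per frame
```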
## Setup
```bash
conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts

pip install -U qwen-tts

# FlashAttention 2 (recommended for NVIDIA GPUs)
pip install -U flash-attn --no-build-isolation
```
That’s it. Model weights are downloaded automatically on first run.
## Usage

### Voice Cloning (Base)
Clone from a 3‑second reference and synthesize.
```python
import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_clone(
    text="こんにちは、これはテストです。",  # "Hello, this is a test."
    language="Japanese",
    ref_audio="reference.wav",
    ref_text="リファレンス音声のテキスト",  # transcript of the reference audio
)
sf.write("output.wav", wavs[0], sr)
```
When generating multiple sentences with the same voice, reuse the prompt via `create_voice_clone_prompt()`:
```python
prompt_items = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="リファレンス音声のテキスト",  # transcript of the reference audio
)
wavs, sr = model.generate_voice_clone(
    text=["1文目。", "2文目。"],  # "First sentence." / "Second sentence."
    language=["Japanese", "Japanese"],
    voice_clone_prompt=prompt_items,
)
```
### Preset Voices (CustomVoice)
```python
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_custom_voice(
    text="今日はいい天気ですね。",  # "Nice weather today, isn't it?"
    language="Japanese",
    speaker="Ono_Anna",
    instruct="穏やかに話してください",  # "Please speak calmly."
)
sf.write("output.wav", wavs[0], sr)
```
### Voice Design (VoiceDesign)
```python
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_design(
    text="こんにちは。",  # "Hello."
    language="Japanese",
    instruct="落ち着いた低い男性の声で、ゆっくり話す",  # "A calm, low male voice, speaking slowly."
)
sf.write("output.wav", wavs[0], sr)
```
## Web UI Demo
```bash
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000
```
If you want to record microphone input from the browser, enable HTTPS.
```bash
openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes -subj "/CN=localhost"

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --ip 0.0.0.0 --port 8000 \
  --ssl-certfile cert.pem \
  --ssl-keyfile key.pem \
  --no-ssl-verify
```
## System Requirements
| Model | VRAM guideline | Notes |
|---|---|---|
| 0.6B | 4–8 GB | Runs on older GPUs |
| 1.7B | ≈16 GB | RTX 3090 / 4090 recommended |
- Apple Silicon: Works via MPS. Confirmed on an M3 MacBook Air (VoiceDesign 1.7B: ~4.2 GB, Base 0.6B: ~2.3 GB).
- CPU: Works but not recommended. RTF around 3–5× (30 seconds of audio takes 90–150 seconds).
- FlashAttention 2: Recommended but not required. Improves VRAM usage and speed.
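RTF (real-time factor) is processing time divided by audio duration, so a value above 1 means slower than real time. The CPU figures quoted above work out as:

```python
def rtf(processing_s: float, audio_s: float) -> float:
    """Real-time factor: processing time / audio duration."""
    return processing_s / audio_s

# 30 seconds of audio taking 90-150 seconds of compute:
print(rtf(90, 30), rtf(150, 30))   # 3.0 5.0
```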
## Benchmarks

### Seed-TTS (WER↓, lower is better)
| Model | test-zh | test-en |
|---|---|---|
| CosyVoice 3 | 0.71 | 1.45 |
| MiniMax-Speech | 0.83 | 1.65 |
| F5-TTS | 1.56 | 1.83 |
| Qwen3-TTS-1.7B-Base | 0.77 | 1.24 |
Qwen3-TTS posts the lowest English WER among the models listed, while CosyVoice 3 narrowly leads on Chinese.
### Multilingual (10 languages, WER↓)
| Language | WER | Speaker similarity↑ |
|---|---|---|
| Chinese | 0.928 | 0.799 |
| English | 0.934 | 0.775 |
| German | 1.235 | 0.775 |
| Italian | 0.948 | 0.817 |
| Korean | 1.755 | 0.799 |
| French | 2.858 | 0.714 |
Lowest WER in 6/10 languages, and best speaker similarity across all 10.
### Latency
| Model | First packet |
|---|---|
| Qwen3-TTS | 97 ms |
| OpenAI TTS | ≈150 ms |
| ElevenLabs | ≈200 ms |
## Comparison with Other Open-Source TTS
Other TTS models covered on this blog:
| Model | Parameters | Japanese | Voice cloning | Setup |
|---|---|---|---|---|
| Qwen3-TTS | 0.6B / 1.7B | ✅ | ✅ (3 s) | pip install |
| KugelAudio | 7B | ❌ | ✅ (5–30 s) | ComfyUI |
| Pocket TTS | 100M | ❌ | ✅ | pip install |
| Qwen3-Omni | 30B (3B active) | ✅ | ❌ | - |
Qwen3-TTS is the only model in this comparison that supports Japanese, offers voice cloning, and installs with a single pip command.
## Related Links
- Qwen3-Omni: An omni‑modal model that unifies text, image, audio, and video with a 3B‑active MoE — Qwen3’s speech output goes through a Talker module
- KugelAudio — 7B‑parameter open‑source TTS (ComfyUI compatible) — A TTS that uses Qwen as the LLM backbone
- Pocket TTS — lightweight text‑to‑speech that runs on CPU — An ultra‑lightweight 100M TTS
- Building a setup to talk with AI (1): Survey of speech APIs — Comparison of TTS APIs