
Qwen3-TTS — Open-source speech synthesis with a single pip install

Open‑source TTS options have exploded and are hard to keep up with, but Qwen3‑TTS stands out for both ease of setup and Japanese quality. You don’t need ComfyUI or custom workflows—audio comes out with just a few lines of Python.

Relationship to CosyVoice

There are two TTS project lines within Alibaba, so here’s a quick map.

| Project | Team | Architecture |
| --- | --- | --- |
| CosyVoice (1/2/3) | Tongyi Lab Speech Team | LLM + Flow Matching (DiT) |
| Qwen3-TTS | Qwen Team | Discrete multi-codebook LM (no DiT) |

CosyVoice has been under active development since 2024 and combines an LLM with DiT‑based flow matching. Qwen3‑TTS is built by a different team and adopts a discrete multi‑codebook LM architecture without DiT. The benchmarks explicitly compare against CosyVoice 3.

Model Variants

| Model | Parameters | Capabilities | Instruction control |
| --- | --- | --- | --- |
| 1.7B-CustomVoice | 1.7B | 9 preset voices + style control | Yes |
| 1.7B-VoiceDesign | 1.7B | Natural-language voice design | Yes |
| 1.7B-Base | 1.7B | 3-second voice cloning + fine-tuning | No |
| 0.6B-CustomVoice | 0.6B | Lightweight preset voices | No |
| 0.6B-Base | 0.6B | Lightweight voice cloning | No |

All models support streaming generation. They use a dedicated tokenizer, Qwen3‑TTS‑Tokenizer‑12Hz (16‑layer multi‑codebook, 12 Hz sampling).

Three Modes

CustomVoice — Choose from nine high‑quality preset voices and synthesize speech. Also supports instructions for emotion and speaking style.

Preset voices:

  • Chinese: Vivian, Serena, Uncle_Fu
  • Dialects: Dylan (Beijing), Eric (Sichuan)
  • English: Ryan, Aiden
  • Japanese: Ono_Anna
  • Korean: Sohee

VoiceDesign — Design a voice in natural language, e.g., “a calm, low male voice” or “a bright, young female speaking style.” A key strength is generating rights‑clear voices without using someone else’s voice.

Base (voice cloning) — Clone a voice from a 3‑second reference. Fine‑tuning is also supported.

Architecture

Qwen3‑TTS adopts a discrete multi‑codebook LM architecture. Instead of the usual LLM + DiT (flow‑matching) cascade, it generates speech end‑to‑end.

Qwen3-TTS-Tokenizer-12Hz

A dedicated audio tokenizer. It samples at 12 Hz and uses a 16‑layer multi‑codebook to convert audio into a token sequence.

  • 12 Hz = 12 frames per second; much lower than typical TTS tokenizers (25–50 Hz)
  • The 16‑layer codebook compensates for information at the lower frequency
  • Shorter sequences speed up LM inference
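
The effect of the low frame rate on sequence length is easy to quantify. A rough sketch (the 25 Hz single-codebook comparison point is just the low end of the "typical" range above):

```python
def token_counts(seconds: float, frame_rate_hz: float, codebooks: int) -> tuple[int, int]:
    """Return (LM sequence length in frames, total codebook entries) for a clip."""
    frames = int(seconds * frame_rate_hz)
    return frames, frames * codebooks

# Qwen3-TTS-Tokenizer-12Hz: 12 frames/s, 16 codebooks per frame
print(token_counts(10, 12, 16))     # (120, 1920) for 10 s of audio
# A 25 Hz single-codebook tokenizer needs 250 autoregressive steps for the same clip
print(token_counts(10, 25, 1)[0])   # 250
```

Because the LM autoregresses over frames, 10 seconds of audio is 120 steps instead of 250; the 16 codebook entries per frame add per-step width rather than sequence length, which is the point of the multi-codebook design.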

Dual‑Track Streaming

Streaming generation uses a Dual‑Track hybrid architecture. It can start audio output after just one character of input, with an initial packet latency of 97 ms (0.6B model).
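
The actual qwen-tts streaming call isn't shown in this post, so the following illustrates only the consumer side, with a stand-in generator: `generate_stream`, the chunk size, and the sample rate are all hypothetical names and values, not the real interface.

```python
import time

SAMPLE_RATE = 24_000   # hypothetical output sample rate
CHUNK = 2_400          # hypothetical 100 ms chunks

def generate_stream(text: str):
    """Stand-in for a streaming TTS call: yields audio chunks as they are ready."""
    total = SAMPLE_RATE * 2  # pretend the utterance is 2 s long
    for start in range(0, total, CHUNK):
        yield [0.0] * min(CHUNK, total - start)  # silent placeholder audio

t0 = time.monotonic()
first_packet_ms = None
received: list[float] = []
for chunk in generate_stream("こんにちは"):
    if first_packet_ms is None:
        # this is what the 97 ms figure measures: time until the first audio chunk
        first_packet_ms = (time.monotonic() - t0) * 1000
    received.extend(chunk)  # a real player would push each chunk to the audio device
print(len(received) / SAMPLE_RATE)  # 2.0 (seconds of audio received)
```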

Setup

conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
pip install -U qwen-tts

# FlashAttention 2 (recommended for NVIDIA GPUs)
pip install -U flash-attn --no-build-isolation

That’s it. Model weights are downloaded automatically on first run.

Usage

Voice Cloning (Base)

Clone from a 3‑second reference and synthesize.

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_clone(
    text="こんにちは、これはテストです。",  # "Hello, this is a test."
    language="Japanese",
    ref_audio="reference.wav",
    ref_text="リファレンス音声のテキスト",  # transcript of the reference audio
)
sf.write("output.wav", wavs[0], sr)

When generating multiple sentences with the same voice, reuse the prompt via create_voice_clone_prompt().

prompt_items = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="リファレンス音声のテキスト",  # transcript of the reference audio
)

wavs, sr = model.generate_voice_clone(
    text=["1文目。", "2文目。"],  # "First sentence." / "Second sentence."
    language=["Japanese", "Japanese"],
    voice_clone_prompt=prompt_items,
)
for i, wav in enumerate(wavs):
    sf.write(f"output_{i}.wav", wav, sr)

Preset Voices (CustomVoice)

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_custom_voice(
    text="今日はいい天気ですね。",  # "Nice weather today, isn't it?"
    language="Japanese",
    speaker="Ono_Anna",
    instruct="穏やかに話してください",  # "Please speak calmly."
)
sf.write("output.wav", wavs[0], sr)

Voice Design (VoiceDesign)

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_design(
    text="こんにちは。",  # "Hello."
    language="Japanese",
    instruct="落ち着いた低い男性の声で、ゆっくり話す",  # "a calm, low male voice, speaking slowly"
)
sf.write("output.wav", wavs[0], sr)

Web UI Demo

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000

If you want to record microphone input from the browser, enable HTTPS.

openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes -subj "/CN=localhost"

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --ip 0.0.0.0 --port 8000 \
  --ssl-certfile cert.pem \
  --ssl-keyfile key.pem \
  --no-ssl-verify

System Requirements

| Model | VRAM guideline | Notes |
| --- | --- | --- |
| 0.6B | 4–8 GB | Runs on older GPUs |
| 1.7B | ≈16 GB | RTX 3090 / 4090 recommended |

  • Apple Silicon: Works via MPS. Confirmed on an M3 MacBook Air (VoiceDesign 1.7B: ~4.2 GB, Base 0.6B: ~2.3 GB).
  • CPU: Works but not recommended. RTF around 3–5× (30 seconds of audio takes 90–150 seconds).
  • FlashAttention 2: Recommended but not required. Improves VRAM usage and speed.
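
RTF (real-time factor) is synthesis wall-clock time divided by audio duration, so the CPU figures above follow directly:

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time to synthesize a clip at a given real-time factor."""
    return audio_seconds * rtf

# CPU at RTF 3-5x: 30 s of audio takes 90-150 s to generate
print(synthesis_seconds(30, 3.0))  # 90.0
print(synthesis_seconds(30, 5.0))  # 150.0
# RTF below 1.0 means faster than real time, the target for interactive use
```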

Benchmarks

Seed‑TTS (WER↓, lower is better)

| Model | test-zh | test-en |
| --- | --- | --- |
| CosyVoice 3 | 0.71 | 1.45 |
| MiniMax-Speech | 0.83 | 1.65 |
| F5-TTS | 1.56 | 1.83 |
| Qwen3-TTS-1.7B-Base | 0.77 | 1.24 |

English is state of the art among these systems; Chinese is narrowly led by CosyVoice 3 (0.71 vs 0.77).
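
For reference, WER here is the word-level edit distance between the ASR transcript of the synthesized audio and the input text, divided by the number of reference words. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between ref[:i] and hyp[:j], updated row by row
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (free if words match)
            prev = cur
    return dp[-1] / len(ref)

print(wer("the cat sat", "the bat sat"))  # one substitution over three words
```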

Multilingual (10 languages, WER↓; selected languages below)

| Language | WER↓ | Speaker similarity↑ |
| --- | --- | --- |
| Chinese | 0.928 | 0.799 |
| English | 0.934 | 0.775 |
| German | 1.235 | 0.775 |
| Italian | 0.948 | 0.817 |
| Korean | 1.755 | 0.799 |
| French | 2.858 | 0.714 |

Lowest WER in 6/10 languages, and best speaker similarity across all 10.

Latency

| Model | First packet |
| --- | --- |
| Qwen3-TTS | 97 ms |
| OpenAI TTS | ≈150 ms |
| ElevenLabs | ≈200 ms |

Comparison with Other Open‑Source TTS

Other TTS covered on this blog.

| Model | Parameters | Japanese | Voice cloning | Setup |
| --- | --- | --- | --- | --- |
| Qwen3-TTS | 0.6B / 1.7B | ✅ | ✅ (3 s) | pip install |
| KugelAudio | 7B | | ✅ (5–30 s) | ComfyUI |
| Pocket TTS | 100M | | | pip install |
| Qwen3-Omni | 30B (3B active) | | | - |
Qwen3-TTS is the only model here that combines Japanese support, voice cloning, and a single-pip-command install.