
Qwen3-TTS — Open-source speech synthesis with a single pip install

Open‑source TTS options have exploded and are hard to keep up with, but Qwen3‑TTS stands out for both ease of setup and Japanese quality. You don’t need ComfyUI or custom workflows—audio comes out with just a few lines of Python.

Relationship to CosyVoice

There are two TTS project lines within Alibaba, so here’s a quick map.

| Project | Team | Architecture |
| --- | --- | --- |
| CosyVoice (1/2/3) | Tongyi Lab Speech Team | LLM + Flow Matching (DiT) |
| Qwen3-TTS | Qwen Team | Discrete multi-codebook LM (no DiT) |

CosyVoice has been under active development since 2024 and combines an LLM with DiT‑based flow matching. Qwen3‑TTS is built by a different team and adopts a discrete multi‑codebook LM architecture without DiT. The benchmarks explicitly compare against CosyVoice 3.

Model Variants

| Model | Parameters | Capabilities | Instruction control |
| --- | --- | --- | --- |
| 1.7B-CustomVoice | 1.7B | 9 preset voices + style control | Yes |
| 1.7B-VoiceDesign | 1.7B | Natural-language voice design | Yes |
| 1.7B-Base | 1.7B | 3-second voice cloning + fine-tuning | No |
| 0.6B-CustomVoice | 0.6B | Lightweight preset voices | No |
| 0.6B-Base | 0.6B | Lightweight voice cloning | No |

All models support streaming generation. They use a dedicated tokenizer, Qwen3‑TTS‑Tokenizer‑12Hz (16‑layer multi‑codebook, 12 Hz sampling).

Three Modes

CustomVoice — Choose from nine high‑quality preset voices and synthesize speech. Also supports instructions for emotion and speaking style.

Preset voices:

  • Chinese: Vivian, Serena, Uncle_Fu
  • Dialects: Dylan (Beijing), Eric (Sichuan)
  • English: Ryan, Aiden
  • Japanese: Ono_Anna
  • Korean: Sohee

VoiceDesign — Design a voice in natural language, e.g., “a calm, low male voice” or “a bright, young female speaking style.” A key strength is generating rights‑clear voices without using someone else’s voice.

Base (voice cloning) — Clone a voice from a 3‑second reference. Fine‑tuning is also supported.

Architecture

Qwen3‑TTS adopts a discrete multi‑codebook LM architecture. Instead of the usual LLM + DiT (flow‑matching) cascade, it generates speech end‑to‑end.

Qwen3-TTS-Tokenizer-12Hz

A dedicated audio tokenizer. It samples at 12 Hz and uses a 16‑layer multi‑codebook to convert audio into a token sequence.

  • 12 Hz = 12 frames per second; much lower than typical TTS tokenizers (25–50 Hz)
  • The 16‑layer codebook compensates for information at the lower frequency
  • Shorter sequences speed up LM inference
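
The effect of the low frame rate on sequence length is easy to quantify. A rough sketch (the 25 Hz single-codebook comparison point is just the low end of the "typical" range above):

```python
def token_counts(seconds: float, frame_rate_hz: float, codebooks: int) -> tuple[int, int]:
    """Return (LM sequence length in frames, total codebook entries) for a clip."""
    frames = int(seconds * frame_rate_hz)
    return frames, frames * codebooks

# Qwen3-TTS-Tokenizer-12Hz: 12 frames/s, 16 codebooks per frame
print(token_counts(10, 12, 16))     # (120, 1920) for 10 s of audio
# A 25 Hz single-codebook tokenizer needs 250 autoregressive steps for the same clip
print(token_counts(10, 25, 1)[0])   # 250
```

Because the LM autoregresses over frames, 10 seconds of audio is 120 steps instead of 250; the 16 codebook entries per frame add per-step width rather than sequence length, which is the point of the multi-codebook design.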

Dual‑Track Streaming

Streaming generation uses a Dual‑Track hybrid architecture. It can start audio output after just one character of input, with an initial packet latency of 97 ms (0.6B model).
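
The actual qwen-tts streaming call isn't shown in this post, so the following illustrates only the consumer side, with a stand-in generator: `generate_stream`, the chunk size, and the sample rate are all hypothetical names and values, not the real interface.

```python
import time

SAMPLE_RATE = 24_000   # hypothetical output sample rate
CHUNK = 2_400          # hypothetical 100 ms chunks

def generate_stream(text: str):
    """Stand-in for a streaming TTS call: yields audio chunks as they are ready."""
    total = SAMPLE_RATE * 2  # pretend the utterance is 2 s long
    for start in range(0, total, CHUNK):
        yield [0.0] * min(CHUNK, total - start)  # silent placeholder audio

t0 = time.monotonic()
first_packet_ms = None
received: list[float] = []
for chunk in generate_stream("こんにちは"):
    if first_packet_ms is None:
        # this is what the 97 ms figure measures: time until the first audio chunk
        first_packet_ms = (time.monotonic() - t0) * 1000
    received.extend(chunk)  # a real player would push each chunk to the audio device
print(len(received) / SAMPLE_RATE)  # 2.0 (seconds of audio received)
```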

Setup

conda create -n qwen3-tts python=3.12 -y
conda activate qwen3-tts
pip install -U qwen-tts

# FlashAttention 2 (recommended for NVIDIA GPUs)
pip install -U flash-attn --no-build-isolation

That’s it. Model weights are downloaded automatically on first run.

Usage

Voice Cloning (Base)

Clone from a 3‑second reference and synthesize.

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_clone(
    text="こんにちは、これはテストです。",  # "Hello, this is a test."
    language="Japanese",
    ref_audio="reference.wav",
    ref_text="リファレンス音声のテキスト",  # transcript of the reference audio
)
sf.write("output.wav", wavs[0], sr)

When generating multiple sentences with the same voice, reuse the prompt via create_voice_clone_prompt().

prompt_items = model.create_voice_clone_prompt(
    ref_audio="reference.wav",
    ref_text="リファレンス音声のテキスト",  # transcript of the reference audio
)

wavs, sr = model.generate_voice_clone(
    text=["1文目。", "2文目。"],  # "First sentence." / "Second sentence."
    language=["Japanese", "Japanese"],
    voice_clone_prompt=prompt_items,
)
for i, wav in enumerate(wavs):
    sf.write(f"output_{i}.wav", wav, sr)

Preset Voices (CustomVoice)

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_custom_voice(
    text="今日はいい天気ですね。",  # "Nice weather today, isn't it?"
    language="Japanese",
    speaker="Ono_Anna",
    instruct="穏やかに話してください",  # "Please speak calmly."
)
sf.write("output.wav", wavs[0], sr)

Voice Design (VoiceDesign)

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

wavs, sr = model.generate_voice_design(
    text="こんにちは。",  # "Hello."
    language="Japanese",
    instruct="落ち着いた低い男性の声で、ゆっくり話す",  # "a calm, low male voice, speaking slowly"
)
sf.write("output.wav", wavs[0], sr)

Web UI Demo

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 0.0.0.0 --port 8000

If you want to record microphone input from the browser, enable HTTPS.

openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes -subj "/CN=localhost"

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base \
  --ip 0.0.0.0 --port 8000 \
  --ssl-certfile cert.pem \
  --ssl-keyfile key.pem \
  --no-ssl-verify

System Requirements

| Model | VRAM guideline | Notes |
| --- | --- | --- |
| 0.6B | 4–8 GB | Runs on older GPUs |
| 1.7B | ≈16 GB | RTX 3090 / 4090 recommended |

  • Apple Silicon: Works via MPS. Confirmed on an M3 MacBook Air (VoiceDesign 1.7B: ~4.2 GB, Base 0.6B: ~2.3 GB).
  • CPU: Works but not recommended. RTF around 3–5× (30 seconds of audio takes 90–150 seconds).
  • FlashAttention 2: Recommended but not required. Improves VRAM usage and speed.
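
RTF (real-time factor) is synthesis wall-clock time divided by audio duration, so the CPU figures above follow directly:

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time to synthesize a clip at a given real-time factor."""
    return audio_seconds * rtf

# CPU at RTF 3-5x: 30 s of audio takes 90-150 s to generate
print(synthesis_seconds(30, 3.0))  # 90.0
print(synthesis_seconds(30, 5.0))  # 150.0
# RTF below 1.0 means faster than real time, the target for interactive use
```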

Benchmarks

Seed‑TTS (WER↓, lower is better)

| Model | test-zh | test-en |
| --- | --- | --- |
| CosyVoice 3 | 0.71 | 1.45 |
| MiniMax-Speech | 0.83 | 1.65 |
| F5-TTS | 1.56 | 1.83 |
| Qwen3-TTS-1.7B-Base | 0.77 | 1.24 |

English is state of the art among these systems; Chinese is narrowly led by CosyVoice 3 (0.71 vs 0.77).
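
For reference, WER here is the word-level edit distance between the ASR transcript of the synthesized audio and the input text, divided by the number of reference words. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between ref[:i] and hyp[:j], updated row by row
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (free if words match)
            prev = cur
    return dp[-1] / len(ref)

print(wer("the cat sat", "the bat sat"))  # one substitution over three words
```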

Multilingual (10 languages, WER↓; selected languages below)

| Language | WER↓ | Speaker similarity↑ |
| --- | --- | --- |
| Chinese | 0.928 | 0.799 |
| English | 0.934 | 0.775 |
| German | 1.235 | 0.775 |
| Italian | 0.948 | 0.817 |
| Korean | 1.755 | 0.799 |
| French | 2.858 | 0.714 |

Lowest WER in 6/10 languages, and best speaker similarity across all 10.

Latency

| Model | First packet |
| --- | --- |
| Qwen3-TTS | 97 ms |
| OpenAI TTS | ≈150 ms |
| ElevenLabs | ≈200 ms |

Comparison with Other Open‑Source TTS

Other TTS covered on this blog.

| Model | Parameters | Japanese | Voice cloning | Setup |
| --- | --- | --- | --- | --- |
| Qwen3-TTS | 0.6B / 1.7B | ✅ | ✅ (3 s) | pip install |
| KugelAudio | 7B | | ✅ (5–30 s) | ComfyUI |
| Pocket TTS | 100M | | | pip install |
| Qwen3-Omni | 30B (3B active) | | | - |
Qwen3-TTS is the only model here that combines Japanese support, voice cloning, and a single-pip-command install.