ZONOS2 on an 8GB RTX 4060 Laptop (WSL2): it runs, but ~20x slower than realtime
I ran Zyphra’s ZONOS2 locally on an 8GB RTX 4060 Laptop (WSL2) and got it to generate Japanese audio.
The catch: the 15.3GB weights only fit by spilling into Windows shared GPU memory, you have to override the KV-cache page count by hand, you need the CUDA toolkit for the JIT kernels, and generation lands around 1/20 of realtime. The steps that got me there are below.
Zyphra released ZONOS2 on June 12, 2026.
It’s an MoE TTS with 8B total parameters and 900M active at inference, and the weights are on Hugging Face.
Japanese is Tier 1 alongside English and Chinese, but the comparison clips on the official blog are mostly English, and there’s no ready-made Japanese audio file to drop straight into a post.
The one Japanese sample I could find is the example sentence in the Hugging Face Space multimodalart/ZONOS2.
私は数秒の音声からどんな声でも再現できます。
That’s an example for generating on the Space, not a pre-generated audio file.
As of June 14, 2026, YouTube searches didn’t surface an easy-to-reference Japanese audio demo for ZONOS2 either; the results were mostly old Zonos v0.1 English tutorials.
ZONOS2 in brief
ZONOS2 generates DAC tokens autoregressively from text and turns them back into 44.1kHz audio.
The official blog describes it like this.
| Item | Detail |
|---|---|
| Model size | 8B total params, 900M active |
| Architecture | Sparse MoE |
| Training audio | 6M+ hours |
| Output | 44.1kHz audio via DAC |
| Voice cloning | Conditioned on a 2048-dim speaker embedding (the code and config use a Qwen3 voice embedding; the official overview text says ECAPA-TDNN, which doesn’t match) |
| License | Apache 2.0 per the model card |
| Max generation | The blog says up to 1 minute of multilingual, code-switching audio |
Hugging Face’s params.json lists 28 layers, hidden dim 2048, 9 codebooks, codebook size 1024, and a max sequence length of 6144.
The MoE has 16 experts, top-k 1, with only layer 26 using top-k 2.
The main checkpoint, model.pth, is about 15.3GB.
With 15.3GB of weights and a CUDA-only implementation, the default mental model is an open-weight TTS you run on a CUDA server.
Japanese is Tier 1
The Hugging Face and GitHub READMEs split supported languages into three tiers.
Tier 1 is English, Chinese, and Japanese. Tier 2 adds Korean, Russian, Italian, Portuguese, French, Spanish, Vietnamese, German, Hebrew, and Dutch.
The API also has a language parameter, with ja available as the text-normalization setting.
{
"text": "私は数秒の音声からどんな声でも再現できます。",
"language": "ja",
"stream": true
}
The reason Japanese gets first-class treatment in ZONOS2 comes from a change since the old Zonos v0.1.
Per Zyphra, ZONOS2 doesn’t rely on explicit phonemization and instead uses raw UTF-8 bytes as its input representation.
They write that this cuts down on errors from phonemization dictionaries and language labels, and improves handling of non-European languages like Chinese, Korean, and Japanese.
And because it doesn’t depend on fixed language tokens, it handles mid-sentence language switching more easily too.
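To make “raw UTF-8 bytes” concrete, here is a small illustrative sketch of what a byte-level input looks like for mixed Japanese and English text. The helper name `utf8_byte_ids` is made up for this example, and this is not ZONOS2’s actual input code; it only shows that the same 0–255 value range covers both scripts without a language tag.

```python
# Illustrative only: what "raw UTF-8 bytes" look like for mixed-language text.
# `utf8_byte_ids` is a hypothetical helper, not part of ZONOS2.

def utf8_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 encoding of `text` as integers in the range 0..255."""
    return list(text.encode("utf-8"))


japanese = utf8_byte_ids("声")        # one kanji becomes three bytes
english = utf8_byte_ids("voice")      # ASCII stays one byte per character
mixed = utf8_byte_ids("声 voice")     # both scripts share one byte sequence

print([hex(b) for b in japanese])
print(english)
print(len(mixed))
```

A byte-level input like this has no per-language vocabulary to switch between, which fits Zyphra’s description of easier mid-sentence language switching.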
The official blog’s audio samples are mostly English
Zyphra’s official blog has comparison clips lining up ZONOS2, Fish Audio, Qwen, Cartesia, ElevenLabs, and others.
I could pull 33 audio assets from the page. The displayed speakers/prompts were things like Dwarkesh, Trump, British Female, Parks and Recreation Guy, David Attenborough, Arlechino, and Obama, and the visible text was English.
The official blog is an audio-comparison page, but it’s not a place to grab a fixed Japanese sample URL.
The Tier 1 Japanese claim has to be checked against the model card, the README, the Space’s example sentences, or audio you generate yourself.
The Hugging Face Spaces have Japanese example text
As of June 14, 2026, there are two Spaces using ZONOS2 on Hugging Face.
| Space | Status | Japanese clue |
|---|---|---|
| multimodalart/ZONOS2 | Running on Zero | Japanese in the language dropdown; Japanese in the examples |
| Mike0021/zonos2 | ZeroGPU | ja in the language dropdown; examples are English and French |
multimodalart/ZONOS2’s app.py has "Japanese": "ja" in its language map, and this line in the examples.
["私は数秒の音声からどんな声でも再現できます。", "Japanese"]
That was the only Japanese example I found as of June 14, 2026.
But it’s the kind you run through the Space UI to generate audio, not a fixed-URL audio file you can paste straight into a post like the official blog’s clips.
Local execution assumes Linux and CUDA
The README’s Quick Start says the only supported platform is Linux x86_64, and you need an NVIDIA GPU with the CUDA Toolkit.
This isn’t the kind of local LLM where you grab a GGUF and try it on Apple Silicon.
The GitHub README’s launch example looks like this.
git clone https://github.com/Zyphra/Zonos2.git
cd Zonos2
uv sync
uv run python -m zonos2 --model-path Zyphra/ZONOS2 --tts-default-voices-dir ./default_voices/
The server comes up at http://localhost:1919 by default.
To send Japanese, put ja in language.
curl -X POST http://localhost:1919/tts/generate \
-H "Content-Type: application/json" \
-d '{"text":"私は数秒の音声からどんな声でも再現できます。","language":"ja","stream":true}' \
--output zonos2-ja.pcm
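The same request can be sketched in Python with only the standard library. The URL and the three JSON fields come from the curl command above; the function names `build_tts_request` and `synthesize_to_file` are my own, and I’m assuming the server streams the audio back as the plain response body.

```python
import json
import urllib.request

# Endpoint and payload fields are taken from the curl example above.
TTS_URL = "http://localhost:1919/tts/generate"


def build_tts_request(text: str, language: str = "ja",
                      stream: bool = True) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {"text": text, "language": language, "stream": stream}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str) -> None:
    """Send the request and write the raw response bytes to `out_path`.

    Needs a running ZONOS2 server; assumes the body is the raw audio stream.
    """
    request = build_tts_request(text)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        while chunk := response.read(65536):
            out.write(chunk)
```

With the server running, calling `synthesize_to_file("私は数秒の音声からどんな声でも再現できます。", "zonos2-ja.pcm")` should be equivalent to the curl command.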
The response is float32 PCM, 44.1kHz, mono. Convert to WAV like this.
ffmpeg -f f32le -ar 44100 -ac 1 -i zonos2-ja.pcm zonos2-ja.wav
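If ffmpeg isn’t available, the conversion can be approximated with Python’s standard library. This is a minimal sketch that assumes the input really is mono float32 little-endian PCM at 44.1kHz, as described above; it writes 16-bit samples, and the function name `f32le_pcm_to_wav` is made up for this example.

```python
import array
import sys
import wave


def f32le_pcm_to_wav(pcm_bytes: bytes, wav_path: str, sample_rate: int = 44100) -> int:
    """Write mono float32 little-endian PCM as a 16-bit WAV; return the sample count."""
    usable = len(pcm_bytes) - (len(pcm_bytes) % 4)  # drop a trailing partial sample
    samples = array.array("f")
    samples.frombytes(pcm_bytes[:usable])
    if sys.byteorder == "big":
        samples.byteswap()  # the input is little-endian

    # Clip to [-1.0, 1.0], then scale to signed 16-bit integers.
    ints = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        ints.byteswap()  # WAV sample data is little-endian

    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(ints.tobytes())
    return len(samples)
```

Reading the file with `open("zonos2-ja.pcm", "rb").read()` and passing the bytes in gives a WAV that ordinary players can open.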
The Hugging Face model card’s launch command is python -m minisgl, while the GitHub README uses python -m zonos2.
As of June 14, 2026 the two disagree, so check both the latest GitHub README and the Space implementation when you run it.
Actually running it on an 8GB RTX 4060 Laptop
When the docs only say “Linux x86_64 + CUDA”, an 8GB laptop looks hopeless.
Running it on an RTX 4060 Laptop (8GB VRAM) under Windows 11 + WSL2, I got stuck twice along the way, but it eventually generated Japanese audio.
Test setup:
- Windows 11 + WSL2 (Ubuntu 22.04)
- NVIDIA GeForce RTX 4060 Laptop GPU (8GB dedicated VRAM)
- 31.7GB host RAM (WSL gets ~15.8GB by default)
- torch 2.9.1+cu128, model in bf16
A naive load stops at the KV cache
Loading directly with TTSLLM(model_path="Zyphra/ZONOS2"), the bf16 weights are about 14.3GiB and don’t fit in 8GB of physical VRAM.
It doesn’t crash, though, because the WSL2 NVIDIA driver has a “system-memory fallback”: anything past VRAM spills automatically into shared GPU memory (the host’s main RAM), and cudaMalloc succeeds even past 8GB.
| Metric | Value |
|---|---|
| GPU memory PyTorch requested | 14.287 GiB |
| Physical VRAM used | 7.95GB (pinned at the 8GB ceiling, 0 free) |
| Spill into shared GPU memory (measured peak on Windows) | 7.12GB |
| WSL RAM peak during the CPU-side load | ~15.4GB (of 15.8GB) |
Because the implementation uses torch.load(map_location="cpu") to expand the 15.3GB checkpoint into system RAM first and then move it to the GPU, the load nearly maxes out WSL’s RAM too.
But it stops here. The error isn’t an OOM crash but this assertion.
AssertionError: Not enough memory for KV cache, try reducing --num-tokens
It’s not that the KV cache (the working area that holds each token’s key/value during generation) is too big.
The weights ate all 8GB of physical VRAM first, leaving 0 physical VRAM for the KV cache, so the engine decided it couldn’t allocate even one page and stopped.
Override the KV-cache page count to get past it
Reading engine/config.py, there’s an override parameter for the KV-cache page count.
num_page_override: int | None = None # if not None, will override the number of pages
The automatic calculation derives the page count from “free memory before load − weight size”, so when the weights are larger than the free memory the result goes negative and the engine dies on the assert.
Setting num_page_override explicitly skips that calculation entirely.
tts = TTSLLM(model_path="Zyphra/ZONOS2", dtype=torch.bfloat16,
num_page_override=4096, max_running_req=1)
With that, the KV cache (4096 pages = 0.22GiB) also lands in shared memory, and the assert passes.
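The arithmetic behind that failure and the override can be sketched as follows. Every function name here is hypothetical, and the real code in engine/config.py may be organised differently; the numbers are the ones measured in this article, with the free memory before load assumed to be roughly the card’s 8GiB and the page size derived from “4096 pages = 0.22GiB”.

```python
from typing import Optional

GIB = 1024 ** 3


# Hypothetical names throughout; this only mirrors the behaviour described above.
def auto_num_pages(free_before_load: int, weight_bytes: int, page_bytes: int) -> int:
    """Pages that fit in whatever is left once the weights are subtracted."""
    return (free_before_load - weight_bytes) // page_bytes


def resolve_num_pages(free_before_load: int, weight_bytes: int, page_bytes: int,
                      num_page_override: Optional[int] = None) -> int:
    """Return the override if given, otherwise the automatic page count."""
    if num_page_override is not None:
        return num_page_override  # the automatic calculation is skipped entirely
    pages = auto_num_pages(free_before_load, weight_bytes, page_bytes)
    assert pages > 0, "Not enough memory for KV cache, try reducing --num-tokens"
    return pages


free_before_load = 8 * GIB             # assumption: roughly the physical VRAM
weight_bytes = int(14.287 * GIB)       # GPU memory PyTorch requested
page_bytes = int(0.22 * GIB) // 4096   # derived from "4096 pages = 0.22GiB"

print(auto_num_pages(free_before_load, weight_bytes, page_bytes))           # negative
print(resolve_num_pages(free_before_load, weight_bytes, page_bytes, 4096))  # 4096
```

The override works only because the driver’s fallback lets the allocation succeed in shared memory; it does not create any extra physical VRAM.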
The next wall was different.
RuntimeError: Could not find CUDA installation. Please set CUDA_HOME environment variable.
ZONOS2 JIT-compiles custom CUDA kernels at runtime, for things like the embedding lookup.
That needs the CUDA toolkit (nvcc). WSL only had the driver and torch’s bundled runtime, with no nvcc, so it stopped during the JIT in CUDA graph capture.
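A quick preflight check can catch this before a multi-minute model load. The sketch below only looks for an `nvcc` executable under `CUDA_HOME` or on `PATH`; the function name `find_nvcc` is my own, and the way ZONOS2’s JIT actually locates the toolkit may differ.

```python
import os
import shutil
from typing import Mapping, Optional


def find_nvcc(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the path to nvcc, checking CUDA_HOME/bin first and then PATH."""
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Fall back to a PATH search; returns None when nothing is found.
    return shutil.which("nvcc", path=env.get("PATH", ""))


nvcc = find_nvcc()
if nvcc is None:
    print("nvcc not found: install the CUDA toolkit and set CUDA_HOME")
else:
    print(f"nvcc found at {nvcc}")
```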
Install the CUDA toolkit and it reaches generation
Install the CUDA 12.8 toolkit in WSL and set CUDA_HOME.
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-minimal-build-12-8
export CUDA_HOME=/usr/local/cuda-12.8
export PATH="$CUDA_HOME/bin:$PATH"
With that the JIT works, and it moved from CUDA graph capture through to generation.
generate OK in 105.3s
frames=398 eos=390 sr=44100
max_alloc_after_gen=15.508 GiB
Here’s the audio it actually generated. This is the output from local generation on the 8GB laptop, converted straight to MP3.
The result for “私は数秒の音声からどんな声でも再現できます。” (“I can reproduce any voice from a few seconds of audio.”) is a 44.1kHz, mono, ~4.5-second WAV.
Throughput was about 4.6 frames/s, so 4.5 seconds of audio took 105 seconds to generate. That’s roughly 1/20 of realtime (105 / 4.5 ≈ 23x slower).
About half the weights live in main RAM, and every inference step reads them over PCIe. That round trip is what slows it down.
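For reference, the realtime factor from this one run works out as below. The inputs are the 105.3 seconds of wall time and the ~4.5 seconds of audio reported above; the one-minute estimate is a naive linear extrapolation, which is an assumption and not something I measured.

```python
def realtime_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute needed per second of generated audio."""
    return wall_seconds / audio_seconds


rtf = realtime_factor(105.3, 4.5)
print(f"{rtf:.1f}x slower than realtime")
print(f"{1 / rtf:.3f}x realtime speed")

# Naive linear extrapolation to the 1-minute maximum (assumption, not measured).
one_minute_wall = 60 * rtf
print(f"~{one_minute_wall / 60:.0f} minutes of compute for 60 seconds of audio")
```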
Why it fit in 8GB
There’s only 8GB of dedicated VRAM, but Windows’ shared GPU memory can use up to about half of host RAM.
Here that’s 8GB dedicated + ~15.8GB shared = ~24GB of GPU-side budget, and the 15.5GiB peak fit inside it. On size alone it was always going to fit, and it did.
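The budget check mixes GB and GiB, so here it is spelled out. This is a sketch that treats the 8GB and 15.8GB figures as decimal gigabytes, which is an assumption (Windows may report binary units under a “GB” label); the conclusion is the same either way because the margin is large.

```python
GB = 10 ** 9
GIB = 2 ** 30


def gib_to_gb(gib: float) -> float:
    """Convert binary gibibytes to decimal gigabytes."""
    return gib * GIB / GB


def gb_to_gib(gb: float) -> float:
    """Convert decimal gigabytes to binary gibibytes."""
    return gb * GB / GIB


dedicated_gb = 8.0                    # dedicated VRAM
shared_gb = 15.8                      # about half of the 31.7GB host RAM
budget_gb = dedicated_gb + shared_gb  # GPU-side budget
peak_gb = gib_to_gb(15.508)           # max_alloc_after_gen from the log

print(f"checkpoint: 15.3 GB = {gb_to_gib(15.3):.2f} GiB")
print(f"budget: {budget_gb:.1f} GB, peak: {peak_gb:.2f} GB, fits: {peak_gb <= budget_gb}")
```

The first line also shows why the 15.3GB checkpoint and the ~14.3GiB of bf16 weights are the same quantity in different units.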
But “it runs” and “it’s usable” are different things.
- You have to set the KV-cache page count by hand in code (not the official, standard procedure)
- An extra CUDA toolkit install is required
- Speed is about 1/20 of realtime
To use it the straightforward way, the floor is a 16GB-class GPU or larger that can hold the 14.3GiB of weights entirely in physical VRAM.
Running it on 8GB is a brute-force trick that depends on the fallback.
flowchart TD
A[Naive load on 8GB VRAM] --> B{14.3GiB weights exceed 8GB physical}
B --> C[Fallback spills<br/>~7GB into shared memory]
C --> D{0 physical VRAM<br/>left for KV}
D -->|stop| E[Not enough memory<br/>for KV cache]
E --> F[Set page count by hand<br/>with num_page_override]
F --> G{JIT compile of<br/>custom kernels}
G -->|stop| H[No CUDA_HOME<br/>nvcc missing]
H --> I[Install CUDA 12.8 toolkit]
I --> J[Generation succeeds<br/>4.5s audio in 105s]
Fixing Japanese proper-noun pitch accent with spelling
Now that local generation worked on the 8GB laptop, I had this blog’s character “Kana-chan” say a line for fun.
The line is “こんちわ~、かなだよ。今何してるのかな?” (“Hiya, I’m Kana. What’re you up to?”).
Here a very Japanese-TTS problem showed up. The name “かな” (Kana) is read with the same pitch accent as the common noun “仮名” (kana, the kana writing system).
The pitch rises at the end, and it doesn’t sound like a person’s name.
The cause is structural. Because ZONOS2 takes raw UTF-8 bytes with no phonemization step, it has neither a pitch-accent dictionary nor any “this is a proper noun” hint.
With nothing to tell it whether “かな” is a name or a common noun, it reads it with the “仮名” accent that’s presumably more common in the training data.
Running it through the Space’s text normalization doesn’t fix it; that handles numbers and symbols, not pitch accent.
There’s no way to specify the accent directly, so I worked around it by giving hints through spelling.
I changed only how “かな” is written, kept the voice and seed fixed, and compared by ear.
- Hiragana “かな”: the name takes the “仮名” accent and the pitch rises at the end.
- Katakana “カナ”: reads cleanly with a name-like accent.
- “かなちゃん” with the -chan suffix: the suffix also nudges it toward the name reading.
Katakana spelling or a “-chan” suffix pulled it toward the name accent.
If you’re having it speak Japanese names or proper nouns, it’s more stable to write them in katakana or as a nickname than to throw plain hiragana at it.
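The katakana respelling is easy to automate for a known list of names. The sketch below relies on the fact that hiragana and katakana sit 0x60 apart in Unicode; the helper names are my own. It rewrites only the first occurrence, because a blind replace would also hit the sentence-final particle “かな” in a line like the one above.

```python
def hiragana_to_katakana(text: str) -> str:
    """Shift hiragana (U+3041..U+3096) to katakana; leave other characters as-is."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def respell_first(line: str, name: str) -> str:
    """Rewrite only the first occurrence of `name` in katakana."""
    return line.replace(name, hiragana_to_katakana(name), 1)


line = "かなだよ。今何してるのかな"
print(respell_first(line, "かな"))
```

Whether the first occurrence is actually the name still has to be checked by a human; this is a convenience, not a proper-noun detector.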
Sampling temperature matters too.
Pushing temperature up to 1.3 broke the pronunciation itself, and words collapsed before the name was even an issue. These comparisons are at temperature 0.7–0.8, and dropping it to that range brings clarity back. Keep the temperature low if you want a consistent voice.
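As a generic illustration of why higher temperature hurts clarity, here is standard temperature-scaled softmax on made-up scores. This is the textbook technique, not ZONOS2’s sampler, which I haven’t inspected; the logits are invented for the example.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (0.7, 1.0, 1.3):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At higher temperature the probability mass moves from the top candidate to the less likely ones, so unlikely audio tokens get sampled more often.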
How it compares to other local TTS
I’ve looked at Qwen3-TTS, LuxTTS, and the widely used, Japanese-specialized Irodori-TTS on this blog before.
ZONOS2 sits in a different spot from all of them.
| Model | Main strength | Japanese | Runtime footprint |
|---|---|---|---|
| ZONOS2 | High-fidelity voice cloning, MoE, 44.1kHz | Tier 1 | 15GB-class, CUDA-only |
| Irodori-TTS | Japanese-specialized, emoji-driven style and emotion control, zero-shot cloning | Japanese-specialized | ~500M, light (runs on CPU too) |
| Qwen3-TTS | Easy setup, 10 languages incl. Japanese, 3-second cloning | Supported | 0.6B/1.7B class |
| LuxTTS | Lightweight, 1GB VRAM, speed | Not aimed at Japanese | Light |
If Japanese is all you need, the Japanese-specialized, lightweight Irodori-TTS or the easy-to-set-up Qwen3-TTS take fewer steps.
ZONOS2 is the pick when you want high-fidelity cloning, 44.1kHz quality, and multilingual code-switching, at the cost of being heavy.
What’s clear so far
ZONOS2 puts Japanese in Tier 1.
The API has ja, and the Hugging Face Spaces have Japanese example text.
I couldn’t pull a ready-made Japanese sample from the official blog’s comparison clips, so I generated the Japanese audio myself, locally on an 8GB RTX 4060 Laptop.
Here are the current options for getting Japanese audio out of it.
| Use | Current option |
|---|---|
| Check the official spec | Hugging Face model card, GitHub README |
| Get a Japanese example sentence | The examples in the multimodalart/ZONOS2 Space |
| Get Japanese audio easily | Hugging Face Space (ZeroGPU) or Zyphra Cloud |
| Get Japanese audio locally | A 16GB+ GPU is the clean path; 8GB works only as a fallback-dependent trick (~1/20 realtime) |
I actually got to hear the Japanese output, and even the default voice was usable as-is for short lines.
But proper-noun readings need spelling nudges; the name “かな” coming out with the “仮名” accent is one example.