
ZONOS2 on an 8GB RTX 4060 Laptop (WSL2): it runs, but ~20x slower than realtime

Ikesan

I ran Zyphra’s ZONOS2 locally on an 8GB RTX 4060 Laptop (WSL2) and got it to generate Japanese audio.
The catch: the 15.3GB weights only fit by spilling into Windows shared GPU memory, you have to override the KV-cache page count by hand, you need the CUDA toolkit for the JIT kernels, and generation runs at roughly 1/20 of realtime. The steps that got me there are below.

Zyphra released ZONOS2 on June 12, 2026.
It’s an MoE TTS with 8B total parameters and 900M active at inference, and the weights are on Hugging Face.
Japanese is Tier 1 alongside English and Chinese, but the comparison clips on the official blog are mostly English, and there’s no ready-made Japanese audio file to drop straight into a post.

The one Japanese sample I could find is the example sentence in the Hugging Face Space multimodalart/ZONOS2.

私は数秒の音声からどんな声でも再現できます。

That’s an example for generating on the Space, not a pre-generated audio file.
As of June 14, 2026, YouTube searches didn’t surface an easy-to-reference Japanese audio demo for ZONOS2 either; the results were mostly old Zonos v0.1 English tutorials.

ZONOS2 in brief

ZONOS2 generates DAC tokens autoregressively from text and turns them back into 44.1kHz audio.
The official blog describes it like this.

| Item | Detail |
| --- | --- |
| Model size | 8B total params, 900M active |
| Architecture | Sparse MoE |
| Training audio | 6M+ hours |
| Output | 44.1kHz audio via DAC |
| Voice cloning | Conditioned on a 2048-dim speaker embedding (the code and config use a Qwen3 voice embedding; the official overview text says ECAPA-TDNN, which doesn’t match) |
| License | Apache 2.0 per the model card |
| Max generation | The blog says up to 1 minute of multilingual, code-switching audio |

Hugging Face’s params.json lists 28 layers, hidden dim 2048, 9 codebooks, codebook size 1024, and a max sequence length of 6144.
The MoE has 16 experts, top-k 1, with only layer 26 using top-k 2.

The main checkpoint, model.pth, is about 15.3GB.
With 15.3GB of weights and a CUDA-only implementation, the natural assumption is that this is an open-weight TTS meant to run on a CUDA server.

Japanese is Tier 1

The Hugging Face and GitHub READMEs split supported languages into three tiers.
Tier 1 is English, Chinese, and Japanese. Tier 2 adds Korean, Russian, Italian, Portuguese, French, Spanish, Vietnamese, German, Hebrew, and Dutch.

The API also has a language parameter, with ja available as the text-normalization setting.

{
  "text": "私は数秒の音声からどんな声でも再現できます。",
  "language": "ja",
  "stream": true
}

The reason Japanese gets first-class treatment in ZONOS2 comes from a change since the old Zonos v0.1.
Per Zyphra, ZONOS2 doesn’t rely on explicit phonemization and instead uses raw UTF-8 bytes as its input representation.
They write that this cuts down on errors from phonemization dictionaries and language labels, and improves handling of non-European languages like Chinese, Korean, and Japanese.
And because it doesn’t depend on fixed language tokens, it handles mid-sentence language switching more easily too.
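To make “raw UTF-8 bytes as input” concrete, here is a small illustration of what a byte-level view of Japanese and mixed-language text looks like. It is not ZONOS2’s preprocessing code; the function name and the mixed-language string are made up for the example.

```python
# Illustration only: what "raw UTF-8 bytes" look like for Japanese and mixed text.
# This is NOT ZONOS2's actual preprocessing; to_byte_ids is a hypothetical helper.

def to_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values (0-255) of the text, in order."""
    return list(text.encode("utf-8"))


japanese = "私は数秒の音声からどんな声でも再現できます。"
mixed = "Hello、世界"  # hypothetical code-switching input

ja_ids = to_byte_ids(japanese)
mixed_ids = to_byte_ids(mixed)

# Kana, kanji, and Japanese punctuation each take 3 bytes in UTF-8,
# while ASCII letters take 1 byte, all in the same 0-255 vocabulary.
print(len(japanese), "characters ->", len(ja_ids), "bytes")
print(len(mixed), "characters ->", len(mixed_ids), "bytes")
```

Both strings end up in the same 256-value vocabulary, so no language token or phoneme dictionary is involved at this stage.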

The official blog’s audio samples are mostly English

Zyphra’s official blog has comparison clips lining up ZONOS2, Fish Audio, Qwen, Cartesia, ElevenLabs, and others.
I could pull 33 audio assets from the page. The displayed speakers/prompts were things like Dwarkesh, Trump, British Female, Parks and Recreation Guy, David Attenborough, Arlechino, and Obama, and the visible text was English.

The official blog is an audio-comparison page, but it’s not a place to grab a fixed Japanese sample URL.
The Tier 1 Japanese claim has to be checked against the model card, the README, the Space’s example sentences, or audio you generate yourself.

The Hugging Face Spaces have Japanese example text

As of June 14, 2026, there are two Spaces using ZONOS2 on Hugging Face.

| Space | Status | Japanese clue |
| --- | --- | --- |
| multimodalart/ZONOS2 | Running on Zero | Japanese in the language dropdown; Japanese in the examples |
| Mike0021/zonos2 | ZeroGPU | ja in the language dropdown; examples are English and French |

multimodalart/ZONOS2’s app.py has "Japanese": "ja" in its language map, and this line in the examples.

["私は数秒の音声からどんな声でも再現できます。", "Japanese"]

That was the only Japanese example I found as of June 14, 2026.
But it’s the kind you run through the Space UI to generate audio, not a fixed-URL audio file you can paste straight into a post like the official blog’s clips.

Local execution assumes Linux and CUDA

The README’s Quick Start says the only supported platform is Linux x86_64, and you need an NVIDIA GPU with the CUDA Toolkit.
This isn’t the kind of local LLM where you grab a GGUF and try it on Apple Silicon.

The GitHub README’s launch example looks like this.

git clone https://github.com/Zyphra/Zonos2.git
cd Zonos2
uv sync
uv run python -m zonos2 --model-path Zyphra/ZONOS2 --tts-default-voices-dir ./default_voices/

The server comes up at http://localhost:1919 by default.
To send Japanese, put ja in language.

curl -X POST http://localhost:1919/tts/generate \
  -H "Content-Type: application/json" \
  -d '{"text":"私は数秒の音声からどんな声でも再現できます。","language":"ja","stream":true}' \
  --output zonos2-ja.pcm
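The same request can be sketched in Python with only the standard library. The URL and JSON fields come from the curl example; the helper names build_tts_request and fetch_pcm are my own, and the sketch assumes the server streams raw bytes in the response body as it does for curl.

```python
import json
import shutil
import urllib.request

# Endpoint and fields mirror the curl example; helper names are hypothetical.
TTS_URL = "http://localhost:1919/tts/generate"


def build_tts_request(text: str, language: str = "ja",
                      stream: bool = True) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = {"text": text, "language": language, "stream": stream}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        TTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_pcm(text: str, out_path: str) -> None:
    """Send the request and copy the raw response bytes to out_path.

    Needs a running ZONOS2 server on localhost:1919.
    """
    with urllib.request.urlopen(build_tts_request(text)) as resp, \
            open(out_path, "wb") as f:
        shutil.copyfileobj(resp, f)


# Usage, with the server running:
# fetch_pcm("私は数秒の音声からどんな声でも再現できます。", "zonos2-ja.pcm")
```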

The response is float32 PCM, 44.1kHz, mono. Convert to WAV like this.

ffmpeg -f f32le -ar 44100 -ac 1 -i zonos2-ja.pcm zonos2-ja.wav
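If ffmpeg isn’t at hand, the conversion can also be sketched with Python’s standard library. This version assumes the file really is little-endian float32 mono at 44.1kHz, as described above, and writes 16-bit PCM because the wave module only handles integer samples. The function name is my own.

```python
import array
import sys
import wave


def f32le_to_wav(pcm_bytes: bytes, wav_path: str, sample_rate: int = 44100) -> int:
    """Write little-endian float32 mono PCM as a 16-bit WAV; return the sample count."""
    samples = array.array("f")
    # Drop any trailing partial sample so the length is a multiple of 4 bytes.
    usable = len(pcm_bytes) - len(pcm_bytes) % 4
    samples.frombytes(pcm_bytes[:usable])
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian

    # Clip to [-1.0, 1.0] and scale to the signed 16-bit range.
    ints = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    if sys.byteorder == "big":
        ints.byteswap()  # WAV data is little-endian

    with wave.open(wav_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(ints.tobytes())
    return len(ints)


# Usage:
# with open("zonos2-ja.pcm", "rb") as f:
#     f32le_to_wav(f.read(), "zonos2-ja.wav")
```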

The Hugging Face model card’s launch command is python -m minisgl, while the GitHub README uses python -m zonos2.
As of June 14, 2026 the two disagree, so check both the latest GitHub README and the Space implementation when you run it.

Actually running it on an 8GB RTX 4060 Laptop

When the docs only say “Linux x86_64 + CUDA”, an 8GB laptop looks hopeless.
Running it on an RTX 4060 Laptop (8GB VRAM) under Windows 11 + WSL2, I got stuck twice along the way, but it eventually generated Japanese audio.

Test setup:

  • Windows 11 + WSL2 (Ubuntu 22.04)
  • NVIDIA GeForce RTX 4060 Laptop GPU (8GB dedicated VRAM)
  • 31.7GB host RAM (WSL gets ~15.8GB by default)
  • torch 2.9.1+cu128, model in bf16

A naive load stops at the KV cache

Loading directly with TTSLLM(model_path="Zyphra/ZONOS2"), the bf16 weights are about 14.3GiB and don’t fit in 8GB of physical VRAM.
It doesn’t crash, though, because the WSL2 NVIDIA driver has a “system-memory fallback”: anything past VRAM spills automatically into shared GPU memory (the host’s main RAM), and cudaMalloc succeeds even past 8GB.

| Metric | Value |
| --- | --- |
| GPU memory PyTorch requested | 14.287 GiB |
| Physical VRAM used | 7.95GB (pinned at the 8GB ceiling, 0 free) |
| Spill into shared GPU memory (measured peak on Windows) | 7.12GB |
| WSL RAM peak during the CPU-side load | ~15.4GB (of 15.8GB) |

Because the implementation uses torch.load(map_location="cpu") to expand the 15.3GB checkpoint into system RAM first and then move it to the GPU, the load nearly maxes out WSL’s RAM too.

But it stops here. The error isn’t an OOM crash but this assertion.

AssertionError: Not enough memory for KV cache, try reducing --num-tokens

It’s not that the KV cache (the working area that holds each token’s key/value during generation) is too big.
The weights ate all 8GB of physical VRAM first, leaving 0 physical VRAM for the KV cache, so the engine decided it couldn’t allocate even one page and stopped.

Override the KV-cache page count to get past it

Reading engine/config.py, there’s an override parameter for the KV-cache page count.

num_page_override: int | None = None  # if not None, will override the number of pages

The automatic calculation derives the page count from “free memory before load − weight size”, so when the weights are larger than the free memory the result goes negative and the assert fires.
Setting num_page_override explicitly skips that calculation entirely.
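As a rough model of that behavior (not the engine’s actual code), the logic can be sketched like this. The function name, the assumption that about 8GiB is free before the load, and the 56KiB page size (back-derived from 4096 pages being roughly 0.22GiB) are all mine.

```python
from typing import Optional

GIB = 1024**3


def auto_num_pages(free_bytes_before_load: int, weight_bytes: int,
                   page_bytes: int,
                   num_page_override: Optional[int] = None) -> int:
    """Hypothetical stand-in for the engine's KV-cache page-count logic."""
    if num_page_override is not None:
        # An explicit override skips the free-memory calculation entirely.
        return num_page_override
    budget = free_bytes_before_load - weight_bytes
    num_pages = budget // page_bytes
    assert num_pages > 0, "Not enough memory for KV cache, try reducing --num-tokens"
    return num_pages


free_vram = 8 * GIB            # assumption: about 8 GiB free before the load
weights = int(14.287 * GIB)    # GPU memory PyTorch requested for the weights
page_bytes = 56 * 1024         # assumption: ~0.22 GiB / 4096 pages

try:
    auto_num_pages(free_vram, weights, page_bytes)
except AssertionError as err:
    print("auto:", err)

pages = auto_num_pages(free_vram, weights, page_bytes, num_page_override=4096)
print("override:", pages, "pages,", round(pages * page_bytes / GIB, 2), "GiB")
```

The override only gets past the check; the pages still have to live somewhere, which on this machine means shared memory.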

tts = TTSLLM(model_path="Zyphra/ZONOS2", dtype=torch.bfloat16,
             num_page_override=4096, max_running_req=1)

With that, the KV cache (4096 pages = 0.22GiB) also lands in shared memory, and the assert passes.
The next wall was different.

RuntimeError: Could not find CUDA installation. Please set CUDA_HOME environment variable.

ZONOS2 JIT-compiles custom CUDA kernels at runtime, for things like the embedding lookup.
That needs the CUDA toolkit (nvcc). WSL only had the driver and torch’s bundled runtime, with no nvcc, so it stopped during the JIT in CUDA graph capture.
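A preflight check along these lines would catch the missing compiler before the 15GB load starts. It is a generic sketch, not part of ZONOS2: the function name is hypothetical, and it only looks for nvcc under CUDA_HOME and on PATH.

```python
import os
import shutil
from typing import Mapping, Optional


def find_nvcc(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a path to nvcc from CUDA_HOME or PATH, or None if not found."""
    env = os.environ if env is None else env

    # Prefer an explicit CUDA_HOME, since the error message asks for it.
    cuda_home = env.get("CUDA_HOME")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate):
            return candidate

    # Fall back to searching the PATH from the given environment.
    return shutil.which("nvcc", path=env.get("PATH"))


if __name__ == "__main__":
    nvcc = find_nvcc()
    if nvcc is None:
        print("nvcc not found: install the CUDA toolkit and set CUDA_HOME")
    else:
        print("nvcc found at", nvcc)
```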

Install the CUDA toolkit and it reaches generation

Install the CUDA 12.8 toolkit in WSL and set CUDA_HOME.

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-minimal-build-12-8
export CUDA_HOME=/usr/local/cuda-12.8
export PATH="$CUDA_HOME/bin:$PATH"

With that the JIT works, and it moved from CUDA graph capture through to generation.

generate OK in 105.3s
frames=398 eos=390 sr=44100
max_alloc_after_gen=15.508 GiB

Here’s the audio it actually generated. This is the output from local generation on the 8GB laptop, converted straight to MP3.

The result for “私は数秒の音声からどんな声でも再現できます。” (“I can reproduce any voice from a few seconds of audio.”) is a 44.1kHz, mono, ~4.5-second WAV.
Throughput was about 3.8 frames/s (398 frames in 105.3 seconds), so 4.5 seconds of audio took 105 seconds to generate. That’s roughly 1/20 of realtime, nowhere near real time.
About half the weights live in main RAM, and every inference step reads them over PCIe. That round trip is what slows it down.
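The slowdown can be worked out directly from the log line and the clip length quoted above. The extrapolation to a 60-second clip assumes the rate stays constant, which I haven’t measured.

```python
import re

# Log lines as printed by the run above.
log = """generate OK in 105.3s
frames=398 eos=390 sr=44100"""

elapsed_s = float(re.search(r"generate OK in ([\d.]+)s", log).group(1))
audio_s = 4.5  # approximate clip length quoted in the text

slowdown = elapsed_s / audio_s
print(f"{slowdown:.1f}x slower than realtime")

# Assumption: the rate is constant, so a 60-second clip scales linearly.
minutes_for_60s = 60 * slowdown / 60
print(f"~{minutes_for_60s:.0f} minutes for a 60-second clip")
```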

Why it fit in 8GB

There’s only 8GB of dedicated VRAM, but Windows’ shared GPU memory can use up to about half of host RAM.
Here that’s 8GB dedicated + ~15.8GB shared = ~24GB of GPU-side budget, and the 15.5GiB peak fit inside it. On size alone it was always going to fit, and it did.
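The same budget check as arithmetic, with units made explicit. This sketch treats the 8GB and 15.8GB figures as decimal GB and the PyTorch peak as GiB, which is my reading of the numbers rather than something the tools state.

```python
GB = 1000**3
GIB = 1024**3

dedicated = 8 * GB        # dedicated VRAM (assumed decimal GB)
shared = 15.8 * GB        # shared GPU memory, about half of host RAM
peak = 15.508 * GIB       # max_alloc_after_gen reported by PyTorch

budget_gb = (dedicated + shared) / GB
peak_gb = peak / GB
fits = peak <= dedicated + shared
spill_gb = (peak - dedicated) / GB  # portion that cannot sit in dedicated VRAM

print(f"budget {budget_gb:.1f} GB, peak {peak_gb:.2f} GB, fits: {fits}")
print(f"at least {spill_gb:.2f} GB has to live in shared memory")
```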

But “it runs” and “it’s usable” are different things.

  • You have to set the KV-cache page count by hand in code (not the official, standard procedure)
  • An extra CUDA toolkit install is required
  • Speed is about 1/20 of realtime

To use it the straightforward way, the floor is a 16GB-class GPU or larger that can hold the 14.3GiB of weights entirely in physical VRAM.
Running it on 8GB is a brute-force trick that depends on the fallback.

flowchart TD
    A[Naive load on 8GB VRAM] --> B{14.3GiB weights exceed 8GB physical}
    B --> C[Fallback spills<br/>~7GB into shared memory]
    C --> D{0 physical VRAM<br/>left for KV}
    D -->|stop| E[Not enough memory<br/>for KV cache]
    E --> F[Set page count by hand<br/>with num_page_override]
    F --> G{JIT compile of<br/>custom kernels}
    G -->|stop| H[No CUDA_HOME<br/>nvcc missing]
    H --> I[Install CUDA 12.8 toolkit]
    I --> J[Generation succeeds<br/>4.5s audio in 105s]

Fixing Japanese proper-noun pitch accent with spelling

Now that local generation worked on the 8GB laptop, I had this blog’s character “Kana-chan” say a line for fun.
The line is “こんちわ~、かなだよ。今何してるのかな?” (“Hiya, I’m Kana. What’re you up to?”).

Here a very Japanese-TTS problem showed up. The name “かな” (Kana) is read with the same pitch accent as the common noun “仮名” (kana, the kana writing system).
The pitch rises at the end, and it doesn’t sound like a person’s name.

The cause is structural. Because ZONOS2 takes raw UTF-8 bytes with no phonemization step, it has neither a pitch-accent dictionary nor any “this is a proper noun” hint.
With nothing to tell whether “かな” is a name or a common noun, it reads it with the “仮名” accent that’s presumably more common in the training data.
Running it through the Space’s text normalization doesn’t fix it; that handles numbers and symbols, not pitch accent.

There’s no way to specify the accent directly, so I worked around it by giving hints through spelling.
I changed only how “かな” is written, kept the voice and seed fixed, and compared by ear.

Hiragana “かな”. The name takes the “仮名” accent and the pitch rises at the end.

Katakana “カナ”. Reads cleanly with a name-like accent.

“かなちゃん” with the -chan suffix. The suffix also nudges it toward the name reading.

Katakana spelling or a “-chan” suffix pulled it toward the name accent.
If you’re having it speak Japanese names or proper nouns, it’s more stable to write them in katakana or as a nickname than to throw plain hiragana at it.
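Generating the spelling variants is easy to script. One trap in this particular line is that “かな” appears twice, once as the name and once as the sentence-final particle in “のかな”, so the sketch below rewrites only the first occurrence. The helper names are my own, and the “-chan” variant is a guess at the wording rather than the exact text used in the comparison.

```python
def hiragana_to_katakana(text: str) -> str:
    """Shift hiragana code points (U+3041-U+3096) into the katakana block."""
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c
        for c in text
    )


def spelling_variants(line: str, name: str) -> dict[str, str]:
    """Build test lines that differ only in how the first `name` is written."""
    return {
        "hiragana": line,
        # count=1: only the first match (the name) is rewritten.
        "katakana": line.replace(name, hiragana_to_katakana(name), 1),
        "chan": line.replace(name, name + "ちゃん", 1),
    }


line = "こんちわ~、かなだよ。今何してるのかな?"
variants = spelling_variants(line, "かな")
for label, text in variants.items():
    print(label, text)
```

Each variant can then be sent through the same voice and seed so that the spelling is the only thing that changes.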

Sampling temperature matters too.
Pushing temperature up to 1.3 broke the pronunciation itself: words collapsed before the name accent was even an issue. Dropping back to 0.7–0.8, the range used for the comparisons above, restores clarity. Keep the temperature low if you want a consistent voice.

How it compares to other local TTS

I’ve looked at Qwen3-TTS, LuxTTS, and the widely used, Japanese-specialized Irodori-TTS on this blog before.
ZONOS2 sits in a different spot from all of them.

| Model | Main strength | Japanese | Runtime weight |
| --- | --- | --- | --- |
| ZONOS2 | High-fidelity voice cloning, MoE, 44.1kHz | Tier 1 | 15GB-class, CUDA-only |
| Irodori-TTS | Japanese-specialized, emoji-driven style and emotion control, zero-shot cloning | Japanese-specialized | ~500M, light (runs on CPU too) |
| Qwen3-TTS | Easy setup, 10 languages incl. Japanese, 3-second cloning | Supported | 0.6B/1.7B class |
| LuxTTS | Lightweight, 1GB VRAM, speed | Not aimed at Japanese | Light |

If Japanese is all you need, the Japanese-specialized, lightweight Irodori-TTS or the easy-to-set-up Qwen3-TTS take fewer steps.
ZONOS2 is the pick when you want high-fidelity cloning, 44.1kHz quality, and multilingual code-switching, at the cost of being heavy.

What’s clear so far

ZONOS2 puts Japanese in Tier 1.
The API has ja, and the Hugging Face Spaces have Japanese example text.
I couldn’t pull a ready-made Japanese sample from the official blog’s comparison clips, but this time I generated one locally on an 8GB RTX 4060 Laptop and produced the Japanese audio myself.

Here are the current options for getting Japanese audio out of it.

| Use | Current option |
| --- | --- |
| Check the official spec | Hugging Face model card, GitHub README |
| Get a Japanese example sentence | The examples in the multimodalart/ZONOS2 Space |
| Get Japanese audio easily | Hugging Face Space (ZeroGPU) or Zyphra Cloud |
| Get Japanese audio locally | A 16GB+ GPU is the clean path; 8GB works only as a fallback-dependent trick (~1/20 realtime) |

I actually got to hear the Japanese output, and even the default voice was usable as-is for short lines.
But proper-noun readings need spelling nudges; the name “かな” coming out with the “仮名” accent is one example.

References