
Sarashina2.2-TTS Is a Japanese-First Zero-Shot Voice Synthesis Model


SB Intuitions released sarashina2.2-tts, a Japanese-centric text-to-speech model at 0.8B parameters, available on Hugging Face.
It picks up speaker voice, speaking style, and acoustic characteristics from a short reference audio clip and generates speech without fine-tuning.

On this blog I’ve previously looked at Qwen3-TTS, MioTTS, and LuxTTS.
In that context, sarashina2.2-tts feels less about “lightweight,” “commercially friendly,” or “runs on existing LLM runtimes” and more about foregrounding Japanese pronunciation, speaking styles, and training data provenance.

Granular Japanese Speaking Style Samples

What stands out on the model card is that it doesn’t just say “supports Japanese” and move on.
The samples are split across narration, news reading, conversation, customer service, and even rakugo-style speech.

With Japanese TTS, the hard part isn’t just reading accuracy.
Pause timing at punctuation, how casual speech degrades, the stiffness of a news script, the politeness register of phone support—these style differences are immediately noticeable.
sarashina2.2-tts makes this a front-and-center selling point on its model card.

Supported languages are Japanese and English.
The card includes examples of generating Japanese from an English speaker’s reference audio, English from a Japanese speaker’s reference, and code-switching where English phrases appear mid-Japanese sentence.
This isn’t a read-aloud model confined to Japanese alone—it’s aiming to maintain voice identity across both languages.

Three Separate Conditioning Streams from Reference Audio

The sample code in the GitHub repository shows that generation extracts multiple features from the reference audio.

# Discrete speech tokens from the reference clip
audio_prompt_tokens = generator._extract_audio_prompt_tokens(
    audio_prompt_path=audio_prompt_path
)
# Speaker embedding for zero-shot voice transfer
flow_embedding = generator._extract_zero_shot_embedding(
    audio_prompt_path=audio_prompt_path
)
# Acoustic features of the reference audio
audio_prompt_feat = generator._extract_audio_prompt_feat(
    audio_prompt_path=audio_prompt_path
)

Then the target text, a transcript of the reference audio, the reference audio tokens, zero-shot embedding, and acoustic features all go into generate().
Rather than throwing the reference audio in as a single input, the design conditions on text content, speaker identity, and acoustic features separately.
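
Pulled together, the final call plausibly looks like the sketch below.
The keyword names beyond the three extracted variables are my assumptions from the prose above, not the repo's verified signature.

# Hedged sketch: argument names other than the three variables above are
# assumed; check the repo's sample code for the actual signature.
prompt_transcript = "..."  # transcript of the reference clip (you supply this)
wav = generator.generate(
    text="こんにちは、今日はいい天気ですね。",  # target text to synthesize
    prompt_text=prompt_transcript,              # reference transcript (assumed name)
    audio_prompt_tokens=audio_prompt_tokens,    # speech tokens from the reference
    flow_embedding=flow_embedding,              # speaker identity embedding
    audio_prompt_feat=audio_prompt_feat,        # acoustic features
)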

Under the hood, projects incorporated or referenced include CosyVoice, HiFT-GAN, 3D-Speaker, and SilentCipher.
Think of it as a combination of CosyVoice-lineage TTS pipeline, speaker embedding, vocoder, and inaudible watermarking for generated audio.
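
Reading those components together, the flow is presumably something like this (my inference from the listed projects, not a published architecture diagram):

# text + reference speech tokens -> CosyVoice-lineage LM -> speech tokens
# speech tokens + speaker embedding (3D-Speaker)         -> acoustic features
# acoustic features -> HiFT-GAN vocoder                  -> waveform
# waveform -> SilentCipher                               -> watermarked audio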

The watermark is embedded by default.
The README states that all generated audio contains a SilentCipher inaudible watermark and asks users not to remove or disable it.
For voice cloning models, whether this kind of watermark shows up in the README from day one is a practical differentiator.

Local Execution Leans GPU

Setup is a standard Python package install.

git clone https://github.com/sbintuitions/sarashina2.2-tts.git
cd sarashina2.2-tts
python -m venv venv
source venv/bin/activate
pip install -e .
python server/gradio_app.py   # launches the Web UI

For vLLM, add pip install -e ".[vllm]" and launch with python server/gradio_app.py --use-vllm.
Docker images are also provided. The standard Transformers-backend image targets “GPUs with around 6 GB VRAM.”
The vLLM variant is for higher throughput but requires more VRAM per the README.

The model isn’t currently listed on Hugging Face Inference Providers.
No opening the model page in a browser and hitting an API right away.
To try it out, use the GitHub Web UI or Docker, pulling the model from Hugging Face on first run.
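
If you'd rather fetch the weights ahead of time than block on the first launch, the standard huggingface_hub route should work.
The repo ID below is my assumption from the naming; confirm the exact name on the model page.

from huggingface_hub import snapshot_download

# Pre-download the model into the local HF cache so the first Web UI
# launch doesn't have to pull it. Repo ID assumed, not verified.
snapshot_download(repo_id="sbintuitions/sarashina2.2-tts")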

Different Axis from Qwen3-TTS and MioTTS

Lined up against previous local TTS articles, sarashina2.2-tts’s positioning is fairly clear.

| Model | Strong axis | Japanese | Voice cloning | Licensing |
| --- | --- | --- | --- | --- |
| Qwen3-TTS | pip install simplicity, preset voices, natural-language voice design | Supported | Supported | Apache 2.0 |
| MioTTS | GGUF, llama.cpp, Ollama, vLLM (LLM inference stacks) | Supported | Limited by design | Varies per model |
| LuxTTS | 1 GB VRAM and fast inference | Weaker | Supported | Apache 2.0 |
| sarashina2.2-tts | Japanese speaking styles, cross-lingual voice transfer, training data provenance | Primary | Supported | Non-commercial |

Qwen3-TTS is easy to pick up as a candidate for apps and products.
MioTTS is interesting as an experiment in putting TTS onto LLM inference infrastructure.
LuxTTS goes all-in on lightweight and speed.

sarashina2.2-tts is a bit different from all of them.
If you want to hear natural Japanese and zero-shot voice transfer, it’s worth a look—but commercial use or lightweight distribution immediately hits constraints.

Non-Commercial License and Data Transparency

The license is the Sarashina Model NonCommercial License Agreement.
Commercial use of the model is prohibited; if you need it, you have to contact SB Intuitions directly.
The Hugging Face model card also notes that the audio samples on the page are for research purposes and cannot be redistributed or used commercially.

On the other hand, the license text includes a provision that SB Intuitions does not claim rights over the model’s output data.
But since the model itself is non-commercial, working out what you can actually do with generated outputs means reading the output-rights provision and the model usage conditions together.

Regarding training data, the card explains that they used legitimately purchased audio sources, public audio archives, and data collected in compliance with domestic law, while respecting robots.txt and terms of service.
In TTS, “whose voice was learned from what source” becomes as contentious as model evaluation itself, so leading with this explanation matters.

What Can’t Be Judged Yet

As of this writing, neither the model card nor the README includes independent benchmarks or third-party evaluations.
The sample audio is quite informative, but long-form reading, unknown proper nouns, mixed numbers and symbols, dialect-adjacent colloquial speech, and noisy reference audio are all separate questions.

Realistic speeds on Apple Silicon or CPU are also hard to predict.
At 0.8B params it’s not a huge model, but TTS performance depends on audio feature extraction and vocoder stages, not just the LLM core.
Better to assume CUDA GPU first rather than expecting usable speed on a local Mac out of the box.

The gap between English-centric models that “sort of do Japanese” and models that hold up under native Japanese listening is substantial.
sarashina2.2-tts goes after that gap head-on, so even under a non-commercial license it’s worth trying.
Less a candidate for commercial products, more a reference point for hearing where Japanese zero-shot TTS currently tops out.