
VoxCPM2 and OSS TTS in 2026: Irodori-TTS, F5-TTS, and Japanese fine-tune notes

Ikesan

VoxCPM2 is a 2B-parameter tokenizer-free TTS.
The Hugging Face model card lists 2M+ hours of multilingual audio, 30 languages including Japanese, 48kHz output, and Apache-2.0.
That’s two billion parameters — small for an LLM, but not negligible once you add up everything a TTS stack needs around it.

RTF (real-time factor: seconds of compute per second of audio produced) is around 0.30 on an RTX 4090 with vanilla inference and around 0.13 with Nano-vLLM.
At 0.13, one second of audio costs 0.13 seconds of compute, which puts you within reach of realtime conversation under the right conditions.
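RTF is easy to measure for any model you are comparing: wall-clock synthesis time divided by the duration of the audio that came out. A minimal sketch, with `synthesize` as a hypothetical stand-in for whatever model call is under test:

```python
import time

def measure_rtf(synthesize, text, sample_rate):
    # `synthesize` is a hypothetical stand-in: any callable that takes text
    # and returns a 1-D waveform array at `sample_rate`.
    start = time.perf_counter()
    wav = synthesize(text)
    elapsed = time.perf_counter() - start
    # RTF = seconds of compute per second of audio; below 1.0 is faster
    # than realtime, and 0.13 means 130 ms of compute per audio second.
    return elapsed / (len(wav) / sample_rate)
```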

I’ve covered Qwen3-TTS, MioTTS, LuxTTS, and Sarashina2.2-TTS here before.
Among them, VoxCPM2 leans away from lightness and toward preserving as much of the original audio character as possible.

TTS that skips discrete tokens

The OpenBMB README describes VoxCPM as tokenizer-free Text-to-Speech.
Traditional high-quality TTS often runs audio through a codec or tokenizer to turn it into discrete tokens, has an audio LM generate token sequences, then sends those through a vocoder back to waveform.

Fish Speech and Bark-style designs sit in this lineage.
By collapsing audio into “symbol streams that an LLM can chew on,” they get to bring text-LM techniques along directly.
The cost is that breath, pauses, the way consonants get crushed, the recorded-room air, and emotional micro-shifts tend to get lost at the discretization step.

The VoxCPM family swings the other way and keeps those qualities in continuous audio representations instead of discretizing them.
The model card lists a tokenizer-free Diffusion Autoregressive setup with LocEnc, TSLM, RALM, and LocDiT in the pipeline.
Roughly: don’t force audio into symbol IDs, keep it continuous, and let a diffusion-style generator fill it in.
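To make the structural difference concrete, here is a toy contrast of the two pipeline shapes. Everything below is a dummy stand-in (random arrays, hypothetical function names), not VoxCPM2's or any codec TTS's real code; the point is only where the quantization step sits.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- hypothetical stand-ins, not real model code ---
def audio_lm_generate(text):          # LM emits discrete codec token IDs
    return rng.integers(0, 1024, size=len(text) * 4)

def vocoder_decode(token_ids):        # codec tokens back to a waveform
    return rng.standard_normal(len(token_ids) * 320)

def speech_lm_encode(text):           # continuous conditioning features
    return rng.standard_normal((len(text) * 4, 64))

def diffusion_fill(hidden):           # diffusion/flow fills continuous latents
    return hidden + 0.1 * rng.standard_normal(hidden.shape)

def latent_decoder(latents):          # continuous latents to a waveform
    return latents.reshape(-1)

# Discrete route: audio is squeezed through symbol IDs, and whatever the
# codebook cannot express (breath, room tone, micro-prosody) dies right there.
wav_discrete = vocoder_decode(audio_lm_generate("こんにちは"))

# Tokenizer-free route: representations stay continuous end to end, so there
# is no quantization step for fine detail to be lost at.
wav_continuous = latent_decoder(diffusion_fill(speech_lm_encode("こんにちは")))
```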

This is close to the texture gap between VQ-VAE-style discrete latents and continuous latent diffusion in image generation.
Audio has temporal coherence so it’s not a clean drop-in, but if you want to preserve breath and prosody, you can see why the continuous side is appealing.

2B at 48kHz, but not a lightweight

2B is on the small side for LLMs, but TTS doesn’t end with the LM.
Audio VAE, diffusion decoder, denoiser, and any streaming server stack together decide how heavy it feels in practice.

Model Details on the card list ~8GB VRAM, bfloat16 dtype, and an LM token rate of 6.25Hz, i.e. one LM token per 160ms of audio.
On an RTX 4090 this is genuinely fast, but expecting the same feel on Mac MPS is unrealistic.
Diffusion and flow-matching systems especially are tuned with CUDA in mind, and the gap on Apple Silicon shows up clearly.

On an M1 Max 64GB, lightweight options like Kokoro, Piper, and XTTS v2 are easy to spin up locally.
Fish Speech and F5-TTS are case-by-case.
For VoxCPM2’s full quality, “does it run” is the easy question; “is it tolerable as conversational latency” is the one to actually check.

OSS TTS has split into multiple directions

OSS TTS over the past year hasn’t just gotten higher-quality — the directions models aim at have diverged.

| Model | Direction |
| --- | --- |
| VoxCPM2 | tokenizer-free, 2B, 30 languages, 48kHz, voice design, controlled cloning |
| F5-TTS | flow matching, lighter footprint, fast inference, voice matching from short reference audio |
| Fish Speech | LLM-based, multilingual, with emotion tags and conversational feel |
| CosyVoice2 | large-scale zero-shot TTS that handles streaming and offline in one framework |
| IndexTTS | strong controllability in the Chinese-language space, voice preservation, long-text stability |
| Kokoro | 82M-class lightweight TTS, suited for embedded or always-on local use |
| Piper | CPU-oriented lightweight reader, stability over peak quality |

F5-TTS reports PyTorch-offline RTF 0.1467 on an L20 GPU, with TensorRT-LLM going lower still.
CosyVoice2 brings streaming and offline into one model via an LLM plus chunk-aware causal flow matching.
Fish Speech drops G2P conversion entirely and lets the LLM handle linguistic features directly, leaning into multilingual support and voice cloning.

Speech synthesis is now split across “read text lightly,” “match a voice,” “control emotion,” “respond in realtime,” and “don’t break across languages” — and each direction has its own strong model.
VoxCPM2 is the one going hard on tokenizer-free quality.

Japanese TTS still has multiple winning angles

Japanese TTS is hard to judge from English-language rankings alone.
Pitch accent, vowel devoicing, breath leakage, context-dependent intonation, and anime-style vocal delivery all factor in, and “Japanese supported” on a model card often doesn’t match the practical feel.

Beyond the general-purpose large models above, there are several practical options for Japanese local use.

| Model | Direction |
| --- | --- |
| Sarashina2.2-TTS | Japanese-made, ~500M, plain Japanese zero-shot TTS from SB Intuitions |
| Irodori-TTS | Japanese-focused, Rectified Flow DiT, 500M, emoji-driven emotion/effect control, MIT |
| Style-Bert-VITS2 | VITS-family, numerical control over emotion and style strength, runs even on CPU |
| AivisSpeech | practical engine built on the Style-Bert-VITS2 family, Mac/Windows, runs without GPU |
| VOICEVOX | character-voice focused, the established choice for stable narration, engine is OSS |

Irodori-TTS is a Japanese-focused, 500M-parameter Rectified Flow Diffusion Transformer that Aratako released between March and May 2026.
Its approach of keeping latent representations continuous and filling them in via diffusion is close to VoxCPM2's, and you can read it as the tokenizer-free trend producing a Japanese-specialized variant.
The distinctive part is emoji-based style control: drop 😭, 🤧, or 👂😮‍💨 into the input text and you get crying, coughing, or whispering styles.
MIT license, 48kHz output, zero-shot voice cloning — and an explicit “kanji reading is weak, convert to hiragana first” disclaimer baked into the model card.

Style-Bert-VITS2 is a long-maintained VITS-family derivative in the Japanese community, where being able to dial emotion strength numerically genuinely matters.
AivisSpeech wraps that into an easier-to-use engine that runs on Mac/Windows without a GPU.
VOICEVOX is character-voice focused, going after stable narration-quality reading.
This group is light on inference cost, which makes it a good fit for local assistants or always-on AITuber-style use.

Large 30-language tokenizer-free models like VoxCPM2 and Japanese-focused lightweight models like Irodori-TTS or Style-Bert-VITS2 are after different things.
The big side wins on multilingual breadth, the 48kHz “in-the-room” air, and the range of voices reference audio can reproduce.
The Japanese-focused side wins on natural rendering of kanji-mixed text, lightness, and operational ease.

If you’re working locally in Japanese, the realistic order is to first run everyday text through Irodori-TTS or Style-Bert-VITS2 and listen to where they break, then move up to VoxCPM2 or CosyVoice2 once voice-matching or multilingual code-switching becomes a bottleneck.

VoxCPM2 itself does include Japanese in its 30 languages, but how it handles Japanese-native proper nouns, numerals mixed into text, long-form reading, colloquial speech, and the pauses around punctuation needs to be listened to separately.
48kHz and voice design are flashy, but in Japanese reading the parts that jump out as “wrong” are often somewhere else entirely.

Why training data ships as a script

Recent Japanese TTS corpora are typically distributed as pairs of audio plus transcribed text — essentially a script.
Phoneme sequences and accent information usually aren’t packaged in, and the reason training still works is that Japanese TTS like Style-Bert-VITS2 is built around the assumption that OpenJTalk and MeCab will do the phoneme and accent extraction.

A rough trace of the Style-Bert-VITS2 training pipeline:

```mermaid
flowchart TD
    A[Audio wav] --> P[Auto preprocess]
    B[Transcribed script] --> C[MeCab morphological analysis]
    C --> R[Reading katakana sequence]
    R --> O[pyopenjtalk_prosody]
    O --> E[Phonemes + up/down marks]
    E --> P
    B --> X[BERT embeddings]
    X --> P
    P --> M[Style-Bert-VITS2 training]
```

So once you have wav files and a transcript, the phoneme sequence and accent marks are generated automatically by pyopenjtalk_prosody.
Corpus authors only need to get the “reading-aloud script” right; they don’t have to hand-annotate phoneme strings or pitch accents.
That automated preprocessing rail is part of why Japanese TTS data has come together this quickly in recent years.
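You can poke at this rail directly with pyopenjtalk. A minimal sketch; the actual Style-Bert-VITS2 preprocessor wraps this with its own prosody-mark conversion, which is not reproduced here:

```python
import pyopenjtalk

text = "音声合成は面白い"

# Katakana reading, resolved from the dictionary the MeCab-family
# analyzer ships with (no context, no audio)
print(pyopenjtalk.g2p(text, kana=True))

# Flat phoneme sequence, the raw material TTS training consumes
print(pyopenjtalk.g2p(text))

# Full-context labels carry the pitch-accent information that gets
# condensed into the up/down marks in the flowchart above
print(pyopenjtalk.extract_fullcontext(text)[1])
```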

The downside is that pyopenjtalk’s reading accuracy directly caps training quality.
A frequently-cited example: the JSUT corpus contains an entry where the written word “月印” is actually spoken as “ルナグラム” — the morphological analyzer’s reading and the actual speech don’t match.
MeCab-family morphological analyzers don’t look at context or audio; they return one reading from a dictionary, so proper nouns, ateji (creative kanji readings), recent slang, mixed English, and personal names get mis-read regularly.
Derivative tools like “Furigana Whisper” exist specifically to correct this by inferring readings from the audio side.
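The detection side of that idea fits in a few lines. A sketch of the check such tools automate, where `asr_kana` is a hypothetical kana transcript you would get from a Whisper pass over the audio (this is not Furigana Whisper's actual code):

```python
import pyopenjtalk

def reading_mismatch(script_text, asr_kana):
    # Dictionary reading of the written script: no context, no audio.
    dict_kana = pyopenjtalk.g2p(script_text, kana=True)
    # A real check would normalize long-vowel marks and small kana first;
    # skipped here to keep the sketch short.
    return dict_kana != asr_kana, dict_kana

# The JSUT-style failure: the script says 月印, the recording says ルナグラム.
flagged, expected = reading_mismatch("月印", "ルナグラム")
print(flagged, expected)  # True, plus whatever reading the dictionary returns
```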

When Irodori-TTS says “kanji reading accuracy is weak, convert to hiragana first,” the root cause is the same.
You can push the continuous-latent side as hard as you want, but if the input-text-to-reading step misfires, the audio you get out reads something other than what you wrote.
The ceiling on Japanese TTS naturalness is often set by the G2P (grapheme-to-phoneme) layer, not the model architecture.

This is also where the large multilingual models like VoxCPM2 and CosyVoice2 diverge from the Japanese-focused ones.
The large-model side leans on subword tokenizers and large-scale pretraining to learn character-to-phoneme-like representations end-to-end, which lets them sidestep OpenJTalk.
On the other hand, Japanese pitch-accent rules and intonation are quietly a strong suit for the rule-based OpenJTalk, and that’s why the Style-Bert-VITS2 and AivisSpeech family is still alive and well in Japanese local use.

Flipping it around: if you want to train your own voice, you don’t need to run hundreds of hours of multilingual training.
30 minutes to a few hours of script-paired audio plus Style-Bert-VITS2 fine-tuning is the most realistic path for local Japanese voice work right now.
From there, you patch weak readings via the user dictionary, or re-extract furigana from the audio side with Whisper, as part of regular operation.
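As a sketch of that patching step, pyopenjtalk can compile and load a MeCab-format user dictionary at runtime. The CSV column layout below follows the NAIST-jdic schema from memory and should be checked against the dictionary docs before relying on it:

```python
import pyopenjtalk

# One user-dictionary entry: teach the analyzer that 月印 reads ルナグラム.
# Illustrative column layout; verify against the naist-jdic CSV schema.
entry = "月印,,,1,名詞,固有名詞,一般,*,*,*,月印,ルナグラム,ルナグラム,0/5,*\n"
with open("user.csv", "w", encoding="utf-8") as f:
    f.write(entry)

# Compile the CSV to a binary dictionary, then load it globally so that
# later g2p calls pick up the patched reading.
pyopenjtalk.mecab_dict_index("user.csv", "user.dic")
pyopenjtalk.update_global_jtalk_with_user_dict("user.dic")

print(pyopenjtalk.g2p("月印", kana=True))  # should now come back as ルナグラム
```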

Voice cloning has hit a genuinely dangerous speed

VoxCPM2 pitches Controllable Voice Cloning and Ultimate Cloning as headline features.
A short reference audio pulls the timbre toward the speaker, and passing both reference audio and its transcript strengthens reproduction of timbre, rhythm, emotion, and style.
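As a shape reference, here is a sketch patterned on the original VoxCPM Python API from the OpenBMB README; whether VoxCPM2 keeps the same package name, checkpoint ID, and argument names is an assumption here:

```python
import soundfile as sf
from voxcpm import VoxCPM  # API shape per the VoxCPM README; VoxCPM2
                           # carrying it over unchanged is an assumption

model = VoxCPM.from_pretrained("openbmb/VoxCPM-0.5B")

wav = model.generate(
    text="ここに読み上げたいテキスト",
    prompt_wav_path="reference.wav",            # short sample pulls the timbre
    prompt_text="リファレンス音声の書き起こし",  # transcript strengthens cloning
    cfg_value=2.0,                              # guidance strength
    inference_timesteps=10,                     # diffusion steps: speed vs quality
    denoise=True,                               # clean up a noisy reference
    retry_badcase=True,                         # regenerate obviously broken takes
)
# The 0.5B card outputs 16kHz; VoxCPM2 advertises 48kHz, so adjust to match.
sf.write("cloned.wav", wav, 16000)
```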

This is genuinely useful for creative or narration work, but at phone-call audio quality the abuse threshold drops in lockstep.
A few seconds of reference audio, emotional reproduction, and near-realtime inference together get you uncomfortably close to fake-relative phone scams and impersonation of streamers or VTubers.

The model card’s Limitations section explicitly forbids impersonation, fraud, and disinformation.
Apache-2.0 being commercially friendly is not the same thing as “do whatever you want.”
TTS systems that do voice cloning have to design consent flows, generated-audio labeling, watermarking, and behavior under voice authentication into the system from day one, or operators will get cornered downstream.

When trying it, watch failure modes more than RTF

RTF 0.13 is conditional on RTX 4090, Nano-vLLM, and an optimized serving stack.
For local Mac or small-GPU evaluation, the faster path is to feed in a handful of stress categories (short text, long text, Japanese, mixed English, numerals, proper nouns, noisy reference audio) and listen to how it breaks.
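A minimal harness for that listening pass, with `synthesize` as a hypothetical stand-in returning `(waveform, sample_rate)`. Noisy reference audio is a prompt-side condition, so test it by swapping the reference clip rather than the text:

```python
import time

# Stress texts per category; swap in phrases from your own domain.
CASES = {
    "short":        "はい。",
    "long":         "これは長文の読み上げ耐性を見るためのテストです。" * 20,
    "japanese":     "吾輩は猫である。名前はまだ無い。",
    "mixed_en":     "明日のmeetingはGoogle Meetで10時からです。",
    "numerals":     "価格は1,980円、電話番号は03-1234-5678です。",
    "proper_nouns": "龍ケ崎市から安曇野市まで移動します。",
}

def probe(synthesize):
    for name, text in CASES.items():
        start = time.perf_counter()
        wav, sr = synthesize(text)
        rtf = (time.perf_counter() - start) / (len(wav) / sr)
        # RTF is logged for reference, but the real output is your ears:
        # skipped syllables, abnormal pauses, and pitch swings don't
        # show up in this number.
        print(f"{name}: rtf={rtf:.2f}, duration={len(wav) / sr:.1f}s")
```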

Generation settings matter too.
The Hugging Face samples expose inference_timesteps, cfg_value, denoise, and retry_badcase.
Optimizing only for speed tends to bring out skipped pronunciations, abnormally long pauses, and erratic tonal swings.

References