Irodori-TTS Japanese voice clone on a 4GB RTX 3050 Ti: 3.2s gen, FFmpeg DLL fix
I ran Aratako’s Japanese TTS “Irodori-TTS” on a Windows laptop with a 4GB RTX 3050 Ti, from default-voice generation to a zero-shot voice clone built on a single 4-second reference clip.
What Irodori-TTS is
Irodori-TTS is an open TTS model based on Flow Matching.
Flow Matching is a relative of diffusion models: instead of removing noise little by little, it learns a path that runs straight from noise to data, so generation finishes in fewer steps.
The architecture and training design follow Echo-TTS, generating into the continuous latent space of a DACVAE.
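To make the "straight path" idea concrete, here is a toy one-dimensional sketch. It is not Irodori-TTS's sampler: the target is a single point, so the ideal velocity field is known in closed form, and the names `ideal_velocity` and `sample` are made up for illustration. In a real model a neural network predicts the velocity, and the learned paths are only approximately straight.

```python
import random


def ideal_velocity(x: float, t: float, target: float) -> float:
    """Velocity along the straight path x_t = (1 - t) * x0 + t * target.

    For a single target point this has a closed form. A real model
    replaces it with a network trained to predict the velocity.
    """
    return (target - x) / (1.0 - t)


def sample(num_steps: int, target: float = 2.0, seed: int = 0) -> float:
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # always < 1, so the division above is safe
        x += dt * ideal_velocity(x, t, target)
    return x


for steps in (1, 4, 32):
    print(steps, round(sample(steps), 6))
```

Because the path is a straight line, the Euler integrator lands on the target whether it takes 1 step or 32. That is the intuition behind needing fewer steps than a sampler that follows a curved denoising trajectory.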
| Item | Details |
|---|---|
| Base model | Aratako/Irodori-TTS-500M-v3 (about 1.8GB) |
| Architecture | Rectified Flow Diffusion Transformer (RF-DiT) |
| Codec | Semantic-DACVAE-Japanese-32dim, 48kHz output |
| Voice cloning | Zero-shot from a reference clip |
| Length estimation | v3 has a built-in duration predictor; no --seconds needed |
| Emoji style control | Emotion and non-verbal expression controlled by emoji in the input text (on supported checkpoints) |
| Watermark | Automatic watermarking via SilentCipher |
| VoiceDesign variant | Aratako/Irodori-TTS-600M-v3-VoiceDesign; voice quality specified in text |
| License | Code is MIT |
Generated audio automatically carries an audio watermark (an identification signal humans can’t hear). Having an anti-abuse measure built in from the start is a good design call for this kind of model.
Environment
| Item | Details |
|---|---|
| PC | ASUS ROG Zephyrus G14 (GA401QE) |
| OS | Windows 11 Home |
| CPU | AMD Ryzen 7 5800HS (8 cores) |
| GPU | NVIDIA GeForce RTX 3050 Ti Laptop (4GB VRAM) |
| GPU driver | 555.97 (CUDA 12.5) |
| RAM | 16GB |
| Python | 3.10 (fetched automatically by uv) |
| PyTorch | 2.10.0+cu128 |
PyTorch is a CUDA 12.8 build, while the driver reports CUDA 12.5. It still runs, because CUDA's minor-version compatibility lets a 12.x build work with a 12.x driver.
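As a rough sanity check, the rule of thumb can be written down as "same major version". The sketch below is a simplification under that assumption: the helper names are hypothetical, and it ignores details such as the minimum driver version each CUDA major release requires. The build version is what `torch.version.cuda` reports, and the driver's CUDA version is what `nvidia-smi` shows.

```python
def parse_cuda_version(version: str) -> tuple[int, int]:
    """Split a version string such as '12.8' into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def same_major_cuda(build_version: str, driver_version: str) -> bool:
    """Rule-of-thumb check: the build and the driver share a CUDA major version."""
    return parse_cuda_version(build_version)[0] == parse_cuda_version(driver_version)[0]


# Values from the environment table: a cu128 build on a CUDA 12.5 driver.
print(same_major_cuda("12.8", "12.5"))
```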
Setup
The project assumes uv for package management. I didn’t have it yet, so installation came first.
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Clone the repository and install the CUDA dependencies.
git clone https://github.com/Aratako/Irodori-TTS.git
cd Irodori-TTS
uv sync --extra cu128
Backends are switched via --extra (uv's flag for installing an optional-dependency extra): besides cu128 there are rocm (AMD), xpu (Intel), and cpu.
The .python-version file makes uv fetch Python 3.10 on its own, so the system Python version doesn’t matter.
Check that the GPU is visible.
uv run --no-sync python -c "import torch; print(torch.cuda.is_available())"
# → True (on this machine the device is the NVIDIA GeForce RTX 3050 Ti Laptop GPU)
The model weights (about 1.8GB) download automatically into the Hugging Face cache on first inference.
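If you want to see where those weights end up, the sketch below computes the expected cache folder using only the standard library. It assumes the usual Hugging Face Hub layout (`models--<org>--<name>` under the hub cache) and the `HF_HUB_CACHE` / `HF_HOME` environment overrides. The function names are my own, and other settings that can move the cache are not handled.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def hf_hub_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the hub cache directory from the environment (simplified)."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


def repo_cache_dir(repo_id: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Folder where a model repo's files are expected to be cached."""
    return hf_hub_cache_dir(env) / ("models--" + repo_id.replace("/", "--"))


print(repo_cache_dir("Aratako/Irodori-TTS-500M-v3"))
```

Deleting that folder should force a fresh download on the next run, which is handy if a download was interrupted.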
Generating with the default voice
First, generation without a reference — the default voice.
uv run --no-sync python infer.py --hf-checkpoint Aratako/Irodori-TTS-500M-v3 `
--text "こんにちは、これはイロドリTTSのテスト音声です。今日はいい天気ですね。" `
--no-ref --output-wav outputs/test1.wav
Here’s the output. Same text, generated twice.
Timings:
| Run | Time |
|---|---|
| First run (incl. model download) | 317.7s |
| Second run (whole process) | 19.8s |
| Second run, generation pipeline only | ~3.2s |
Most of the 20 seconds is process startup and model loading; generation itself takes about 3.2s for 6.4s of audio, which is about twice as fast as realtime (a real-time factor of roughly 0.5).
Keep the process alive (Gradio UI, a resident server) and each generation only costs that ~3.2s.
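The arithmetic behind those numbers, as a small sketch. It uses only the timings measured above and assumes, for illustration, that every further clip costs the same as this one; `real_time_factor` and `batch_time` are names I made up.

```python
# Timings reported above, in seconds.
WHOLE_PROCESS_S = 19.8  # second run, process start to exit
GENERATION_S = 3.2      # generation pipeline only
AUDIO_S = 6.4           # length of the generated audio


def real_time_factor(generation_s: float, audio_s: float) -> float:
    """Seconds of compute per second of audio; below 1.0 is faster than realtime."""
    return generation_s / audio_s


def batch_time(n_clips: int, resident: bool) -> float:
    """Estimated wall time for n_clips generations similar to the measured one."""
    startup_s = WHOLE_PROCESS_S - GENERATION_S  # startup + model load, about 16.6s
    if resident:
        return startup_s + n_clips * GENERATION_S  # pay startup once
    return n_clips * WHOLE_PROCESS_S  # pay startup every time


print(f"RTF: {real_time_factor(GENERATION_S, AUDIO_S):.2f}")
print(f"10 clips, one process each: {batch_time(10, resident=False):.1f}s")
print(f"10 clips, resident process: {batch_time(10, resident=True):.1f}s")
```

Under that assumption, ten clips take roughly 198s when each one starts a new process and roughly 48.6s in a resident process.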
Voice cloning
The main event. As the reference, I used the “Kana” voice (4.0s, 44.1kHz) generated in the ZONOS2 post.
uv run --no-sync python infer.py --hf-checkpoint Aratako/Irodori-TTS-500M-v3 `
--text "こんにちは、これはお手本の声をもとにしたボイスクローンのテストです。うまく似ているでしょうか。" `
--ref-wav ref_sample.wav --output-wav outputs/cloned_test.wav
The output.
Even with a single 4-second reference, the pitch and texture come out close to the reference voice.
The whole process took 26.0s, with the generation pipeline at about 6.2s (of which encoding the reference audio took about 1.1s).
It runs longer than the default-voice generation because the text is longer and the reference encoding is added on top.
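Before cloning, it can help to confirm that the reference clip really has the length and sample rate you expect (4.0s at 44.1kHz in my case). A minimal standard-library sketch, assuming the reference is an uncompressed PCM WAV; `wav_info` is a hypothetical helper, not part of Irodori-TTS.

```python
import wave


def wav_info(path: str) -> tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getnframes() / rate
```

Calling `wav_info("ref_sample.wav")` on the reference used here should report a 44100 Hz sample rate and a duration of about 4 seconds.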
Gotchas
Inference was completely fine on 4GB of VRAM.
LoRA training, on the other hand, needs about 4.2GB at default settings, which is just over what this 4GB GPU has.
MP3 references fail with an FFmpeg DLL error
If you point --ref-wav directly at an MP3, torchaudio (via torchcodec) tries to load FFmpeg's shared DLLs and fails when they are missing.
Installing FFmpeg itself would work too, but converting to WAV with soundfile is quicker.
uv run --no-sync python -c "import soundfile as sf; d, sr = sf.read('ref_sample.mp3', dtype='float32'); sf.write('ref_sample.wav', d, sr)"
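The same conversion as a small reusable script, in case you have several references to convert. This is a sketch under two assumptions: the installed soundfile bundles a libsndfile with MP3 support (1.1.0 or later), and writing the WAV next to the source file is acceptable. `wav_path_for` and `convert_to_wav` are names I chose.

```python
from pathlib import Path


def wav_path_for(src: str) -> Path:
    """Output path: same folder and stem as the source, with a .wav suffix."""
    return Path(src).with_suffix(".wav")


def convert_to_wav(src: str) -> Path:
    """Decode an audio file with soundfile and rewrite it as WAV."""
    import soundfile as sf  # imported lazily so the helper above has no dependency

    data, sample_rate = sf.read(src, dtype="float32")
    dst = wav_path_for(src)
    sf.write(str(dst), data, sample_rate)  # sample rate is kept as-is
    return dst
```

Run it inside the project environment (for example via `uv run --no-sync python`) so that soundfile is importable, then pass the resulting path to --ref-wav.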
Text-based voice specification is a separate model
VoiceDesign — specifying voice quality in text like “a calm female voice” — uses Aratako/Irodori-TTS-600M-v3-VoiceDesign, not the base model.
You pass a caption like --caption "落ち着いた女性の声", and v3 can also combine a reference clip with a caption (voice from the reference; emotion and speaking style from text).
As the repository's full name, “A Flow Matching-based TTS Model with Emoji-driven Style Control”, says, controlling emotion and non-verbal expression with emoji in the input text is the model's signature feature. I didn't try it this time.