Irodori-TTS Japanese voice clone on a 4GB RTX 3050 Ti: 3.2s gen, FFmpeg DLL fix

I ran Aratako’s Japanese TTS “Irodori-TTS” on a Windows laptop with a 4GB RTX 3050 Ti, from default-voice generation to a zero-shot voice clone built on a single 4-second reference clip.

What Irodori-TTS is

Irodori-TTS is an open TTS model based on Flow Matching.
Flow Matching is a relative of diffusion models: instead of removing noise little by little, it learns a path that runs straight from noise to data, so generation finishes in fewer steps.
The architecture and training design follow Echo-TTS, generating into the continuous latent space of a DACVAE.
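
To make the "straight path" idea concrete, here is a toy sketch in plain Python. It is not Irodori-TTS's sampler: it assumes an oracle velocity field for one known target, where a real model predicts the velocity with a trained network. The names `oracle_velocity` and `euler_sample` are made up for illustration.

```python
import random

def oracle_velocity(x_t, t, target):
    """Velocity of the straight path x_t = (1 - t) * noise + t * target.

    Along that path the velocity is the constant (target - noise), which can
    be rewritten as (target - x_t) / (1 - t). A trained model would predict
    this from (x_t, t) and the text; here we cheat and use the known target.
    """
    return [(tg - x) / (1.0 - t) for x, tg in zip(x_t, target)]

def euler_sample(noise, target, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data) with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t stops at 1 - dt, so (1 - t) never hits zero
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
target = [0.5, -1.0, 2.0]  # stand-in for one latent frame
noise = [rng.gauss(0.0, 1.0) for _ in target]

for steps in (1, 4, 16):
    result = euler_sample(noise, target, steps)
    error = max(abs(r - t) for r, t in zip(result, target))
    print(f"{steps:2d} steps -> max error {error:.2e}")
```

With a perfectly straight oracle path, even a single Euler step lands on the target. A learned velocity field is only approximately straight, so real samplers still take several steps, just fewer than a typical diffusion schedule.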

Item	Details
Base model	`Aratako/Irodori-TTS-500M-v3` (about 1.8GB)
Architecture	Rectified Flow Diffusion Transformer (RF-DiT)
Codec	Semantic-DACVAE-Japanese-32dim, 48kHz output
Voice cloning	Zero-shot from a reference clip
Length estimation	v3 has a built-in duration predictor; no `--seconds` needed
Emoji style control	Emotion and non-verbal expression controlled by emoji in the input text (on supported checkpoints)
Watermark	Automatic watermarking via SilentCipher
VoiceDesign variant	`Aratako/Irodori-TTS-600M-v3-VoiceDesign`; voice quality specified in text
License	Code is MIT

Generated audio automatically carries an audio watermark (an identification signal humans can’t hear). Having an anti-abuse measure built in from the start is a good design call for this kind of model.

Environment

Item	Details
PC	ASUS ROG Zephyrus G14 (GA401QE)
OS	Windows 11 Home
CPU	AMD Ryzen 7 5800HS (8 cores)
GPU	NVIDIA GeForce RTX 3050 Ti Laptop (4GB VRAM)
GPU driver	555.97 (CUDA 12.5)
RAM	16GB
Python	3.10 (fetched automatically by uv)
PyTorch	2.10.0+cu128

PyTorch is a CUDA 12.8 build, while the driver reports CUDA 12.5. That still works thanks to CUDA's minor-version compatibility: a binary built against a newer 12.x toolkit runs on an older 12.x driver.
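
As a rough illustration of that rule, the sketch below classifies a build/driver pairing from the two version strings. It is a simplification: it ignores the minimum driver version NVIDIA requires for each major release and the cases where PTX JIT compilation is needed. The function names are hypothetical, not part of any library.

```python
def parse_cuda_version(version):
    """Turn a string like '12.8' into a (major, minor) tuple of ints."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

def classify_cuda_pairing(build_cuda, driver_cuda):
    """Describe how a CUDA build relates to the CUDA version a driver reports.

    Simplified model (an assumption, not NVIDIA's full compatibility matrix):
    - driver version >= build version: covered by ordinary backward compatibility
    - same major, driver minor older:  relies on minor-version compatibility
    - driver major older than build:   needs a driver upgrade
    """
    build = parse_cuda_version(build_cuda)
    driver = parse_cuda_version(driver_cuda)
    if driver >= build:
        return "natively supported"
    if driver[0] == build[0]:
        return "minor-version compatibility"
    return "driver upgrade needed"

# The pairing from the environment table: cu128 wheel on a CUDA 12.5 driver.
print(classify_cuda_pairing("12.8", "12.5"))
```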

Setup

The project assumes uv for package management. I didn’t have it yet, so installation came first.

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Clone the repository and install the CUDA dependencies.

git clone https://github.com/Aratako/Irodori-TTS.git
cd Irodori-TTS
uv sync --extra cu128

Backends are switched via --extra, which selects one of the project's optional dependency sets (extras): besides cu128 there are rocm (AMD), xpu (Intel), and cpu.
The .python-version file makes uv fetch Python 3.10 on its own, so the system Python version doesn’t matter.

Check that the GPU is visible.

uv run --no-sync python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# → True NVIDIA GeForce RTX 3050 Ti Laptop GPU
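
On a 4GB card it also helps to see how much VRAM PyTorch reports. The following is a small sketch using standard PyTorch calls; `describe_gpu` is a hypothetical helper name, and the script falls back to a message if torch or CUDA is missing. Save it as a file and run it with `uv run --no-sync python` from the repository directory.

```python
def describe_gpu():
    """Return a one-line summary of the first CUDA device, or the reason none is usable."""
    try:
        import torch  # imported lazily so the script still runs without torch
    except ImportError:
        return "torch is not installed in this environment"
    if not torch.cuda.is_available():
        return "CUDA is not available to PyTorch"
    props = torch.cuda.get_device_properties(0)
    vram_gib = props.total_memory / 1024**3  # total_memory is in bytes
    return f"{props.name}: {vram_gib:.1f} GiB VRAM, PyTorch CUDA build {torch.version.cuda}"

print(describe_gpu())
```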

The model weights (about 1.8GB) download automatically into the Hugging Face cache on first inference.

Generating with the default voice

First, generation without a reference — the default voice.

uv run --no-sync python infer.py --hf-checkpoint Aratako/Irodori-TTS-500M-v3 `
  --text "こんにちは、これはイロドリTTSのテスト音声です。今日はいい天気ですね。" `
  --no-ref --output-wav outputs/test1.wav

Here’s the output. Same text, generated twice.

Timings:

Run	Time
First run (incl. model download)	317.7s
Second run (whole process)	19.8s
Second run, generation pipeline only	~3.2s

Most of the 20 seconds is process startup and model loading. Generation itself takes about 3.2s for 6.4s of audio, a real-time factor of about 0.5, or twice as fast as playback.
Keep the process alive (Gradio UI, a resident server) and each generation only costs that ~3.2s.
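
The arithmetic behind those figures, as a short sketch. The numbers are the second-run measurements from the table above; `real_time_factor` is just an illustrative helper.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Generation time divided by audio length; below 1.0 is faster than playback."""
    return generation_seconds / audio_seconds

whole_process = 19.8  # second run, whole process (seconds)
generation = 3.2      # generation pipeline only (seconds)
audio_length = 6.4    # length of the generated audio (seconds)

rtf = real_time_factor(generation, audio_length)
overhead = whole_process - generation

print(f"RTF: {rtf:.2f}")                                # RTF: 0.50
print(f"Startup and model loading: {overhead:.1f}s")    # 16.6s
```

Roughly 16.6 of the 19.8 seconds are overhead, which is the part a long-lived process avoids.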

Voice cloning

The main event. As the reference, I used the “Kana” voice (4.0s, 44.1kHz) generated in the ZONOS2 post.

uv run --no-sync python infer.py --hf-checkpoint Aratako/Irodori-TTS-500M-v3 `
  --text "こんにちは、これはお手本の声をもとにしたボイスクローンのテストです。うまく似ているでしょうか。" `
  --ref-wav ref_sample.wav --output-wav outputs/cloned_test.wav

The output.

Even with a single 4-second reference, the pitch and texture come out close to the reference voice.

The whole process took 26.0s, with the generation pipeline at about 6.2s (of which encoding the reference audio took about 1.1s).
It runs longer than the default-voice generation because the text is longer and the reference encoding is added on top.
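
If you want to script several generations, the two commands above can be wrapped in a small helper. This is a sketch that uses only the flags shown in this post (`--hf-checkpoint`, `--text`, `--no-ref`, `--ref-wav`, `--output-wav`); the function names are hypothetical, and it assumes you run it from the repository directory.

```python
import subprocess

CHECKPOINT = "Aratako/Irodori-TTS-500M-v3"

def build_infer_command(text, output_wav, ref_wav=None):
    """Build the infer.py argument list: default voice if ref_wav is None, clone otherwise."""
    cmd = [
        "uv", "run", "--no-sync", "python", "infer.py",
        "--hf-checkpoint", CHECKPOINT,
        "--text", text,
    ]
    if ref_wav is None:
        cmd.append("--no-ref")
    else:
        cmd += ["--ref-wav", ref_wav]
    cmd += ["--output-wav", output_wav]
    return cmd

def run_infer(text, output_wav, ref_wav=None):
    """Run one generation as a subprocess; raises CalledProcessError on failure."""
    subprocess.run(build_infer_command(text, output_wav, ref_wav), check=True)

# Show the command without running it.
print(build_infer_command("こんにちは。", "outputs/demo.wav", ref_wav="ref_sample.wav"))
```

Each `run_infer` call starts a fresh process, so it pays the startup and model-loading cost every time. It is convenient for a handful of clips, not a substitute for a resident server.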

Gotchas

Inference was completely fine on 4GB of VRAM.
LoRA training, on the other hand, needs about 4.2GB at default settings, slightly more than this GPU's 4GB, so it does not fit on this machine.

MP3 references fail with an FFmpeg DLL error

If you pass an MP3 directly to --ref-wav, torchaudio (which decodes through torchcodec) looks for FFmpeg's shared DLLs and fails when it can't find them.
Installing FFmpeg itself would also work, but converting the clip to WAV with soundfile is quicker.

uv run --no-sync python -c "import soundfile as sf; d, sr = sf.read('ref_sample.mp3', dtype='float32'); sf.write('ref_sample.wav', d, sr)"
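
A file extension does not always match the contents, so it can be worth checking what a reference clip really is before passing it to --ref-wav. The sketch below peeks at the first bytes using only the standard library. It is a rough heuristic, not a full parser (the MP3 frame-sync test can also match other MPEG audio streams), and `sniff_audio_format` is a hypothetical helper name.

```python
def sniff_audio_format(path):
    """Guess 'wav', 'mp3', or 'unknown' from the first 12 bytes of a file."""
    with open(path, "rb") as f:
        header = f.read(12)
    # RIFF container with a WAVE form type
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    # MP3 with an ID3v2 tag at the start
    if header[:3] == b"ID3":
        return "mp3"
    # Bare MPEG audio frame: 11 sync bits set (0xFF followed by 0b111xxxxx)
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"
    return "unknown"
```

For example, `sniff_audio_format("ref_sample.wav")` returning `"mp3"` would mean the file was only renamed and still needs the conversion above.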

Text-based voice specification is a separate model

VoiceDesign — specifying voice quality in text like “a calm female voice” — uses Aratako/Irodori-TTS-600M-v3-VoiceDesign, not the base model.
You pass a caption like --caption "落ち着いた女性の声", and v3 can also combine a reference clip with a caption (voice from the reference; emotion and speaking style from text).

As the repository's full name, "A Flow Matching-based TTS Model with Emoji-driven Style Control", says, controlling emotion and non-verbal expression with emoji in the input text is the model's signature feature. I didn't try it this time.