MioTTS - a lightweight LLM-based TTS built from a custom codec
Contents
On February 11, 2026, Aratako released a family of lightweight Japanese-English TTS models. In the author's words, they were developed "completely from scratch, starting from the codec," and the release includes everything from the audio codec and vocoder to the TTS model and inference server.
- Collection: Aratako/MioTTS
- Inference code: Aratako/MioTTS-Inference
- Codec: Aratako/MioCodec
- License: varies by model size, see below
The big picture
MioTTS is a TTS model built directly on an LLM architecture. You input text, it autoregressively generates audio tokens, and those tokens are decoded into waveforms by a codec. The biggest advantage is that standard LLM inference frameworks work as-is, so tools like llama.cpp, Ollama, vLLM, and other systems that expose an OpenAI-compatible API can be used directly.
The stack has three pieces:
- MioCodec - a neural audio codec that separates audio into content tokens and a global embedding
- MioTTS itself - an autoregressive LLM-based TTS model ranging from 0.1B to 2.6B
- MioTTS-Inference - an inference server with a REST API and Gradio web UI
MioCodec: the custom audio codec
To understand MioTTS, you need to start with MioCodec.
MioCodec is a neural audio codec based on kanade-tokenizer. It splits audio into two components:
- Content tokens: discrete tokens that represent linguistic information and phonetic content at a 25Hz frame rate
- Global embedding: a continuous vector that represents speaker identity, recording environment, and microphone characteristics
That split is the key to MioTTS’s efficiency. The LLM part only has to generate content tokens, while the speaker information is provided separately as a global embedding. Zero-shot voice cloning is just a matter of combining the source content tokens with the target speaker’s global embedding.
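The cloning recipe above can be sketched as plain data flow. Note this is a conceptual illustration, not MioCodec's actual API: the class and function names here are hypothetical, and real content tokens and embeddings would be tensors produced by the codec's encoder.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EncodedAudio:
    content_tokens: List[int]      # discrete linguistic/phonetic tokens, 25 per second
    global_embedding: List[float]  # speaker identity, room, and mic characteristics

def clone_voice(source: EncodedAudio, target: EncodedAudio) -> EncodedAudio:
    """Zero-shot cloning: keep the source utterance's linguistic content,
    swap in the target speaker's global embedding."""
    return EncodedAudio(
        content_tokens=source.content_tokens,
        global_embedding=target.global_embedding,
    )

# "source" carries the words, "target" carries the voice.
source = EncodedAudio(content_tokens=[12, 845, 3], global_embedding=[0.1, -0.2])
target = EncodedAudio(content_tokens=[99, 7], global_embedding=[0.8, 0.3])

cloned = clone_voice(source, target)
```

In the real pipeline, the decoder would then render `cloned` into a waveform; the LLM only ever has to produce the content-token half.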
Specs
| Item | MioCodec 24kHz | MioCodec 44.1kHz |
|---|---|---|
| Token rate | 25 Hz | 25 Hz |
| Bitrate | 341 bps | 341 bps |
| Sample rate | 24 kHz | 44.1 kHz |
| Vocoder | iSTFTHead (built in) | MioVocoder (external) |
| Parameters | 132M | 118M (excluding vocoder) |
| Use case | Lightweight, fast inference | Higher quality audio processing |
The 24kHz version integrates iSTFTHead, so it can generate waveforms directly without an external vocoder. The 44.1kHz version is jointly tuned with MioVocoder for studio-quality audio processing.
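A quick sanity check on the spec table: at 25 tokens per second and 341 bps, each content token carries about 13.6 bits, which would correspond to a codebook of roughly 12.8k entries. The implied vocabulary size is my own back-of-envelope inference, not a figure from the model card.

```python
import math

token_rate_hz = 25    # content tokens per second (from the spec table)
bitrate_bps = 341     # bits per second (from the spec table)

bits_per_token = bitrate_bps / token_rate_hz   # ≈ 13.64 bits per token
implied_vocab = 2 ** bits_per_token            # ≈ 12.8k codebook entries

print(f"{bits_per_token:.2f} bits/token, implied vocab ≈ {implied_vocab:,.0f}")
```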
Training
It was trained on 9 languages and more than 88,000 hours of speech data. The SSL encoder uses WavLM-base+, and the training process has two stages for the 24kHz version or three stages for the 44.1kHz version.
- Phase 1: feature alignment using multi-resolution mel spectrogram loss plus SSL feature reconstruction loss
- Phase 2: perceptual quality improvements through adversarial training with a multi-period discriminator and multi-scale STFT discriminator
- Phase 3 (44.1kHz only): end-to-end joint tuning of the vocoder and codec decoder
The kanade paper, published on arXiv in February 2026, introduced a single-layer separative audio tokenizer. MioCodec extends that architecture by adding an integrated waveform decoder via iSTFTHead and by adjusting the vocabulary size.
The model family
There are six sizes, each initialized from a different base LLM.
| Model | Parameters | Base model | License | RTF |
|---|---|---|---|---|
| MioTTS-0.1B | 0.1B | Falcon-H1-Tiny-Multilingual-100M | Falcon-LLM License | 0.04 - 0.05 |
| MioTTS-0.4B | 0.4B | LiquidAI/LFM2-350M | LFM Open License v1.0 | 0.035 - 0.045 |
| MioTTS-0.6B | 0.6B | Qwen3-0.6B-Base | Apache 2.0 | 0.055 - 0.065 |
| MioTTS-1.2B | 1.2B | LiquidAI/LFM2.5-1.2B-Base | LFM Open License v1.0 | 0.065 - 0.075 |
| MioTTS-1.7B | 1.7B | Qwen3-1.7B-Base | Apache 2.0 | 0.10 - 0.11 |
| MioTTS-2.6B | 2.6B | LiquidAI/LFM2-2.6B | LFM Open License v1.0 | 0.135 - 0.145 |
RTF (real-time factor: generation time divided by audio duration, so lower is faster) is measured when generating about 15 seconds of audio on an NVIDIA RTX 5090 with vLLM 0.15.1. Even the 0.1B model achieves an RTF of 0.04 to 0.05, meaning it synthesizes audio 20 to 25 times faster than real time.
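To make the RTF numbers concrete, here is the arithmetic for a 15-second clip, using the midpoints of the reported ranges for the smallest and largest models:

```python
audio_seconds = 15.0

# Midpoints of the RTF ranges reported in the table above.
for model, rtf in [("MioTTS-0.1B", 0.045), ("MioTTS-2.6B", 0.14)]:
    wall_clock = audio_seconds * rtf   # seconds of compute to generate the clip
    speedup = 1.0 / rtf                # how many times faster than real time
    print(f"{model}: {wall_clock:.2f}s to generate {audio_seconds:.0f}s of audio "
          f"(~{speedup:.0f}x real time)")
```

So the 0.1B model produces the clip in well under a second, while even the 2.6B model needs only about two seconds.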
The choice of base models is interesting. It spans Falcon-H1, LiquidAI's LFM2/LFM2.5, and Qwen3, so the project effectively tests several LLM architectures. LFM2 is Liquid AI's hybrid architecture, which departs from the pure Transformer design; the 0.4B model's lower RTF compared with the 0.6B one may reflect that architectural difference.
The training data covers English and Japanese with about 100,000 hours in total. The dataset includes Emilia-Dataset and HiFiTTS-2.
License notes
The base-model license varies. The Apache 2.0 Qwen3-based models (0.6B and 1.7B) are the most permissive. Falcon-LLM License and LFM Open License each have their own usage conditions, so commercial use requires checking the terms.
Benchmark: J-HARD-TTS-Eval
The 0.1B model card includes results from J-HARD-TTS-Eval. This is a Japanese zero-shot TTS benchmark from Parakeet Inc. that measures the robustness of autoregressive TTS models across four tasks:
- Continuation: generate the rest of partially provided text
- Repetition: stability under repeated patterns
- Rhyme: handling of rhyme patterns
- Short: stability on short sentences
The metrics are CER (character error rate; lower is better) and SS (speaker similarity; higher is better).
Selected results for the 0.1B model:
- Rhyme: best CER at 0.1419
- Repetition: best CER at 4.963
- Continuation: best CER at 0.2884
For a model this small, the peak performance is very strong. The tradeoff is that average and worst-case scores vary widely, so stability is worse than that of larger models. Improving low-context cases such as the Short task remains an open challenge.
SS, the speaker similarity score, is around 0.53 to 0.57. The model card notes that this reflects a tradeoff in the design: the codec handles voice cloning instead of the LLM. Global embeddings are compact, but they have limits when it comes to capturing finer-grained speaker characteristics.
GGUF support: runs in llama.cpp and Ollama
GGUF quantized versions of all six sizes are available in MioTTS-GGUF.
| Model | BF16 | Q8_0 | Q6_K | Q4_K_M |
|---|---|---|---|---|
| 0.1B | 232 MB | 125 MB | 97.3 MB | 79.6 MB |
| 0.4B | 736 MB | 392 MB | 304 MB | 239 MB |
| 0.6B | 1.22 GB | 653 MB | 506 MB | 408 MB |
| 1.2B | 2.39 GB | 1.27 GB | 983 MB | 751 MB |
| 1.7B | 3.5 GB | 1.86 GB | 1.44 GB | 1.13 GB |
| 2.6B | 5.19 GB | 2.76 GB | 2.13 GB | 1.58 GB |
The 0.1B Q4_K_M model is only 79.6MB. That feels small enough to run on a phone. Since the standard LLM inference stack works as-is, you can run it with llama.cpp or Ollama, call it through an OpenAI-compatible API, and have a separate MioTTS-Inference server handle codec decoding.
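From the table, the quantization savings for the 0.1B model work out as follows (sizes copied from the GGUF table above):

```python
# File sizes for MioTTS-0.1B from the MioTTS-GGUF table, in MB.
sizes_mb = {"BF16": 232, "Q8_0": 125, "Q6_K": 97.3, "Q4_K_M": 79.6}

for quant, mb in sizes_mb.items():
    ratio = mb / sizes_mb["BF16"]
    print(f"{quant}: {mb} MB ({ratio:.0%} of BF16)")
```

Q4_K_M lands at roughly a third of the BF16 footprint, which is what makes the on-device scenario plausible.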
Setup
Inference server layout
```shell
git clone https://github.com/Aratako/MioTTS-Inference.git
cd MioTTS-Inference
uv sync
MAX_JOBS=8 uv pip install --no-build-isolation -v flash-attn
```
It starts in three steps:
1. Start an LLM inference server such as llama.cpp, Ollama, or vLLM
2. Start the TTS API server:
   ```shell
   python run_server.py --llm-base-url http://localhost:8000/v1
   ```
3. Start the web UI if desired:
   ```shell
   python run_gradio.py
   ```
The UI is then available at http://localhost:7860.
Preset voices
The default presets are jp_female, jp_male, en_female, and en_male. You can also register your own presets.
```shell
python scripts/generate_preset.py --audio /path/to/audio.wav --preset-id my_voice
```
Recommended generation parameters
- Temperature: 0.8
- Top-p: 1.0
- Repetition penalty: 1.0
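Passed through an OpenAI-compatible endpoint, the recommended settings might look like the payload below. The field names follow the common OpenAI-style convention; `repeat_penalty` in particular is a llama.cpp-style extension rather than part of the official OpenAI schema, so the exact key depends on which backend serves the LLM.

```python
def sampling_params(text: str) -> dict:
    """Build a request payload using the recommended MioTTS sampling settings.
    Field names are illustrative; check your LLM backend's API for the exact schema."""
    return {
        "prompt": text,
        "temperature": 0.8,     # recommended
        "top_p": 1.0,           # recommended (nucleus sampling effectively off)
        "repeat_penalty": 1.0,  # recommended (no repetition penalty)
    }

payload = sampling_params("こんにちは、MioTTSです。")
```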
Relation to T5Gemma-TTS
Aratako also released T5Gemma-TTS in 2025. That model uses an encoder-decoder LLM architecture combining T5 and Gemma. MioTTS is the next step: it moves to a decoder-only LLM architecture and also develops its own codec.
The fact that MioTTS sample generation uses T5Gemma-TTS and Gemini 2.5 Pro TTS as reference audio suggests that the earlier project experience carried over into MioTTS.
Comparison with other open-source TTS projects
Here is a quick comparison with recently released open-source TTS systems.
| Project | Architecture | Japanese | Parameters | Inference framework |
|---|---|---|---|---|
| MioTTS | Decoder-only LLM + custom codec | Supported | 0.1B-2.6B | llama.cpp, Ollama, vLLM |
| Qwen3-TTS | Discrete multi-codebook LM | Supported | 0.6B-1.7B | Dedicated |
| CosyVoice 3 | LLM + DiT Flow Matching | Supported | - | Dedicated |
| Kokoro | StyleTTS2-based | Supported | 82M | Dedicated |
The biggest differentiator for MioTTS is that it can use existing LLM inference frameworks unchanged. TTS models that can be converted to GGUF and run in llama.cpp are still rare. For edge devices and local deployments, that compatibility is a major practical advantage.
It is also unusual as a personal project because it is built completely from scratch starting at the codec level. Most people would reuse existing codecs such as Encodec, DAC, or SpeechTokenizer, but this project instead trained a kanade-based codec on 88,000 hours of data and prepared both 24kHz and 44.1kHz variants.