
MioTTS - a lightweight LLM-based TTS built from a custom codec

Ikesan

On February 11, 2026, Aratako released a family of lightweight Japanese-English TTS models. As the author puts it, they were "developed from scratch, starting with the codec," and the release includes everything from the audio codec and vocoder to the TTS model and inference server.

The big picture

MioTTS is a TTS model built directly on an LLM architecture. You input text, it autoregressively generates audio tokens, and those tokens are decoded into waveforms by a codec. The biggest advantage is that standard LLM inference frameworks work as-is, so tools like llama.cpp, Ollama, vLLM, and other systems that expose an OpenAI-compatible API can be used directly.

The stack has three pieces:

  1. MioCodec - a neural audio codec that separates audio into content tokens and a global embedding
  2. MioTTS itself - an autoregressive LLM-based TTS model ranging from 0.1B to 2.6B
  3. MioTTS-Inference - an inference server with a REST API and Gradio web UI

MioCodec: the custom audio codec

To understand MioTTS, you need to start with MioCodec.

MioCodec is a neural audio codec based on kanade-tokenizer. It splits audio into two components:

  • Content tokens: discrete tokens that represent linguistic information and phonetic content at a 25Hz frame rate
  • Global embedding: a continuous vector that represents speaker identity, recording environment, and microphone characteristics

That split is the key to MioTTS’s efficiency. The LLM part only has to generate content tokens, while the speaker information is provided separately as a global embedding. Zero-shot voice cloning is just a matter of combining the source content tokens with the target speaker’s global embedding.
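The content/speaker split can be illustrated with a toy sketch. The `ToyCodec` below is purely illustrative and is not the MioCodec API; it only shows how recombining one utterance's content tokens with another speaker's global embedding yields voice cloning:

```python
# Toy illustration of the MioCodec split (NOT the real API): a fake codec
# that "encodes" an utterance into content tokens plus a speaker embedding,
# and "decodes" any (tokens, embedding) pair back into an utterance.
class ToyCodec:
    def encode(self, utterance):
        # utterance: dict with "text" (content) and "speaker" (identity)
        content_tokens = list(utterance["text"])    # stands in for 25 Hz tokens
        global_embedding = utterance["speaker"]     # stands in for the speaker vector
        return content_tokens, global_embedding

    def decode(self, content_tokens, global_embedding):
        return {"text": "".join(content_tokens), "speaker": global_embedding}

def clone_voice(codec, source, target_speaker):
    content, _ = codec.encode(source)           # what is said
    _, speaker = codec.encode(target_speaker)   # who says it
    return codec.decode(content, speaker)

codec = ToyCodec()
out = clone_voice(codec,
                  source={"text": "hello", "speaker": "alice"},
                  target_speaker={"text": "anything", "speaker": "bob"})
print(out)  # {'text': 'hello', 'speaker': 'bob'}
```

The LLM never sees the speaker vector at all, which is exactly why it can stay small.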

Specs

Item | MioCodec 24kHz | MioCodec 44.1kHz
Token rate | 25 Hz | 25 Hz
Bitrate | 341 bps | 341 bps
Sample rate | 24 kHz | 44.1 kHz
Vocoder | iSTFTHead (built in) | MioVocoder (external)
Parameters | 132M | 118M (excluding vocoder)
Use case | Lightweight, fast inference | Higher-quality audio processing

The 24kHz version integrates iSTFTHead, so it can generate waveforms directly without an external vocoder. The 44.1kHz version is jointly tuned with MioVocoder for studio-quality audio processing.

Training

MioCodec was trained on more than 88,000 hours of speech data spanning nine languages. The SSL encoder is WavLM-base+, and training proceeds in two stages for the 24kHz version and three for the 44.1kHz version:

  1. Phase 1: feature alignment using multi-resolution mel spectrogram loss plus SSL feature reconstruction loss
  2. Phase 2: perceptual quality improvements through adversarial training with a multi-period discriminator and multi-scale STFT discriminator
  3. Phase 3 (44.1kHz only): end-to-end joint tuning of the vocoder and codec decoder
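Phase 1's multi-resolution spectral loss can be sketched in miniature. The toy function below compares magnitude spectra at several window sizes using a naive DFT over non-overlapping frames; real implementations use mel filterbanks, overlapping windows, and FFTs, so treat this only as an illustration of the idea:

```python
import cmath

def stft_mag(signal, win):
    """Magnitude spectra of non-overlapping frames of length `win` (naive DFT)."""
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, win)]
    mags = []
    for frame in frames:
        mags.append([abs(sum(x * cmath.exp(-2j * cmath.pi * k * n / win)
                             for n, x in enumerate(frame)))
                     for k in range(win // 2 + 1)])
    return mags

def multires_stft_loss(ref, est, wins=(8, 16, 32)):
    """Average L1 distance between magnitude spectra at several resolutions."""
    total = 0.0
    for win in wins:
        r, e = stft_mag(ref, win), stft_mag(est, win)
        diffs = [abs(a - b) for fr, fe in zip(r, e) for a, b in zip(fr, fe)]
        total += sum(diffs) / len(diffs)
    return total / len(wins)
```

Comparing at several window sizes penalizes errors in both fine temporal detail (short windows) and fine frequency detail (long windows), which is the point of the multi-resolution formulation.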

The kanade paper, published on arXiv in February 2026, introduced a single-layer separative audio tokenizer. MioCodec extends that architecture by adding an integrated waveform decoder via iSTFTHead and by adjusting the vocabulary size.

The model family

There are six sizes, each initialized from a different base LLM.

Model | Parameters | Base model | License | RTF
MioTTS-0.1B | 0.1B | Falcon-H1-Tiny-Multilingual-100M | Falcon-LLM License | 0.04 - 0.05
MioTTS-0.4B | 0.4B | LiquidAI/LFM2-350M | LFM Open License v1.0 | 0.035 - 0.045
MioTTS-0.6B | 0.6B | Qwen3-0.6B-Base | Apache 2.0 | 0.055 - 0.065
MioTTS-1.2B | 1.2B | LiquidAI/LFM2.5-1.2B-Base | LFM Open License v1.0 | 0.065 - 0.075
MioTTS-1.7B | 1.7B | Qwen3-1.7B-Base | Apache 2.0 | 0.10 - 0.11
MioTTS-2.6B | 2.6B | LiquidAI/LFM2-2.6B | LFM Open License v1.0 | 0.135 - 0.145

RTF is measured when generating about 15 seconds of audio on an NVIDIA RTX 5090 with vLLM 0.15.1. Even the 0.1B model has an RTF of 0.04 to 0.05, meaning it can synthesize audio 20 to 25 times faster than real time.
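As a quick sanity check on those numbers, RTF is simply generation time divided by the duration of the audio generated:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of audio generated (lower is better)."""
    return synthesis_seconds / audio_seconds

# The table's 0.1B figure: ~15 s of audio generated in ~0.6 s of compute.
rtf = real_time_factor(0.6, 15.0)   # about 0.04
speedup = 1.0 / rtf                 # about 25x real time
```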

The choice of base models is interesting. It spans Falcon H1, LiquidAI’s LFM2/LFM2.5, and Qwen3, so the project is testing different LLM architectures. LFM2 is Liquid AI’s hybrid architecture, which is not a Transformer and instead uses an SSM-style design. The 0.4B model having a lower RTF than the 0.6B one may be due to that architecture difference.

The training data covers English and Japanese with about 100,000 hours in total. The dataset includes Emilia-Dataset and HiFiTTS-2.

License notes

The base-model license varies. The Apache 2.0 Qwen3-based models (0.6B and 1.7B) are the most permissive. Falcon-LLM License and LFM Open License each have their own usage conditions, so commercial use requires checking the terms.

Benchmark: J-HARD-TTS-Eval

The 0.1B model card includes results from J-HARD-TTS-Eval. This is a Japanese zero-shot TTS benchmark from Parakeet Inc. that measures the robustness of autoregressive TTS models across four tasks:

  • Continuation: generate the rest of partially provided text
  • Repetition: stability under repeated patterns
  • Rhyme: handling of rhyme patterns
  • Short: stability on short sentences

The metrics are CER (character error rate; lower is better) and SS (speaker similarity; higher is better).
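CER is the Levenshtein edit distance between the reference text and the recognized transcript, normalized by reference length. A minimal stdlib implementation:

```python
def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length (reference non-empty)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # prev[j] = distance between reference[:i-1] and hypothesis[:j]
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / m

score = cer("こんにちは", "こんにちわ")  # one substitution in five characters -> 0.2
```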

Selected results for the 0.1B model:

  • Rhyme: best CER at 0.1419
  • Repetition: best CER at 4.963
  • Continuation: best CER at 0.2884

For a model this small, the peak performance is very strong. The tradeoff is that average and worst-case scores vary widely, so it is less stable than larger models. Improving cases with very little context, such as the Short task, remains an open challenge.

SS, the speaker similarity score, is around 0.53 to 0.57. The model card notes that this reflects a tradeoff in the design: the codec handles voice cloning instead of the LLM. Global embeddings are compact, but they have limits when it comes to capturing finer-grained speaker characteristics.
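SS scores of this kind are typically the cosine similarity between speaker embeddings extracted by a speaker-verification model. The model card does not specify the extractor, so the snippet below only shows the generic shape of the metric:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A score of 0.53 to 0.57 means the cloned voice points in broadly the same direction as the target speaker's embedding, but is far from identical.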

GGUF support: runs in llama.cpp and Ollama

GGUF quantized versions of all six sizes are available in MioTTS-GGUF.

Model | BF16 | Q8_0 | Q6_K | Q4_K_M
0.1B | 232 MB | 125 MB | 97.3 MB | 79.6 MB
0.4B | 736 MB | 392 MB | 304 MB | 239 MB
0.6B | 1.22 GB | 653 MB | 506 MB | 408 MB
1.2B | 2.39 GB | 1.27 GB | 983 MB | 751 MB
1.7B | 3.5 GB | 1.86 GB | 1.44 GB | 1.13 GB
2.6B | 5.19 GB | 2.76 GB | 2.13 GB | 1.58 GB

The 0.1B Q4_K_M model is only 79.6MB. That feels small enough to run on a phone. Since the standard LLM inference stack works as-is, you can run it with llama.cpp or Ollama, call it through an OpenAI-compatible API, and have a separate MioTTS-Inference server handle codec decoding.

Setup

Installing the inference server

git clone https://github.com/Aratako/MioTTS-Inference.git
cd MioTTS-Inference
uv sync
MAX_JOBS=8 uv pip install --no-build-isolation -v flash-attn

It starts in three steps:

1. Start an LLM inference server such as llama.cpp, Ollama, or vLLM

2. Start the TTS API server

python run_server.py --llm-base-url http://localhost:8000/v1

3. Start the web UI if desired

python run_gradio.py

The web UI is then available at http://localhost:7860.
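A minimal client sketch for the TTS API server started in step 2. The port, route, and payload fields below are assumptions for illustration, not the documented schema; check the MioTTS-Inference README for the actual endpoint before using this:

```python
import json
import urllib.request

# Hypothetical request to the MioTTS-Inference REST API.
# Host, port, path, and field names are assumed, not confirmed.
payload = {
    "text": "こんにちは、MioTTSです。",
    "preset_id": "jp_female",  # one of the preset voices
}
req = urllib.request.Request(
    "http://localhost:8020/tts",  # assumed host/port/path
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# audio_bytes = urllib.request.urlopen(req).read()  # uncomment with a live server
```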

Preset voices

The default presets are jp_female, jp_male, en_female, and en_male. You can also register your own presets.

python scripts/generate_preset.py --audio /path/to/audio.wav --preset-id my_voice

The sampling parameters are:

  • Temperature: 0.8
  • Top-p: 1.0
  • Repetition penalty: 1.0

Relation to T5Gemma-TTS

Aratako also released T5Gemma-TTS in 2025. That model uses an encoder-decoder LLM architecture combining T5 and Gemma. MioTTS is the next step: it moves to a decoder-only LLM architecture and also develops its own codec.

The samples for MioTTS were generated using T5Gemma-TTS and Gemini 2.5 Pro TTS audio as references, which suggests that experience from the earlier project carried over into MioTTS.

Comparison with other open-source TTS projects

Here is a quick comparison with recently released open-source TTS systems.

Project | Architecture | Japanese | Parameters | Inference framework
MioTTS | Decoder-only LLM + custom codec | Supported | 0.1B-2.6B | llama.cpp, Ollama, vLLM
Qwen3-TTS | Discrete multi-codebook LM | Supported | 0.6B-1.7B | Dedicated
CosyVoice 3 | LLM + DiT Flow Matching | Supported | - | Dedicated
Kokoro | StyleTTS2-based | Supported | 82M | Dedicated

The biggest differentiator for MioTTS is that it can use existing LLM inference frameworks unchanged. TTS models that can be converted to GGUF and run in llama.cpp are still rare. For edge devices and local deployments, that compatibility is a major practical advantage.

It is also unusual as a personal project because it is built completely from scratch starting at the codec level. Most people would reuse existing codecs such as Encodec, DAC, or SpeechTokenizer, but this project instead trained a kanade-based codec on 88,000 hours of data and prepared both 24kHz and 44.1kHz variants.