
MioTTS - a lightweight LLM-based TTS built from a custom codec

Ikesan

On February 11, 2026, Aratako released a family of lightweight Japanese-English TTS models. As the author puts it, they were "developed from scratch, starting with the codec," and the release includes everything from the audio codec and vocoder to the TTS model and inference server.

The big picture

MioTTS is a TTS model built directly on an LLM architecture. You input text, it autoregressively generates audio tokens, and those tokens are decoded into waveforms by a codec. The biggest advantage is that standard LLM inference frameworks work as-is, so tools like llama.cpp, Ollama, vLLM, and other systems that expose an OpenAI-compatible API can be used directly.

The stack has three pieces:

  1. MioCodec - a neural audio codec that separates audio into content tokens and a global embedding
  2. MioTTS itself - an autoregressive LLM-based TTS model ranging from 0.1B to 2.6B
  3. MioTTS-Inference - an inference server with a REST API and Gradio web UI

MioCodec: the custom audio codec

To understand MioTTS, you need to start with MioCodec.

MioCodec is a neural audio codec based on kanade-tokenizer. It splits audio into two components:

  • Content tokens: discrete tokens that represent linguistic information and phonetic content at a 25Hz frame rate
  • Global embedding: a continuous vector that represents speaker identity, recording environment, and microphone characteristics

That split is the key to MioTTS’s efficiency. The LLM part only has to generate content tokens, while the speaker information is provided separately as a global embedding. Zero-shot voice cloning is just a matter of combining the source content tokens with the target speaker’s global embedding.
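The content/speaker split can be illustrated with a toy sketch. The `ToyCodec` below is purely illustrative and is not the MioCodec API; it only shows how recombining one utterance's content tokens with another speaker's global embedding yields voice cloning:

```python
# Toy illustration of the MioCodec split (NOT the real API): a fake codec
# that "encodes" an utterance into content tokens plus a speaker embedding,
# and "decodes" any (tokens, embedding) pair back into an utterance.
class ToyCodec:
    def encode(self, utterance):
        # utterance: dict with "text" (content) and "speaker" (identity)
        content_tokens = list(utterance["text"])    # stands in for 25 Hz tokens
        global_embedding = utterance["speaker"]     # stands in for the speaker vector
        return content_tokens, global_embedding

    def decode(self, content_tokens, global_embedding):
        return {"text": "".join(content_tokens), "speaker": global_embedding}

def clone_voice(codec, source, target_speaker):
    content, _ = codec.encode(source)           # what is said
    _, speaker = codec.encode(target_speaker)   # who says it
    return codec.decode(content, speaker)

codec = ToyCodec()
out = clone_voice(codec,
                  source={"text": "hello", "speaker": "alice"},
                  target_speaker={"text": "anything", "speaker": "bob"})
print(out)  # {'text': 'hello', 'speaker': 'bob'}
```

The LLM never sees the speaker vector at all, which is exactly why it can stay small.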

Specs

Item | MioCodec 24kHz | MioCodec 44.1kHz
Token rate | 25 Hz | 25 Hz
Bitrate | 341 bps | 341 bps
Sample rate | 24 kHz | 44.1 kHz
Vocoder | iSTFTHead (built in) | MioVocoder (external)
Parameters | 132M | 118M (excluding vocoder)
Use case | Lightweight, fast inference | Higher-quality audio processing

The 24kHz version integrates iSTFTHead, so it can generate waveforms directly without an external vocoder. The 44.1kHz version is jointly tuned with MioVocoder for studio-quality audio processing.

Training

MioCodec was trained on more than 88,000 hours of speech data spanning nine languages. The SSL encoder is WavLM-base+, and training proceeds in two stages for the 24kHz version and three for the 44.1kHz version:

  1. Phase 1: feature alignment using multi-resolution mel spectrogram loss plus SSL feature reconstruction loss
  2. Phase 2: perceptual quality improvements through adversarial training with a multi-period discriminator and multi-scale STFT discriminator
  3. Phase 3 (44.1kHz only): end-to-end joint tuning of the vocoder and codec decoder
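Phase 1's multi-resolution spectral loss can be sketched in miniature. The toy function below compares magnitude spectra at several window sizes using a naive DFT over non-overlapping frames; real implementations use mel filterbanks, overlapping windows, and FFTs, so treat this only as an illustration of the idea:

```python
import cmath

def stft_mag(signal, win):
    """Magnitude spectra of non-overlapping frames of length `win` (naive DFT)."""
    frames = [signal[i:i + win] for i in range(0, len(signal) - win + 1, win)]
    mags = []
    for frame in frames:
        mags.append([abs(sum(x * cmath.exp(-2j * cmath.pi * k * n / win)
                             for n, x in enumerate(frame)))
                     for k in range(win // 2 + 1)])
    return mags

def multires_stft_loss(ref, est, wins=(8, 16, 32)):
    """Average L1 distance between magnitude spectra at several resolutions."""
    total = 0.0
    for win in wins:
        r, e = stft_mag(ref, win), stft_mag(est, win)
        diffs = [abs(a - b) for fr, fe in zip(r, e) for a, b in zip(fr, fe)]
        total += sum(diffs) / len(diffs)
    return total / len(wins)
```

Comparing at several window sizes penalizes errors in both fine temporal detail (short windows) and fine frequency detail (long windows), which is the point of the multi-resolution formulation.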

The kanade paper, published on arXiv in February 2026, introduced a single-layer separative audio tokenizer. MioCodec extends that architecture by adding an integrated waveform decoder via iSTFTHead and by adjusting the vocabulary size.

The model family

There are six sizes, each initialized from a different base LLM.

Model | Parameters | Base model | License | RTF
MioTTS-0.1B | 0.1B | Falcon-H1-Tiny-Multilingual-100M | Falcon-LLM License | 0.04 - 0.05
MioTTS-0.4B | 0.4B | LiquidAI/LFM2-350M | LFM Open License v1.0 | 0.035 - 0.045
MioTTS-0.6B | 0.6B | Qwen3-0.6B-Base | Apache 2.0 | 0.055 - 0.065
MioTTS-1.2B | 1.2B | LiquidAI/LFM2.5-1.2B-Base | LFM Open License v1.0 | 0.065 - 0.075
MioTTS-1.7B | 1.7B | Qwen3-1.7B-Base | Apache 2.0 | 0.10 - 0.11
MioTTS-2.6B | 2.6B | LiquidAI/LFM2-2.6B | LFM Open License v1.0 | 0.135 - 0.145

RTF is measured when generating about 15 seconds of audio on an NVIDIA RTX 5090 with vLLM 0.15.1. Even the 0.1B model has an RTF of 0.04 to 0.05, meaning it can synthesize audio 20 to 25 times faster than real time.
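As a quick sanity check on those numbers, RTF is simply generation time divided by the duration of the audio generated:

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of audio generated (lower is better)."""
    return synthesis_seconds / audio_seconds

# The table's 0.1B figure: ~15 s of audio generated in ~0.6 s of compute.
rtf = real_time_factor(0.6, 15.0)   # about 0.04
speedup = 1.0 / rtf                 # about 25x real time
```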

The choice of base models is interesting. It spans Falcon H1, LiquidAI’s LFM2/LFM2.5, and Qwen3, so the project is testing different LLM architectures. LFM2 is Liquid AI’s hybrid architecture, which is not a Transformer and instead uses an SSM-style design. The 0.4B model having a lower RTF than the 0.6B one may be due to that architecture difference.

The training data covers English and Japanese with about 100,000 hours in total. The dataset includes Emilia-Dataset and HiFiTTS-2.

License notes

The base-model license varies. The Apache 2.0 Qwen3-based models (0.6B and 1.7B) are the most permissive. Falcon-LLM License and LFM Open License each have their own usage conditions, so commercial use requires checking the terms.

Benchmark: J-HARD-TTS-Eval

The 0.1B model card includes results from J-HARD-TTS-Eval. This is a Japanese zero-shot TTS benchmark from Parakeet Inc. that measures the robustness of autoregressive TTS models across four tasks:

  • Continuation: generate the rest of partially provided text
  • Repetition: stability under repeated patterns
  • Rhyme: handling of rhyme patterns
  • Short: stability on short sentences

The metrics are CER (character error rate; lower is better) and SS (speaker similarity; higher is better).
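CER is the Levenshtein edit distance between the reference text and the recognized transcript, normalized by reference length. A minimal stdlib implementation:

```python
def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length (reference non-empty)."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # prev[j] = distance between reference[:i-1] and hypothesis[:j]
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / m

score = cer("こんにちは", "こんにちわ")  # one substitution in five characters -> 0.2
```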

Selected results for the 0.1B model:

  • Rhyme: best CER at 0.1419
  • Repetition: best CER at 4.963
  • Continuation: best CER at 0.2884

For a model this small, the peak performance is very strong. The tradeoff is that average and worst-case scores vary widely, so it is less stable than larger models. Improving cases with very little context, such as the Short task, remains an open challenge.

SS, the speaker similarity score, is around 0.53 to 0.57. The model card notes that this reflects a tradeoff in the design: the codec handles voice cloning instead of the LLM. Global embeddings are compact, but they have limits when it comes to capturing finer-grained speaker characteristics.
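SS scores of this kind are typically the cosine similarity between speaker embeddings extracted by a speaker-verification model. The model card does not specify the extractor, so the snippet below only shows the generic shape of the metric:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A score of 0.53 to 0.57 means the cloned voice points in broadly the same direction as the target speaker's embedding, but is far from identical.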

GGUF support: runs in llama.cpp and Ollama

GGUF quantized versions of all six sizes are available in MioTTS-GGUF.

Model | BF16 | Q8_0 | Q6_K | Q4_K_M
0.1B | 232 MB | 125 MB | 97.3 MB | 79.6 MB
0.4B | 736 MB | 392 MB | 304 MB | 239 MB
0.6B | 1.22 GB | 653 MB | 506 MB | 408 MB
1.2B | 2.39 GB | 1.27 GB | 983 MB | 751 MB
1.7B | 3.5 GB | 1.86 GB | 1.44 GB | 1.13 GB
2.6B | 5.19 GB | 2.76 GB | 2.13 GB | 1.58 GB

The 0.1B Q4_K_M model is only 79.6MB. That feels small enough to run on a phone. Since the standard LLM inference stack works as-is, you can run it with llama.cpp or Ollama, call it through an OpenAI-compatible API, and have a separate MioTTS-Inference server handle codec decoding.

Setup

Installing the inference server

git clone https://github.com/Aratako/MioTTS-Inference.git
cd MioTTS-Inference
uv sync
MAX_JOBS=8 uv pip install --no-build-isolation -v flash-attn

It starts in three steps:

1. Start an LLM inference server such as llama.cpp, Ollama, or vLLM

2. Start the TTS API server

python run_server.py --llm-base-url http://localhost:8000/v1

3. Start the web UI if desired

python run_gradio.py

The web UI is then available at http://localhost:7860.
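A minimal client sketch for the TTS API server started in step 2. The port, route, and payload fields below are assumptions for illustration, not the documented schema; check the MioTTS-Inference README for the actual endpoint before using this:

```python
import json
import urllib.request

# Hypothetical request to the MioTTS-Inference REST API.
# Host, port, path, and field names are assumed, not confirmed.
payload = {
    "text": "こんにちは、MioTTSです。",
    "preset_id": "jp_female",  # one of the preset voices
}
req = urllib.request.Request(
    "http://localhost:8020/tts",  # assumed host/port/path
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# audio_bytes = urllib.request.urlopen(req).read()  # uncomment with a live server
```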

Preset voices

The default presets are jp_female, jp_male, en_female, and en_male. You can also register your own presets.

python scripts/generate_preset.py --audio /path/to/audio.wav --preset-id my_voice

The sampling parameters are:

  • Temperature: 0.8
  • Top-p: 1.0
  • Repetition penalty: 1.0

Relation to T5Gemma-TTS

Aratako also released T5Gemma-TTS in 2025. That model uses an encoder-decoder LLM architecture combining T5 and Gemma. MioTTS is the next step: it moves to a decoder-only LLM architecture and also develops its own codec.

The samples for MioTTS were generated using T5Gemma-TTS and Gemini 2.5 Pro TTS audio as references, which suggests that experience from the earlier project carried over into MioTTS.

Comparison with other open-source TTS projects

Here is a quick comparison with recently released open-source TTS systems.

Project | Architecture | Japanese | Parameters | Inference framework
MioTTS | Decoder-only LLM + custom codec | Supported | 0.1B-2.6B | llama.cpp, Ollama, vLLM
Qwen3-TTS | Discrete multi-codebook LM | Supported | 0.6B-1.7B | Dedicated
CosyVoice 3 | LLM + DiT Flow Matching | Supported | - | Dedicated
Kokoro | StyleTTS2-based | Supported | 82M | Dedicated

The biggest differentiator for MioTTS is that it can use existing LLM inference frameworks unchanged. TTS models that can be converted to GGUF and run in llama.cpp are still rare. For edge devices and local deployments, that compatibility is a major practical advantage.

It is also unusual as a personal project because it is built completely from scratch starting at the codec level. Most people would reuse existing codecs such as Encodec, DAC, or SpeechTokenizer, but this project instead trained a kanade-based codec on 88,000 hours of data and prepared both 24kHz and 44.1kHz variants.