Pocket TTS — Lightweight Text-to-Speech on CPU
What Is Pocket TTS?
Pocket TTS is an open-source text-to-speech (TTS) model released by Kyutai Labs. Highlights:
- Lightweight with 100M parameters
- Faster than real time on CPU (no GPU required)
- Voice cloning supported (provide a WAV file to synthesize in any voice)
- MIT License
Published as the implementation of the paper “Continuous Audio Language Models” (arXiv:2509.06926).
CALM’s Approach
Traditional audio language models (ALMs) treat audio as a sequence of discrete tokens. A neural network compresses the waveform and selects the closest pattern IDs from a codebook to form a sequence—conceptually similar to lossy compression like MP3.
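The codebook lookup described above can be sketched in a few lines of NumPy. The sizes here are illustrative, not those of any real codec:

```python
import numpy as np

# Toy version of discrete tokenization (hypothetical sizes): each
# encoder frame is replaced by the ID of its nearest codebook entry,
# so decoding can only ever recover codebook vectors.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 8))   # 1024 learned patterns, dim 8
frames = rng.normal(size=(5, 8))        # 5 encoder frames to tokenize

# Nearest-neighbour lookup: squared distance to every codebook entry
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)           # discrete token IDs, shape (5,)

# The decoder sees only codebook[tokens]; the residual gap is the
# quantization loss that the discrete approach can never remove.
reconstruction_error = np.abs(frames - codebook[tokens]).mean()
```

However large the codebook, `reconstruction_error` stays nonzero, which is why discrete ALMs must spend more tokens to improve fidelity.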
This approach has issues:
- Tokens originate from a lossy compression stage, so bitrate is constrained.
- Improving audio quality requires more tokens, which increases compute cost.
CALM (Continuous Audio Language Model) takes a continuous approach:
- A Transformer backbone produces contextual embeddings at each time step.
- An MLP head directly generates the next latent frame of an audio VAE in continuous space.
By avoiding lossy quantization, it achieves high-quality audio generation at low compute cost.
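A minimal sketch of that pipeline, with random weights and made-up layer shapes standing in for the real backbone and head:

```python
import numpy as np

# Sketch of the continuous setup (illustrative shapes, not the real
# Pocket TTS weights): the backbone turns the latent frames so far into
# a contextual embedding, and an MLP head maps that embedding straight
# to the next continuous VAE frame -- no codebook lookup anywhere.
rng = np.random.default_rng(0)
d_latent, d_model = 8, 16
W_backbone = rng.normal(size=(d_latent, d_model))  # stand-in Transformer
W_head = rng.normal(size=(d_model, d_latent))      # stand-in MLP head

context = rng.normal(size=(1, d_latent))    # latent frames so far
embedding = np.tanh(context @ W_backbone)   # contextual embedding
next_frame = np.tanh(embedding @ W_head)    # continuous frame, not an ID
```

The output is a real-valued vector fed back as context for the next step, so no bitrate is lost to a quantization stage.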
Loss Function
Training the VAE uses the following objective:
L_total = λ_t·L_t + λ_f·L_f + λ_adv·L_adv + λ_feat·L_feat + λ_KL·L_KL + λ_wavlm·L_wavlm
Here the λ are the weights for each term, and the L denote losses (lower is better).
- L_t: Time-domain reconstruction loss (how close the waveform is)
- L_f: Frequency-domain reconstruction loss (how close the spectrum is)
- L_adv: Adversarial loss (GAN “realness” criterion)
- L_feat: Feature-matching loss (similarity of intermediate features)
- L_KL: KL regularization (latent distribution close to Gaussian)
- L_wavlm: WavLM distillation loss (learns semantic information from an ASR model)
The WavLM distillation loss is particularly interesting. If you only match waveforms or spectra, you run into cases that are “numerically similar but sound different to humans,” and vice versa. By distilling from an ASR model (WavLM), so that whether it recognizes two utterances as the same content feeds into the loss, training optimizes for a notion of “sounds similar” that is closer to human perception.
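As a rough illustration of what such a distillation term measures (the shapes and the cosine formulation here are assumptions for the sketch, not Kyutai's exact loss):

```python
import numpy as np

# Hypothetical distillation term: pull the codec's features toward
# frozen WavLM features for the same audio, via mean cosine distance.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(50, 768))                  # frozen WavLM features
student = teacher + 0.1 * rng.normal(size=(50, 768))  # codec's features

num = (student * teacher).sum(axis=-1)
den = np.linalg.norm(student, axis=-1) * np.linalg.norm(teacher, axis=-1)
l_wavlm = (1.0 - num / den).mean()   # 0 when the features align perfectly
```

Because the teacher was trained for recognition, matching its features rewards preserving *content*, not just raw signal shape.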
Replacing conventional RVQ (Residual Vector Quantization) with a VAE bottleneck constrained to a Gaussian distribution eliminates quantization loss. In addition, a Consistency Model enables single-step inference for speed.
For the basics of audio analysis (FFT and correlation coefficients), see the karaoke scoring article.
Installation
pip install pocket-tts
Requirements:
- Python 3.10–3.14
- PyTorch 2.5 or later
Usage
CLI
# Generate with a preset voice
pocket-tts generate --voice alba --text "Hello, world!"
# Custom voice (voice cloning)
pocket-tts generate --voice /path/to/voice.wav --text "Hello, world!"
There are eight preset voices: alba, marius, javert, jean, fantine, cosette, eponine, and azelma.
Python API
from pocket_tts import TTSModel
# Load the model
tts = TTSModel.load_model()
# Generate with a preset voice
voice_state = tts.get_state_for_voice("alba")
audio = tts.generate_audio(voice_state, "Hello, world!")
# Generate with a custom voice
voice_state = tts.get_state_for_audio_prompt("/path/to/voice.wav")
audio = tts.generate_audio(voice_state, "Hello, world!")
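To write the result to disk, here is a hedged sketch assuming `generate_audio` returns a mono float waveform in [-1, 1]; the 24 kHz sample rate is a placeholder, so check the rate your model actually reports:

```python
import wave
import numpy as np

SAMPLE_RATE = 24_000  # placeholder; use the rate reported by the model

def save_wav(path, audio, rate=SAMPLE_RATE):
    """Write a mono float waveform in [-1, 1] as a 16-bit PCM WAV."""
    pcm = (np.clip(np.asarray(audio), -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)      # 16-bit samples
        f.setframerate(rate)
        f.writeframes(pcm.tobytes())

# Demo with a 1-second 440 Hz tone standing in for the model's output
tone = np.sin(2 * np.pi * 440 * np.arange(SAMPLE_RATE) / SAMPLE_RATE)
save_wav("out.wav", tone)
```

In practice you would pass the array returned by `tts.generate_audio(...)` in place of the demo tone.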
Local Web Server
pocket-tts serve
# Served at http://localhost:8000
Notes on Voice Cloning
You can synthesize speech in any voice by providing a WAV file, but the terms of use prohibit the following:
- Voice cloning without the person’s explicit consent
- Use for impersonation or forgery
Using your own voice or a voice you have permission to use is fine.