
Pocket TTS — Lightweight Text-to-Speech on CPU

What Is Pocket TTS

Pocket TTS is an open-source text-to-speech (TTS) model released by Kyutai Labs. Highlights:

  • Lightweight with 100M parameters
  • Faster than real time on CPU (no GPU required)
  • Voice cloning supported (provide a WAV file to synthesize in any voice)
  • MIT License

Published as the implementation of the paper “Continuous Audio Language Models” (arXiv:2509.06926).

CALM’s Approach

Traditional audio language models (ALMs) treat audio as a sequence of discrete tokens. A neural network compresses the waveform and selects the closest pattern IDs from a codebook to form a sequence—conceptually similar to lossy compression like MP3.
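The discrete-token pipeline can be sketched as a nearest-neighbour codebook lookup. This is a toy illustration of the general idea, not Kyutai's actual codec; the sizes and the random codebook are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "codec": a codebook of 256 patterns, each a 16-dim latent frame.
codebook = rng.normal(size=(256, 16))

def quantize(frames: np.ndarray) -> np.ndarray:
    """Map each continuous latent frame to the ID of its closest codebook entry."""
    # Squared distance between every frame and every codebook vector.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)          # one discrete token per frame

def dequantize(token_ids: np.ndarray) -> np.ndarray:
    """Decoding only recovers the codebook entry, not the original frame."""
    return codebook[token_ids]

frames = rng.normal(size=(10, 16))       # continuous encoder output
tokens = quantize(frames)                # lossy: information is discarded here
recon = dequantize(tokens)
print(np.abs(frames - recon).mean())     # nonzero: the quantization error
```

The reconstruction error printed at the end is exactly the lossy-compression cost the article describes: adding codebooks (as RVQ does) shrinks it, but only by spending more tokens.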

This approach has issues:

  • Tokens originate from a lossy compression stage, so bitrate is constrained.
  • Improving audio quality requires more tokens, which increases compute cost.

CALM (Continuous Audio Language Model) takes a continuous approach:

  1. A Transformer backbone produces a contextual embedding at each time step.
  2. An MLP head generates the next audio-VAE frame directly in continuous space.

By avoiding lossy quantization, it achieves high-quality audio generation at low compute cost.
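The two-stage loop above can be sketched with random weights standing in for the trained backbone and MLP head. All names and dimensions here are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_LATENT = 32, 16               # illustrative sizes only

# Stand-ins for trained weights.
W_backbone = rng.normal(size=(D_LATENT, D_MODEL)) * 0.1
W_head = rng.normal(size=(D_MODEL, D_LATENT)) * 0.1

def backbone(history: np.ndarray) -> np.ndarray:
    """Toy 'Transformer': pool the frame history into one contextual embedding."""
    return np.tanh(history.mean(axis=0) @ W_backbone)

def mlp_head(embedding: np.ndarray) -> np.ndarray:
    """Predict the next continuous VAE frame — no codebook lookup anywhere."""
    return embedding @ W_head

frames = [np.zeros(D_LATENT)]            # start-of-audio frame
for _ in range(5):                       # autoregressive, frame by frame
    ctx = backbone(np.stack(frames))
    frames.append(mlp_head(ctx))

print(np.stack(frames).shape)            # (6, 16): six continuous latent frames
```

The key contrast with the discrete pipeline is that `mlp_head` outputs a real-valued vector fed straight to the VAE decoder, so no token budget caps the bitrate.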

Loss Function

Training the VAE uses the following objective:

$$\mathcal{L}_{\text{VAE}} = \lambda_t \mathcal{L}_t + \lambda_f \mathcal{L}_f + \lambda_{\text{adv}} \mathcal{L}_{\text{adv}} + \lambda_{\text{feat}} \mathcal{L}_{\text{feat}} + \lambda_{\text{KL}} \mathcal{L}_{\text{KL}} + \lambda_{\text{distill}} \mathcal{L}_{\text{distill}}$$

Each $\lambda$ is the weight for its term, and each $\mathcal{L}$ is a loss (lower is better).

  • $\mathcal{L}_t$: Time-domain reconstruction loss (how close the waveform is)
  • $\mathcal{L}_f$: Frequency-domain reconstruction loss (how close the spectrum is)
  • $\mathcal{L}_{\text{adv}}$: Adversarial loss (a GAN "realness" criterion)
  • $\mathcal{L}_{\text{feat}}$: Feature-matching loss (similarity of the discriminator's intermediate features)
  • $\mathcal{L}_{\text{KL}}$: KL regularization (keeps the latent distribution close to a Gaussian)
  • $\mathcal{L}_{\text{distill}}$: WavLM distillation loss (transfers semantic information from a self-supervised speech model)
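Numerically, the total is just a weighted sum of the six terms. The values below are placeholders to show the mechanics, not the weights used in training:

```python
# Hypothetical per-term losses and weights; the real values come from training.
losses = {"t": 0.40, "f": 0.25, "adv": 0.10, "feat": 0.05, "kl": 0.02, "distill": 0.30}
lambdas = {"t": 1.0, "f": 1.0, "adv": 4.0, "feat": 4.0, "kl": 0.1, "distill": 2.0}

# Weighted combination of all six loss terms.
total = sum(lambdas[k] * losses[k] for k in losses)
print(round(total, 3))                   # 1.852 for these placeholder values
```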

$\mathcal{L}_{\text{distill}}$ is particularly interesting. If you only match waveforms or spectra, you run into pairs that are "numerically similar but sound different to humans," and vice versa. By distilling from WavLM, a self-supervised speech representation model, and folding into the loss whether two utterances carry the same content, training optimizes for a notion of "sounds similar" that is closer to human perception.
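A minimal sketch of the distillation idea, with random arrays standing in for the frozen WavLM features and the VAE latents. The linear projection and the cosine form of the loss are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_VAE, D_WAVLM = 50, 16, 768          # illustrative dimensions

latents = rng.normal(size=(T, D_VAE))    # VAE encoder output for one utterance
teacher = rng.normal(size=(T, D_WAVLM))  # frozen WavLM features, same utterance

W_proj = rng.normal(size=(D_VAE, D_WAVLM)) * 0.1   # learned projection (stand-in)

def distill_loss(z: np.ndarray, h: np.ndarray) -> float:
    """1 - cosine similarity per frame: low when the VAE latents carry
    the same content the teacher hears, regardless of raw waveform distance."""
    p = z @ W_proj
    cos = (p * h).sum(axis=1) / (np.linalg.norm(p, axis=1) * np.linalg.norm(h, axis=1))
    return float((1.0 - cos).mean())

print(distill_loss(latents, teacher))    # near 1.0 for unrelated random inputs
```

Because the teacher is frozen, minimizing this term pushes the VAE latents toward a representation that encodes content the way a speech model does, rather than toward raw waveform fidelity.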

Replacing conventional RVQ (Residual Vector Quantization) with a VAE bottleneck constrained to a Gaussian distribution eliminates quantization loss. In addition, a Consistency Model enables single-step inference for speed.

For the basics of audio analysis (FFT and correlation coefficients), see the karaoke scoring article.

Installation

pip install pocket-tts

Requirements:

  • Python 3.10–3.14
  • PyTorch 2.5 or later

Usage

CLI

# Generate with a preset voice
pocket-tts generate --voice alba --text "Hello, world!"

# Custom voice (voice cloning)
pocket-tts generate --voice /path/to/voice.wav --text "Hello, world!"

There are eight preset voices: alba, marius, javert, jean, fantine, cosette, eponine, azelma

Python API

from pocket_tts import TTSModel

# Load the model
tts = TTSModel.load_model()

# Generate with a preset voice
voice_state = tts.get_state_for_voice("alba")
audio = tts.generate_audio(voice_state, "Hello, world!")

# Generate with a custom voice
voice_state = tts.get_state_for_audio_prompt("/path/to/voice.wav")
audio = tts.generate_audio(voice_state, "Hello, world!")
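The snippet above stops at `audio` without showing its type, so here is a self-contained way to write float samples to disk with the stdlib `wave` module. It assumes mono samples in [-1, 1] at 24 kHz — both assumptions, so check the package docs for the actual format and sample rate:

```python
import wave

import numpy as np

def save_wav(samples: np.ndarray, path: str, sample_rate: int = 24_000) -> None:
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)        # mono
        f.setsampwidth(2)        # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())

# Stand-in for `audio`: one second of a 440 Hz tone.
t = np.linspace(0, 1, 24_000, endpoint=False)
save_wav(0.5 * np.sin(2 * np.pi * 440 * t), "out.wav")
```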

Local Web Server

pocket-tts serve
# Serves at http://localhost:8000

Notes on Voice Cloning

You can synthesize speech in any voice by providing a WAV file, but the terms of use prohibit the following:

  • Voice cloning without the person’s explicit consent
  • Use for impersonation or forgery

Using your own voice or a voice you have permission to use is fine.