LuxTTS - lightweight ZipVoice-based voice cloning that runs in 1 GB of VRAM
Open-source TTS models keep arriving every month, but LuxTTS stands out because it is optimized entirely for lightness. It fits in 1 GB of VRAM, runs at 150x real-time on GPU, and even runs faster than real time on CPU.
- Repository: ysharma3501/LuxTTS
- Hugging Face: YatharthS/LuxTTS
- Demo: Hugging Face Spaces
- Official site: luxtts.com
- License: Apache 2.0 (commercial use allowed)
ZipVoice-based architecture
LuxTTS is built on ZipVoice but distills the inference pipeline down to just four steps, aiming for quality comparable to a model ten times its size while sidestepping the multi-stage inference that many TTS models rely on.
Main technical points:
| Feature | Details |
|---|---|
| Four-step distillation | Compresses the ZipVoice inference pipeline into four steps |
| 48 kHz vocoder | Many TTS systems output 24 kHz; LuxTTS outputs 48 kHz for cleaner audio |
| Improved sampling | Uses a custom sampling method instead of standard Euler sampling |
| 1 GB VRAM | Runs on almost any consumer GPU |
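ZipVoice is a flow-matching model, so "four-step distillation" amounts to integrating a learned velocity field with only four solver steps instead of dozens. The toy sketch below is plain NumPy, not LuxTTS code: it shows why a straightened ("rectified") flow can be integrated accurately with a handful of Euler steps, which is the general idea behind few-step distillation.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with num_steps Euler steps."""
    x, dt = x0.copy(), 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

# Toy rectified flow: the field points straight from x toward the target,
# so even a 4-step Euler solve lands on the target in this idealized case.
target = np.array([1.0, -2.0, 0.5])
velocity = lambda x, t: (target - x) / (1.0 - t)
noise = np.random.default_rng(0).standard_normal(3)
out = euler_sample(velocity, noise, num_steps=4)
```

In the real model the velocity field is a neural network and the trajectories are not perfectly straight, which is why distillation (training the few-step model to match the many-step teacher) is needed rather than simply reducing the step count.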
Specs and supported environments
| Item | Details |
|---|---|
| VRAM | Under 1 GB |
| GPU speed | 150x real time |
| CPU speed | Faster than real time |
| Output quality | 48 kHz |
| Voice cloning | Zero-shot supported |
| Supported devices | CUDA / CPU / MPS (Apple Silicon) |
| License | Apache 2.0 |
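Since all three backends from the table are supported, a small device-selection helper keeps scripts portable across machines. This is a generic sketch, not part of the LuxTTS API; the commented line shows how you would feed it from PyTorch's availability checks.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple Silicon's MPS backend, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# With PyTorch installed:
# device = pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())
# lux_tts = LuxTTS('YatharthS/LuxTTS', device=device)
```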
English and Chinese are the main supported languages. The official site also lists Japanese, Korean, and French, but user reports for non-English quality are mixed. Do not expect native-level Japanese quality.
Setup and usage
```bash
git clone https://github.com/ysharma3501/LuxTTS.git
cd LuxTTS
pip install -r requirements.txt
```
Basic generation example:
```python
from zipvoice.luxvoice import LuxTTS
import soundfile as sf

# Load model on GPU
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda')
# CPU: lux_tts = LuxTTS('YatharthS/LuxTTS', device='cpu', threads=2)
# Apple Silicon: lux_tts = LuxTTS('YatharthS/LuxTTS', device='mps')

# Encode reference audio
encoded_prompt = lux_tts.encode_prompt('reference.wav', rms=0.01)

# Generate speech
text = "Hello, this is a test of LuxTTS voice cloning."
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)

# Save
final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000)
```
Important parameters:
| Parameter | Default | Meaning |
|---|---|---|
| num_steps | 4 | Number of inference steps; 3-4 is usually the best balance of quality and speed |
| rms | 0.01 | Volume control |
| t_shift | 0.9 | Balance between quality and precision |
| speed | - | Playback speed adjustment |
| ref_duration | 5 seconds | Length of the reference audio used |
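The rms parameter's name suggests the reference audio is normalized to a target root-mean-square loudness before encoding. The exact behavior inside encode_prompt is not documented, so the helper below is only a plausible interpretation: a hypothetical normalize_rms function that scales a waveform so its RMS amplitude matches the target.

```python
import numpy as np

def normalize_rms(wav: np.ndarray, target_rms: float = 0.01) -> np.ndarray:
    """Scale a waveform so its root-mean-square amplitude equals target_rms."""
    rms = float(np.sqrt(np.mean(wav ** 2)))
    if rms < 1e-8:  # near-silence: nothing meaningful to scale
        return wav
    return wav * (target_rms / rms)

# Example: a loud 220 Hz sine wave (RMS ~0.35) scaled down to RMS 0.01
t = np.linspace(0, 1, 16000, endpoint=False)
wav = 0.5 * np.sin(2 * np.pi * 220 * t)
quiet = normalize_rms(wav, target_rms=0.01)
```

If cloning quality varies between reference clips, normalizing their loudness to a consistent level like this is a cheap preprocessing step to try.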
How good is the voice cloning?
150x real-time is undeniably fast, but the voice-cloning quality is mixed.
Reviews on HackerNoon note that text synthesis quality is good, but the voice-cloning side lags behind other models. It depends heavily on the quality of the reference audio, and noisy reference clips reduce cloning accuracy.
In a 2026 ranking of open-source voice-cloning models, Fish Speech V1.5, CosyVoice2-0.5B, and IndexTTS-2 took the top three spots, while LuxTTS was not in the top tier. This is the usual trade-off between lightness and quality.
Cost and deployment
LuxTTS’s biggest strength is how cheap it is to run.
Because it fits in 1 GB of VRAM, even an entry-level GPU from a few years ago is enough. It also runs faster than real time on CPU, so it is practical even without a GPU. There is no cloud API bill, no Docker requirement, and no ComfyUI requirement. A few lines of Python are enough to generate audio. The Apache 2.0 license also keeps commercial use simple.
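"150x real time" means 150 seconds of audio generated per second of wall-clock compute. It is easy to verify the claim on your own hardware by timing a generation call; the helper below is generic, and the commented usage assumes the LuxTTS API and 48 kHz output shown earlier.

```python
import time

def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute.
    Values above 1.0 mean faster than real time; LuxTTS claims ~150 on GPU."""
    return audio_seconds / wall_seconds

# Usage sketch with the LuxTTS call from the example above:
# start = time.perf_counter()
# wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)
# wall = time.perf_counter() - start
# print(real_time_factor(wav.numel() / 48000, wall))
```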
Float16 optimization is still under development, and v1.5 is planned.
Comparison with the TTS models covered on this blog
| Model | Parameters | VRAM | Speed | Japanese | Voice cloning | Clone quality |
|---|---|---|---|---|---|---|
| LuxTTS | Not disclosed (lightweight) | 1 GB | 150x RT | △ | ✅ (zero-shot) | △ |
| Qwen3-TTS | 0.6B / 1.7B | 4-16 GB | 97 ms streaming | ✅ | ✅ (3 sec) | ○ |
| KugelAudio | 7B | Unknown | AR + diffusion | ❌ (24 European languages) | ✅ (5-30 sec) | ○ |
| Pocket TTS | 100M | CPU | Faster than real time | ❌ | ✅ | △ |
| MioTTS | 0.1B-2.6B | Variable | llama.cpp supported | ✅ (JP/EN) | ❌ | - |
| MimikaStudio | Multiple engines | Variable | Depends on engine | Depends on engine | ✅ | Depends on engine |
LuxTTS is basically “the lightest and fastest English TTS.” If you want Japanese quality, Qwen3-TTS or MioTTS is the better pick. If you care about cloning fidelity, Qwen3-TTS or KugelAudio is stronger. If you want the absolute minimum footprint, Pocket TTS is the reference. LuxTTS makes sense when you care most about latency and low resource use, especially for English-only conversational systems.
Personally, 1 GB VRAM with 48 kHz output is interesting. The weak point is still Japanese support and the fact that voice cloning is not yet competitive with the best models. If v1.5 brings Float16 support, the speed will improve further, and quality might improve too.