
LuxTTS - lightweight ZipVoice-based voice cloning that runs in 1 GB of VRAM

By Ikesan

Open-source TTS models keep arriving every month, but LuxTTS stands out because it is optimized entirely for lightness. It fits in 1 GB of VRAM, runs at 150x real-time on GPU, and even runs faster than real time on CPU.

ZipVoice-based architecture

LuxTTS is based on ZipVoice but distills the inference pipeline down to four steps. Where many TTS models rely on multi-stage inference with dozens of sampling steps, LuxTTS aims for quality comparable to a model ten times its size in just those four steps.

Main technical points:

| Feature | Details |
|---|---|
| Four-step distillation | Compresses the ZipVoice inference pipeline into four steps |
| 48 kHz vocoder | Many TTS systems output 24 kHz; LuxTTS outputs 48 kHz for cleaner audio |
| Improved sampling | Uses a custom sampling method instead of standard Euler sampling |
| 1 GB VRAM | Runs on almost any consumer GPU |
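To get an intuition for why four steps is so cheap, here is a toy sketch of few-step Euler sampling of the kind flow-matching TTS models use. The `velocity` function is an illustrative stand-in, not the actual LuxTTS network, and the document says LuxTTS replaces plain Euler with a custom sampler; this only shows the shape of the step loop.

```python
# Toy sketch of few-step Euler sampling for a flow-matching model.
# In a distilled model, a handful of steps replace dozens of them.

def velocity(x, t):
    # Stand-in velocity field: drives x toward 1.0 as t goes 0 -> 1.
    # A real model would be a neural net conditioned on text + speaker.
    return [(1.0 - xi) / max(1.0 - t, 1e-6) for xi in x]

def euler_sample(x0, num_steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 in num_steps steps."""
    x, t = list(x0), 0.0
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x

print(euler_sample([0.0, 0.5], num_steps=4))  # -> [1.0, 1.0]
```

With a real (nonlinear) velocity field, fewer steps normally cost quality; distillation trains the model so that the few-step trajectory stays accurate.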

Specs and supported environments

| Item | Details |
|---|---|
| VRAM | Under 1 GB |
| GPU speed | 150x real time |
| CPU speed | Faster than real time |
| Output quality | 48 kHz |
| Voice cloning | Zero-shot supported |
| Supported devices | CUDA / CPU / MPS (Apple Silicon) |
| License | Apache 2.0 |

English and Chinese are the main supported languages. The official site also lists Japanese, Korean, and French, but user reports for non-English quality are mixed. Do not expect native-level Japanese quality.

Setup and usage

git clone https://github.com/ysharma3501/LuxTTS.git
cd LuxTTS
pip install -r requirements.txt
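Since LuxTTS runs on CUDA, CPU, and MPS, it can be convenient to pick the device string automatically. The helper below is not part of LuxTTS; it is a small convenience sketch that checks what PyTorch (which the LuxTTS API presumably sits on) can see and falls back to CPU.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what this machine supports."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass  # torch not installed: fall back to CPU
    return "cpu"

print(pick_device())
```

The result can then be passed straight to the constructor, e.g. `LuxTTS('YatharthS/LuxTTS', device=pick_device())`.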

Basic generation example:

from zipvoice.luxvoice import LuxTTS
import soundfile as sf

# Load model on GPU
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda')
# CPU: lux_tts = LuxTTS('YatharthS/LuxTTS', device='cpu', threads=2)
# Apple Silicon: lux_tts = LuxTTS('YatharthS/LuxTTS', device='mps')

# Encode reference audio
encoded_prompt = lux_tts.encode_prompt('reference.wav', rms=0.01)

# Generate speech
text = "Hello, this is a test of LuxTTS voice cloning."
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)

# Save
final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000)
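Zero-shot TTS models often behave better on long inputs when the text is split into sentences and synthesized chunk by chunk. Whether LuxTTS needs this is not documented, so treat the splitter below as an optional, generic preprocessing step, not part of the LuxTTS API.

```python
import re

def split_sentences(text, max_chars=200):
    """Split on sentence-ending punctuation, then pack into chunks <= max_chars."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

print(split_sentences("First sentence. Second one! A third?", max_chars=20))
```

Each chunk can then go through `generate_speech` with the same `encoded_prompt`, and the resulting arrays concatenated before writing the file.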

Important parameters:

| Parameter | Default | Meaning |
|---|---|---|
| num_steps | 4 | Number of inference steps; 3-4 is usually the best balance of quality and speed |
| rms | 0.01 | Volume control |
| t_shift | 0.9 | Balance between quality and precision |
| speed | - | Playback speed adjustment |
| ref_duration | 5 seconds | Length of the reference audio used |
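The table lists `rms` only as a volume control. A plausible reading (not confirmed by the LuxTTS docs) is that the reference audio gets normalized to a target root-mean-square level before encoding. A minimal sketch of that kind of normalization:

```python
import math

def normalize_rms(samples, target_rms=0.01):
    """Scale a waveform (floats in [-1, 1]) so its RMS equals target_rms."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_rms / rms
    return [s * gain for s in samples]

wav = [0.2, -0.1, 0.05, -0.3]
out = normalize_rms(wav, target_rms=0.01)
print(math.sqrt(sum(s * s for s in out) / len(out)))  # ~0.01
```

Normalizing loudness this way keeps a quiet or hot reference clip from skewing the prompt encoding.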

How good is the voice cloning?

150x real-time is undeniably fast, but the voice-cloning quality is mixed.
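"150x real time" means 150 seconds of audio per second of wall-clock compute. If you want to verify the figure on your own hardware, the measurement is simple; `generate_fn` below stands in for any synthesis call, and the sleep is just a dummy workload.

```python
import time

def speed_factor(audio_seconds, generate_fn):
    """Return seconds of audio produced per second of wall-clock compute."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed

# Dummy workload: pretend 1.0 s of audio took ~10 ms to generate.
factor = speed_factor(1.0, lambda: time.sleep(0.01))
print(f"{factor:.0f}x real time")  # roughly 100x for this dummy workload
```

For a real measurement, time a full `generate_speech` call and divide by the duration of the returned waveform (samples / 48000).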

Reviews on HackerNoon note that text synthesis quality is good, but the voice-cloning side lags behind other models. It depends heavily on the quality of the reference audio, and noisy reference clips reduce cloning accuracy.

In a 2026 ranking of open-source voice-cloning models, Fish Speech V1.5, CosyVoice2-0.5B, and IndexTTS-2 took the top three spots, while LuxTTS was not in the top tier. This is the usual trade-off between lightness and quality.

Cost and deployment

LuxTTS’s biggest strength is how cheap it is to run.

Because it fits in 1 GB of VRAM, even an entry-level GPU from a few years ago is enough. It also runs faster than real time on CPU, so it is practical even without a GPU. There is no cloud API bill, no Docker requirement, and no ComfyUI requirement. A few lines of Python are enough to generate audio. The Apache 2.0 license also keeps commercial use simple.

Float16 optimization is still under development, and v1.5 is planned.

Comparison with the TTS models covered on this blog

| Model | Parameters | VRAM | Speed | Japanese | Voice cloning | Clone quality |
|---|---|---|---|---|---|---|
| LuxTTS | Not disclosed (lightweight) | 1 GB | 150x RT | - | ✅ (zero-shot) | - |
| Qwen3-TTS | 0.6B / 1.7B | 4-16 GB | 97 ms streaming | ✅ | ✅ (3 sec) | - |
| KugelAudio | 7B | Unknown | AR + diffusion | ❌ (24 European languages) | ✅ (5-30 sec) | - |
| Pocket TTS | 100M | CPU | Faster than real time | - | - | - |
| MioTTS | 0.1B-2.6B | Variable | llama.cpp supported | ✅ (JP/EN) | - | - |
| MimikaStudio | Multiple engines | Variable | Depends on engine | Depends on engine | Depends on engine | - |

LuxTTS is basically “the lightest and fastest English TTS.” If you want Japanese quality, Qwen3-TTS or MioTTS is the better pick. If you care about cloning fidelity, Qwen3-TTS or KugelAudio is stronger. If you want the absolute minimum footprint, Pocket TTS is the reference. LuxTTS makes sense when you care most about latency and low resource use, especially for English-only conversational systems.


Personally, I find the combination of 1 GB VRAM and 48 kHz output compelling. The weak points remain Japanese support and voice cloning that is not yet competitive with the best models. If v1.5 brings Float16 support, speed should improve further, and quality may improve too.