
LuxTTS - lightweight ZipVoice-based voice cloning that runs in 1 GB of VRAM

By Ikesan

Open-source TTS models keep arriving every month, but LuxTTS stands out because it is optimized entirely for lightness. It fits in 1 GB of VRAM, runs at 150x real-time on GPU, and even runs faster than real time on CPU.

ZipVoice-based architecture

LuxTTS is based on ZipVoice but distills the inference pipeline down to four steps. Where many TTS models rely on multi-stage inference with dozens of sampling steps, LuxTTS aims for quality comparable to a model ten times its size in just those four steps.

Main technical points:

| Feature | Details |
|---|---|
| Four-step distillation | Compresses the ZipVoice inference pipeline into four steps |
| 48 kHz vocoder | Many TTS systems output 24 kHz; LuxTTS outputs 48 kHz for cleaner audio |
| Improved sampling | Uses a custom sampling method instead of standard Euler sampling |
| 1 GB VRAM | Runs on almost any consumer GPU |
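To get an intuition for why four steps is so cheap, here is a toy sketch of few-step Euler sampling of the kind flow-matching TTS models use. The `velocity` function is an illustrative stand-in, not the actual LuxTTS network, and the document says LuxTTS replaces plain Euler with a custom sampler; this only shows the shape of the step loop.

```python
# Toy sketch of few-step Euler sampling for a flow-matching model.
# In a distilled model, a handful of steps replace dozens of them.

def velocity(x, t):
    # Stand-in velocity field: drives x toward 1.0 as t goes 0 -> 1.
    # A real model would be a neural net conditioned on text + speaker.
    return [(1.0 - xi) / max(1.0 - t, 1e-6) for xi in x]

def euler_sample(x0, num_steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 in num_steps steps."""
    x, t = list(x0), 0.0
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x

print(euler_sample([0.0, 0.5], num_steps=4))  # -> [1.0, 1.0]
```

With a real (nonlinear) velocity field, fewer steps normally cost quality; distillation trains the model so that the few-step trajectory stays accurate.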

Specs and supported environments

| Item | Details |
|---|---|
| VRAM | Under 1 GB |
| GPU speed | 150x real time |
| CPU speed | Faster than real time |
| Output quality | 48 kHz |
| Voice cloning | Zero-shot supported |
| Supported devices | CUDA / CPU / MPS (Apple Silicon) |
| License | Apache 2.0 |

English and Chinese are the main supported languages. The official site also lists Japanese, Korean, and French, but user reports for non-English quality are mixed. Do not expect native-level Japanese quality.

Setup and usage

git clone https://github.com/ysharma3501/LuxTTS.git
cd LuxTTS
pip install -r requirements.txt
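Since LuxTTS runs on CUDA, CPU, and MPS, it can be convenient to pick the device string automatically. The helper below is not part of LuxTTS; it is a small convenience sketch that checks what PyTorch (which the LuxTTS API presumably sits on) can see and falls back to CPU.

```python
def pick_device():
    """Return 'cuda', 'mps', or 'cpu' depending on what this machine supports."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass  # torch not installed: fall back to CPU
    return "cpu"

print(pick_device())
```

The result can then be passed straight to the constructor, e.g. `LuxTTS('YatharthS/LuxTTS', device=pick_device())`.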

Basic generation example:

from zipvoice.luxvoice import LuxTTS
import soundfile as sf

# Load model on GPU
lux_tts = LuxTTS('YatharthS/LuxTTS', device='cuda')
# CPU: lux_tts = LuxTTS('YatharthS/LuxTTS', device='cpu', threads=2)
# Apple Silicon: lux_tts = LuxTTS('YatharthS/LuxTTS', device='mps')

# Encode reference audio
encoded_prompt = lux_tts.encode_prompt('reference.wav', rms=0.01)

# Generate speech
text = "Hello, this is a test of LuxTTS voice cloning."
final_wav = lux_tts.generate_speech(text, encoded_prompt, num_steps=4)

# Save
final_wav = final_wav.numpy().squeeze()
sf.write('output.wav', final_wav, 48000)
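Zero-shot TTS models often behave better on long inputs when the text is split into sentences and synthesized chunk by chunk. Whether LuxTTS needs this is not documented, so treat the splitter below as an optional, generic preprocessing step, not part of the LuxTTS API.

```python
import re

def split_sentences(text, max_chars=200):
    """Split on sentence-ending punctuation, then pack into chunks <= max_chars."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

print(split_sentences("First sentence. Second one! A third?", max_chars=20))
```

Each chunk can then go through `generate_speech` with the same `encoded_prompt`, and the resulting arrays concatenated before writing the file.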

Important parameters:

| Parameter | Default | Meaning |
|---|---|---|
| num_steps | 4 | Number of inference steps; 3-4 is usually the best balance of quality and speed |
| rms | 0.01 | Volume control |
| t_shift | 0.9 | Balance between quality and precision |
| speed | - | Playback speed adjustment |
| ref_duration | 5 seconds | Length of the reference audio used |
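The table lists `rms` only as a volume control. A plausible reading (not confirmed by the LuxTTS docs) is that the reference audio gets normalized to a target root-mean-square level before encoding. A minimal sketch of that kind of normalization:

```python
import math

def normalize_rms(samples, target_rms=0.01):
    """Scale a waveform (floats in [-1, 1]) so its RMS equals target_rms."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # silence: nothing to scale
    gain = target_rms / rms
    return [s * gain for s in samples]

wav = [0.2, -0.1, 0.05, -0.3]
out = normalize_rms(wav, target_rms=0.01)
print(math.sqrt(sum(s * s for s in out) / len(out)))  # ~0.01
```

Normalizing loudness this way keeps a quiet or hot reference clip from skewing the prompt encoding.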

How good is the voice cloning?

150x real-time is undeniably fast, but the voice-cloning quality is mixed.
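"150x real time" means 150 seconds of audio per second of wall-clock compute. If you want to verify the figure on your own hardware, the measurement is simple; `generate_fn` below stands in for any synthesis call, and the sleep is just a dummy workload.

```python
import time

def speed_factor(audio_seconds, generate_fn):
    """Return seconds of audio produced per second of wall-clock compute."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed

# Dummy workload: pretend 1.0 s of audio took ~10 ms to generate.
factor = speed_factor(1.0, lambda: time.sleep(0.01))
print(f"{factor:.0f}x real time")  # roughly 100x for this dummy workload
```

For a real measurement, time a full `generate_speech` call and divide by the duration of the returned waveform (samples / 48000).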

Reviews on HackerNoon note that text synthesis quality is good, but the voice-cloning side lags behind other models. It depends heavily on the quality of the reference audio, and noisy reference clips reduce cloning accuracy.

In a 2026 ranking of open-source voice-cloning models, Fish Speech V1.5, CosyVoice2-0.5B, and IndexTTS-2 took the top three spots, while LuxTTS was not in the top tier. This is the usual trade-off between lightness and quality.

Cost and deployment

LuxTTS’s biggest strength is how cheap it is to run.

Because it fits in 1 GB of VRAM, even an entry-level GPU from a few years ago is enough. It also runs faster than real time on CPU, so it is practical even without a GPU. There is no cloud API bill, no Docker requirement, and no ComfyUI requirement. A few lines of Python are enough to generate audio. The Apache 2.0 license also keeps commercial use simple.

Float16 optimization is still under development, and v1.5 is planned.

Comparison with the TTS models covered on this blog

| Model | Parameters | VRAM | Speed | Japanese | Voice cloning | Clone quality |
|---|---|---|---|---|---|---|
| LuxTTS | Not disclosed (lightweight) | 1 GB | 150x RT | - | ✅ (zero-shot) | - |
| Qwen3-TTS | 0.6B / 1.7B | 4-16 GB | 97 ms streaming | ✅ | ✅ (3 sec) | - |
| KugelAudio | 7B | Unknown | AR + diffusion | ❌ (24 European languages) | ✅ (5-30 sec) | - |
| Pocket TTS | 100M | CPU | Faster than real time | - | - | - |
| MioTTS | 0.1B-2.6B | Variable | llama.cpp supported | ✅ (JP/EN) | - | - |
| MimikaStudio | Multiple engines | Variable | Depends on engine | Depends on engine | Depends on engine | - |

LuxTTS is basically “the lightest and fastest English TTS.” If you want Japanese quality, Qwen3-TTS or MioTTS is the better pick. If you care about cloning fidelity, Qwen3-TTS or KugelAudio is stronger. If you want the absolute minimum footprint, Pocket TTS is the reference. LuxTTS makes sense when you care most about latency and low resource use, especially for English-only conversational systems.


Personally, I find the combination of 1 GB VRAM and 48 kHz output compelling. The weak points remain Japanese support and voice cloning that is not yet competitive with the best models. If v1.5 brings Float16 support, speed should improve further, and quality may improve too.