KugelAudio — Open‑Source 7B‑Parameter TTS (ComfyUI‑Compatible)
What Is KugelAudio?
KugelAudio is an open‑source Text‑to‑Speech (TTS) model developed by the Hasso‑Plattner‑Institut. It is a large 7B‑parameter model that combines autoregressive (AR) generation with diffusion.
- Repository: Kugelaudio/kugelaudio-open
- HuggingFace: kugelaudio/kugelaudio-0-open
- ComfyUI node: Saganaki22/ComfyUI-KugelAudio
- License: MIT
- Training data: YODAS2 (~200k hours)
- Base: Microsoft VibeVoice + Qwen (LLM backbone)
Key Features
| Feature | Description |
|---|---|
| Single‑speaker TTS | Generate speech from text |
| Voice cloning | Clone a voice from a 5–30s reference audio |
| Multi‑speaker | Generate conversations with up to 6 speakers (use Speaker N: notation) |
| Watermark | Imperceptible watermark via AudioSeal (detector node available) |
| 4‑bit quantization | Reduce VRAM from ~19GB to ~8GB (CUDA only) |
| Attention options | SageAttention / FlashAttention / SDPA / Eager |
Supported languages: 24 European languages, including English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Romanian, Hungarian, Swedish, Danish, Finnish, Norwegian, Greek, Bulgarian, Slovak, Croatian, Serbian, and Turkish. Japanese is not supported.
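The multi‑speaker feature expects a script written in Speaker N: notation. The following is a minimal sketch of how such a script could be checked before it is sent to the node. The function name parse_script and the exact accepted format (numbering, whitespace) are assumptions for illustration; only the notation and the 6‑speaker limit come from the table above.

```python
import re

MAX_SPEAKERS = 6  # limit stated in the feature table


def parse_script(script: str) -> list[tuple[int, str]]:
    """Split a 'Speaker N: text' script into (speaker_number, text) turns.

    Hypothetical helper: the real node may accept a looser or stricter format.
    """
    turns = []
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between turns
        match = re.match(r"Speaker\s+(\d+):\s*(.+)", line)
        if match is None:
            raise ValueError(f"Line does not use 'Speaker N:' notation: {line!r}")
        turns.append((int(match.group(1)), match.group(2)))

    # Count distinct speakers and enforce the documented upper bound.
    speakers = {number for number, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(f"{len(speakers)} speakers used; at most {MAX_SPEAKERS} are supported")
    return turns


script = """
Speaker 1: Hello, how are you?
Speaker 2: Fine, thanks. And you?
"""
turns = parse_script(script)
```

Validating the script up front avoids spending a full generation pass on text the node might reject or read as a single speaker.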
Benchmark
Results from the authors’ human A/B tests (n = 339).
| Rank | Model | Score | Win rate |
|---|---|---|---|
| 1 | KugelAudio | 26 | 78.0% |
| 2 | ElevenLabs Multi v2 | 25 | 62.2% |
| 3 | ElevenLabs v3 | 21 | 65.3% |
| 4 | Cartesia | 21 | 59.1% |
| 5 | VibeVoice | 10 | 28.8% |
| 6 | CosyVoice v3 | 9 | 14.2% |
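Notably, KugelAudio outscored both ElevenLabs models. The table is ordered by score, which is why ElevenLabs v3 ranks below Multi v2 despite its higher win rate. These are the authors’ own evaluations rather than independent third‑party results, so treat the numbers with caution.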
System Requirements
| Mode | VRAM | Notes |
|---|---|---|
| Full precision | ~19GB | bfloat16 |
| 4‑bit quantized | ~8GB | CUDA only; SDPA/Eager only |
Generation speed is roughly RTF (Real‑Time Factor) ≈ 1.0×. In other words, generating 10 seconds of audio takes about 10 seconds.
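As a small worked example of the RTF arithmetic, the sketch below computes the factor from measured times and estimates generation time for a given clip length. The function names are illustrative, and the default of 1.0 is only the rough figure quoted above, not a measured value.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def estimated_generation_seconds(audio_seconds: float, rtf: float = 1.0) -> float:
    """Estimate how long it takes to generate a clip at a given RTF."""
    return audio_seconds * rtf


# 10 seconds of audio generated in 10 seconds gives an RTF of 1.0.
rtf = real_time_factor(generation_seconds=10.0, audio_seconds=10.0)

# At RTF 1.0, a 3-minute narration takes about 3 minutes to generate.
narration_estimate = estimated_generation_seconds(audio_seconds=180.0)
```

An RTF below 1.0 means faster than real time; above 1.0 means slower, which is what to expect in CPU mode.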
Apple Silicon (M1/M2/M3/M4)
MPS is supported, but stability is an issue.
Status
- Memory: With 64GB+ you can run full precision (~19GB)
- Precision: float16 on MPS (no bfloat16)
- 4‑bit quantization: Unavailable (bitsandbytes is CUDA‑only)
Known Issues
These caveats are noted in the README:
- mps_matmul errors may occur
- Sometimes you’ll see “incompatible dimensions” or “LLVM ERROR”
- If the above errors show up, switch the Device setting to cpu
Practical Options
- Try MPS first
- If errors occur, switch to CPU mode (much slower)
- If you need practical speed, consider a cloud GPU (e.g., RunPod)
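The decision flow above, together with the VRAM figures from the System Requirements table, can be sketched as a small selection function. This is an illustration only: choose_settings and its return format are hypothetical, the thresholds are the approximate values quoted in this article, and the fallback to CPU when CUDA VRAM is below ~8GB is an assumption.

```python
from typing import Optional


def choose_settings(
    has_cuda: bool,
    has_mps: bool,
    vram_gb: float = 0.0,
    mps_failed: bool = False,
) -> dict:
    """Pick a device and precision following the article's guidance.

    Hypothetical helper; it does not call into KugelAudio or PyTorch.
    """
    precision: Optional[str]
    if has_cuda and vram_gb >= 19:
        # Enough VRAM for the full model (~19GB in bfloat16).
        return {"device": "cuda", "precision": "bfloat16", "use_4bit": False}
    if has_cuda and vram_gb >= 8:
        # 4-bit quantization (~8GB) is CUDA only.
        return {"device": "cuda", "precision": "4-bit", "use_4bit": True}
    if has_mps and not mps_failed:
        # Apple Silicon: float16 on MPS, no bfloat16, no 4-bit.
        return {"device": "mps", "precision": "float16", "use_4bit": False}
    # Last resort: CPU mode (much slower). The article does not state a CPU precision.
    precision = None
    return {"device": "cpu", "precision": precision, "use_4bit": False}


# Try MPS first; if mps_matmul or LLVM errors appeared, retry with mps_failed=True.
first_try = choose_settings(has_cuda=False, has_mps=True)
after_error = choose_settings(has_cuda=False, has_mps=True, mps_failed=True)
```

In practice the mps_failed flag would be set by hand after seeing one of the errors listed under Known Issues, since those show up at generation time rather than at startup.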
Comparison With Other TTS
Compared with TTS engines previously covered on this blog:
| Model | Parameters | Runtime | Japanese | Voice cloning |
|---|---|---|---|---|
| KugelAudio | 7B | GPU (19GB) / 4‑bit (8GB) | ❌ | ✅ |
| Pocket TTS | 100M | CPU | ❌ | ✅ |
| VOICEVOX | - | CPU | ✅ | ❌ |
| Style‑Bert‑VITS2 | - | GPU recommended | ✅ | ✅ |
KugelAudio is a large 7B model focused on quality. If you need Japanese, you’ll likely use VOICEVOX or Style‑Bert‑VITS2 instead.
Using With ComfyUI
Install via ComfyUI Manager by searching for “KugelAudio,” or clone manually.
cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-KugelAudio.git
On first launch, the model (~14GB) is downloaded automatically.
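The ~14GB download size is consistent with a back‑of‑the‑envelope estimate for 7B parameters stored at 16 bits each. The sketch below shows the arithmetic; it assumes a nominal 7 billion parameters and ignores non‑weight files, so it is a rough check rather than a description of the actual checkpoint layout.

```python
def weight_size_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 1e9


# Assumption: a nominal 7e9 parameters stored as 16-bit values (e.g. bfloat16).
full_size = weight_size_gb(7e9, 16)

# The same weights at 4 bits each; runtime VRAM is higher than this because
# activations and other components also need memory.
quantized_size = weight_size_gb(7e9, 4)

print(f"16-bit weights: ~{full_size:.1f} GB, 4-bit weights: ~{quantized_size:.1f} GB")
```

Make sure the disk holding the model cache has at least that much free space before the first launch.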
Core Nodes
- KugelAudio TTS: text → speech
- KugelAudio Voice Clone: reference audio + text → speech
- KugelAudio Multi‑Speaker: multi‑speaker conversation generation
- KugelAudio Watermark Check: detect watermark in generated audio
Parameters
- cfg_scale: guidance scale (1.0–10.0; default 3.0)
- max_new_tokens: max generation length (512–4096; default 2048)
- use_4bit: 4‑bit quantization (CUDA only)
- attention_type: auto / sage_attn / flash_attn / sdpa / eager
- keep_loaded: keep the model in VRAM (faster for consecutive generations)
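When driving the nodes from a script, it can help to check parameter values against the documented ranges first. The following is a minimal sketch: the KugelAudioParams class is hypothetical and not part of the node package, and the defaults for use_4bit, attention_type, and keep_loaded are assumptions, since the article only states defaults for cfg_scale and max_new_tokens.

```python
from dataclasses import dataclass

ATTENTION_TYPES = ("auto", "sage_attn", "flash_attn", "sdpa", "eager")


@dataclass
class KugelAudioParams:
    """Hypothetical container mirroring the node parameters listed above."""

    cfg_scale: float = 3.0        # documented default
    max_new_tokens: int = 2048    # documented default
    use_4bit: bool = False        # assumed default
    attention_type: str = "auto"  # assumed default
    keep_loaded: bool = False     # assumed default

    def validate(self) -> "KugelAudioParams":
        if not 1.0 <= self.cfg_scale <= 10.0:
            raise ValueError("cfg_scale must be between 1.0 and 10.0")
        if not 512 <= self.max_new_tokens <= 4096:
            raise ValueError("max_new_tokens must be between 512 and 4096")
        if self.attention_type not in ATTENTION_TYPES:
            raise ValueError(f"attention_type must be one of {ATTENTION_TYPES}")
        # The requirements table lists 4-bit mode as SDPA/Eager only.
        if self.use_4bit and self.attention_type in ("sage_attn", "flash_attn"):
            raise ValueError("4-bit mode supports only sdpa or eager attention")
        return self


params = KugelAudioParams(cfg_scale=4.5, use_4bit=True, attention_type="sdpa").validate()
```

Whether auto is safe to combine with use_4bit depends on how the node resolves it, so the sketch leaves that combination unchecked.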
Related Articles
- Building a talkable AI environment (1): voice API survey — comparison of TTS APIs
- Pocket TTS — lightweight text‑to‑speech that runs on CPU — an ultra‑lightweight 100M TTS
- Specs for running Qwen‑Image‑Edit‑2511 locally — notes on quantization