KugelAudio — Open‑Source 7B‑Parameter TTS (ComfyUI‑Compatible)

Update (2026-04-23): Fixed broken internal links.

What Is KugelAudio?

KugelAudio is an open‑source text‑to‑speech (TTS) model developed by the Hasso‑Plattner‑Institut. It is a large 7B‑parameter model that combines autoregressive (AR) generation with diffusion.

Repository: Kugelaudio/kugelaudio-open
HuggingFace: kugelaudio/kugelaudio-0-open
ComfyUI node: Saganaki22/ComfyUI-KugelAudio
License: MIT
Training data: YODAS2 (~200k hours)
Base: Microsoft VibeVoice + Qwen (LLM backbone)

Key Features

Feature	Description
Single‑speaker TTS	Generate speech from text
Voice cloning	Clone a voice from 5–30 s of reference audio
Multi‑speaker	Generate conversations with up to 6 speakers (use `Speaker N:` notation)
Watermark	Imperceptible watermark via AudioSeal (detector node available)
4‑bit quantization	Reduce VRAM from ~19GB to ~8GB (CUDA only)
Attention options	SageAttention / FlashAttention / SDPA / Eager
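
The `Speaker N:` notation from the multi‑speaker row can be produced programmatically. The following is a minimal sketch, not the node's own code: the helper name `build_script` is hypothetical, and it assumes one turn per line. The source does not say whether speaker numbering starts at 0 or 1, so the example simply passes the numbers through.

```python
MAX_SPEAKERS = 6  # limit stated in the feature table


def build_script(turns):
    """Format (speaker_number, text) pairs as 'Speaker N: text' lines.

    Assumption: one turn per line. The numbering base (0 or 1) is not
    specified in the article, so numbers are used exactly as given.
    """
    speakers = {number for number, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers requested, at most {MAX_SPEAKERS} supported"
        )
    lines = []
    for number, text in turns:
        if not isinstance(number, int) or number < 0:
            raise ValueError(f"invalid speaker number: {number!r}")
        lines.append(f"Speaker {number}: {text.strip()}")
    return "\n".join(lines)


script = build_script([
    (1, "Did you try the new TTS model?"),
    (2, "Not yet. Does it run locally?"),
])
print(script)
```

The resulting string would then be pasted into the text input of the multi‑speaker node.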

Supported languages: 24 European languages, among them English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Romanian, Hungarian, Swedish, Danish, Finnish, Norwegian, Greek, Bulgarian, Slovak, Croatian, Serbian, and Turkish. Japanese is not supported.

Benchmark

Results from the authors’ human A/B tests (n = 339).

Rank	Model	Score	Win rate
1	KugelAudio	26	78.0%
2	ElevenLabs Multi v2	25	62.2%
3	ElevenLabs v3	21	65.3%
4	Cartesia	21	59.1%
5	VibeVoice	10	28.8%
6	CosyVoice v3	9	14.2%

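KugelAudio ranks above both ElevenLabs models in this test. The ranking follows the score column, which is why ElevenLabs Multi v2 (62.2% win rate) sits above ElevenLabs v3 (65.3%). These are the authors’ own evaluations, not independent third‑party results, so treat the numbers with caution.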

System Requirements

Mode	VRAM	Notes
Full precision	~19GB	bfloat16
4‑bit quantized	~8GB	CUDA only; SDPA/Eager only
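
As a rough sanity check on these figures, the memory taken by the weights alone can be estimated from the parameter count. This is back‑of‑the‑envelope arithmetic, assuming a nominal 7 billion parameters; the function name `weight_gb` is illustrative.

```python
def weight_gb(params, bits_per_param):
    """Approximate size of the model weights in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


PARAMS = 7e9  # nominal "7B"; the exact count is not given in the article

bf16_gb = weight_gb(PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_gb(PARAMS, 4)   # half a byte per parameter

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{int4_gb:.1f} GB")
```

The bfloat16 estimate lines up with the ~14GB model download mentioned under Using With ComfyUI. The reported VRAM figures (~19GB and ~8GB) are higher than the weights alone, presumably because they also cover activations and other runtime overhead.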

Generation runs at a real‑time factor (RTF) of roughly 1.0. In other words, generating 10 seconds of audio takes about 10 seconds.
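
For reference, RTF is generation time divided by the duration of the audio produced. A small sketch of that arithmetic follows; the function names are illustrative and not part of any KugelAudio API.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def estimated_generation_seconds(audio_seconds, rtf=1.0):
    """Estimate the wall-clock time needed for a clip at a given RTF."""
    return audio_seconds * rtf


# The article's example: 10 s of audio generated in about 10 s.
print(real_time_factor(10.0, 10.0))
# At the same RTF, a 3-minute narration would take about 3 minutes.
print(estimated_generation_seconds(180.0))
```

An RTF below 1.0 means faster than real time; above 1.0 means slower.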

Apple Silicon (M1/M2/M3/M4)

MPS (the Metal backend PyTorch uses on Apple Silicon) is supported, but it is not fully stable.

Status

Memory: a machine with 64GB+ of unified memory can run the unquantized model (~19GB)
Precision: float16 on MPS (no bfloat16)
4‑bit quantization: Unavailable (bitsandbytes is CUDA‑only)
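
The device constraints above can be summarized as a small lookup. This is a sketch under stated assumptions: `runtime_settings` is a hypothetical helper, and the float32 choice for CPU is a guess, since the article does not say which precision CPU mode uses.

```python
def runtime_settings(device, want_4bit=False):
    """Pick precision and quantization for a device, per the notes above."""
    device = device.lower()
    if device == "cuda":
        # bfloat16 is available on CUDA; 4-bit (bitsandbytes) is CUDA-only.
        return {"device": "cuda", "dtype": "bfloat16", "use_4bit": want_4bit}
    if device == "mps":
        # No bfloat16 on MPS and no bitsandbytes, so 4-bit is forced off.
        return {"device": "mps", "dtype": "float16", "use_4bit": False}
    if device == "cpu":
        # Assumption: float32 on CPU (not stated in the article).
        return {"device": "cpu", "dtype": "float32", "use_4bit": False}
    raise ValueError(f"unknown device: {device!r}")


print(runtime_settings("mps", want_4bit=True))
```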

Known Issues

These caveats are noted in the README:

mps_matmul errors may occur
Sometimes you’ll see “incompatible dimensions” or “LLVM ERROR”
If the above errors show up, switch the Device setting to cpu

Practical Options

Try MPS first
If errors occur, switch to CPU mode (much slower)
If you need practical speed, consider a cloud GPU (e.g., RunPod)
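
When driving the model from a script rather than the ComfyUI node, the "MPS first, then CPU" advice could be automated along these lines. This is only a sketch: `generate` stands for whatever hypothetical function runs inference on a given device, and it assumes the failure surfaces as a Python `RuntimeError`. An error that aborts the whole process (which an "LLVM ERROR" may do) cannot be caught this way, and the Device setting then has to be switched by hand.

```python
def generate_with_fallback(generate, text, devices=("mps", "cpu")):
    """Try each device in order and return (device, result) for the first success.

    `generate` is a hypothetical callable: generate(text, device) -> audio.
    """
    last_error = None
    for device in devices:
        try:
            return device, generate(text, device)
        except RuntimeError as err:
            # e.g. "incompatible dimensions" raised by an MPS matmul
            print(f"{device} failed ({err}); trying next device")
            last_error = err
    raise RuntimeError(f"generation failed on all devices: {devices}") from last_error


# Demo with a stand-in generator that fails on MPS.
def fake_generate(text, device):
    if device == "mps":
        raise RuntimeError("incompatible dimensions")
    return f"<audio for {text!r} on {device}>"


print(generate_with_fallback(fake_generate, "Hallo Welt"))
```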

Comparison With Other TTS

Compared with TTS engines previously covered on this blog:

Model	Parameters	Runtime	Japanese	Voice cloning
KugelAudio	7B	GPU (19GB) / 4‑bit (8GB)	❌	✅
Pocket TTS	100M	CPU	❌	✅
VOICEVOX	-	CPU	✅	❌
Style‑Bert‑VITS2	-	GPU recommended	✅	✅

KugelAudio is a large 7B model focused on quality. If you need Japanese, you’ll likely use VOICEVOX or Style‑Bert‑VITS2 instead.

Using With ComfyUI

Install via ComfyUI Manager by searching for “KugelAudio,” or clone manually.

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-KugelAudio.git

On first launch, the model (~14GB) is downloaded automatically.

Core Nodes

KugelAudio TTS: text → speech
KugelAudio Voice Clone: reference audio + text → speech
KugelAudio Multi‑Speaker: multi‑speaker conversation generation
KugelAudio Watermark Check: detect watermark in generated audio

Parameters

cfg_scale: guidance scale (1.0–10.0; default 3.0)
max_new_tokens: max generation length (512–4096; default 2048)
use_4bit: 4‑bit quantization (CUDA only)
attention_type: auto / sage_attn / flash_attn / sdpa / eager
keep_loaded: keep the model in VRAM (faster for consecutive generations)
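
The ranges and constraints listed above can be checked before queuing a job. The following is a hedged sketch, not the node's actual validation: the class `KugelAudioSettings` and its `device` field are hypothetical, and the defaults for `use_4bit`, `attention_type`, and `keep_loaded` are assumptions, since the article gives defaults only for `cfg_scale` and `max_new_tokens`.

```python
from dataclasses import dataclass

ATTENTION_TYPES = ("auto", "sage_attn", "flash_attn", "sdpa", "eager")


@dataclass
class KugelAudioSettings:
    cfg_scale: float = 3.0        # documented default
    max_new_tokens: int = 2048    # documented default
    use_4bit: bool = False        # assumed default
    attention_type: str = "auto"  # assumed default
    keep_loaded: bool = False     # assumed default
    device: str = "cuda"          # hypothetical field for this sketch

    def validate(self):
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if not 1.0 <= self.cfg_scale <= 10.0:
            problems.append("cfg_scale must be between 1.0 and 10.0")
        if not 512 <= self.max_new_tokens <= 4096:
            problems.append("max_new_tokens must be between 512 and 4096")
        if self.attention_type not in ATTENTION_TYPES:
            problems.append(f"unknown attention_type: {self.attention_type}")
        if self.use_4bit:
            if self.device != "cuda":
                problems.append("use_4bit requires CUDA")
            # 4-bit works with SDPA/Eager only; "auto" is left unflagged
            # because the article does not say what it resolves to.
            if self.attention_type in ("sage_attn", "flash_attn"):
                problems.append("use_4bit supports only sdpa or eager attention")
        return problems


print(KugelAudioSettings(use_4bit=True, attention_type="flash_attn").validate())
```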
