KugelAudio — Open‑Source 7B‑Parameter TTS (ComfyUI‑Compatible)

What Is KugelAudio?

KugelAudio is an open‑source Text‑to‑Speech (TTS) model developed by the Hasso‑Plattner‑Institut. It is a large 7B‑parameter model that combines an autoregressive (AR) component with a diffusion component.

Key Features

| Feature | Description |
|---|---|
| Single‑speaker TTS | Generate speech from text |
| Voice cloning | Clone a voice from a 5–30s reference audio |
| Multi‑speaker | Generate conversations with up to 6 speakers (use Speaker N: notation) |
| Watermark | Imperceptible watermark via AudioSeal (detector node available) |
| 4‑bit quantization | Reduce VRAM from ~19GB to ~8GB (CUDA only) |
| Attention options | SageAttention / FlashAttention / SDPA / Eager |
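The Speaker N: notation for multi‑speaker input can be assembled programmatically. The following is a minimal sketch, not the node's own code: `build_script` is a hypothetical helper, and it assumes one turn per line. The source does not say whether numbering starts at 0 or 1, so the speaker numbers are passed through unchanged and only the 6‑speaker limit is checked.

```python
MAX_SPEAKERS = 6  # limit stated in the feature table


def build_script(turns):
    """Format (speaker_number, text) pairs as 'Speaker N: text' lines.

    Assumption: one turn per line. Speaker numbers are used as given.
    """
    distinct = {speaker for speaker, _ in turns}
    if len(distinct) > MAX_SPEAKERS:
        raise ValueError(
            f"at most {MAX_SPEAKERS} speakers supported, got {len(distinct)}"
        )
    return "\n".join(f"Speaker {speaker}: {text.strip()}" for speaker, text in turns)


script = build_script([(1, "Hello, how are you?"), (2, "Fine, thanks.")])
print(script)
```

The resulting string would then go into the text input of the multi‑speaker node.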

Supported languages: 24 European languages, including English, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Czech, Romanian, Hungarian, Swedish, Danish, Finnish, Norwegian, Greek, Bulgarian, Slovak, Croatian, Serbian, and Turkish. Japanese is not supported.

Benchmark

Results from the authors’ human A/B tests (n = 339).

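| Rank | Model | Score | Win rate |
|---|---|---|---|
| 1 | KugelAudio | 26 | 78.0% |
| 2 | ElevenLabs Multi v2 | 25 | 62.2% |
| 3 | ElevenLabs v3 | 21 | 65.3% |
| 4 | Cartesia | 21 | 59.1% |
| 5 | VibeVoice | 10 | 28.8% |
| 6 | CosyVoice v3 | 9 | 14.2% |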

Notably, KugelAudio outscored both ElevenLabs models. Keep in mind, however, that these are the authors’ own evaluations, not independent third‑party results.

System Requirements

| Mode | VRAM | Notes |
|---|---|---|
| Full precision | ~19GB | bfloat16 |
| 4‑bit quantized | ~8GB | CUDA only; SDPA/Eager only |

Generation speed is roughly RTF (Real‑Time Factor) ≈ 1.0×. In other words, generating 10 seconds of audio takes about 10 seconds.
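The RTF arithmetic is simple enough to sketch. RTF is generation time divided by audio duration, so an estimate of wall‑clock time is duration times RTF. The function names below are illustrative, and the default of 1.0 is just the rough figure quoted above; real values depend on hardware and settings.

```python
def estimated_generation_seconds(audio_seconds, rtf=1.0):
    """Estimate wall-clock time from audio length and a Real-Time Factor."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


def measured_rtf(generation_seconds, audio_seconds):
    """Compute RTF from a measured run (values below 1.0 are faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be > 0")
    return generation_seconds / audio_seconds


print(estimated_generation_seconds(10))  # 10.0 at RTF 1.0
print(measured_rtf(45.0, 30.0))          # 1.5, i.e. slower than real time
```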

Apple Silicon (M1/M2/M3/M4)

MPS is supported, but stability is an issue.

Status

  • Memory: With 64GB+ you can run full precision (~19GB)
  • Precision: float16 on MPS (no bfloat16)
  • 4‑bit quantization: Unavailable (bitsandbytes is CUDA‑only)
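The constraints so far (full precision at ~19GB, 4‑bit at ~8GB on CUDA only, float16 instead of bfloat16 on MPS) can be condensed into a small decision helper. This is a sketch under stated assumptions: `choose_mode` and the returned dictionary keys are hypothetical, the thresholds are the approximate figures from the System Requirements table, and no headroom for the OS or other applications is accounted for.

```python
FULL_PRECISION_GB = 19  # approximate figure from the System Requirements table
FOUR_BIT_GB = 8         # approximate figure; 4-bit mode is CUDA only


def choose_mode(backend, memory_gb):
    """Suggest settings for a backend ('cuda', 'mps', 'cpu') and available memory."""
    if backend == "cuda":
        if memory_gb >= FULL_PRECISION_GB:
            return {"device": "cuda", "dtype": "bfloat16", "use_4bit": False}
        if memory_gb >= FOUR_BIT_GB:
            # 4-bit mode works only with SDPA or Eager attention.
            return {"device": "cuda", "use_4bit": True, "attention_type": "sdpa"}
        raise ValueError("not enough VRAM even for 4-bit mode")
    if backend == "mps":
        if memory_gb >= FULL_PRECISION_GB:
            # MPS has no bfloat16 and no bitsandbytes, so float16 only.
            return {"device": "mps", "dtype": "float16", "use_4bit": False}
        raise ValueError("not enough memory for full precision on MPS")
    if backend == "cpu":
        return {"device": "cpu", "use_4bit": False}
    raise ValueError(f"unknown backend: {backend!r}")


print(choose_mode("cuda", 12))
print(choose_mode("mps", 64))
```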

Known Issues

These caveats are noted in the README:

  • mps_matmul errors may occur
  • Sometimes you’ll see “incompatible dimensions” or “LLVM ERROR”
  • If the above errors show up, switch the Device setting to cpu

Practical Options

  1. Try MPS first
  2. If errors occur, switch to CPU mode (much slower)
  3. If you need practical speed, consider a cloud GPU (e.g., RunPod)
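If you drive the model from a script rather than the ComfyUI Device setting, the "try MPS, then CPU" order could be automated roughly as follows. This is a sketch only: `generate` stands in for whatever callable runs inference (it is hypothetical, not a KugelAudio API), and it assumes failures surface as a Python `RuntimeError`. An "LLVM ERROR" may abort the whole process, in which case nothing can be caught and the device has to be switched manually.

```python
def generate_with_fallback(generate, text, devices=("mps", "cpu")):
    """Try each device in order and return (audio, device_used)."""
    last_error = None
    for device in devices:
        try:
            return generate(text, device=device), device
        except RuntimeError as err:
            # Remember the failure and move on to the next device.
            last_error = err
    raise RuntimeError(f"all devices failed: {devices}") from last_error


# Stand-in for a real inference call, used only to demonstrate the control flow.
def fake_generate(text, device):
    if device == "mps":
        raise RuntimeError("incompatible dimensions")
    return f"<audio for {text!r}>"


audio, used = generate_with_fallback(fake_generate, "Hallo Welt")
print(used)  # cpu
```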

Comparison With Other TTS

Compared with TTS engines previously covered on this blog:

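| Model | Parameters | Runtime | Japanese | Voice cloning |
|---|---|---|---|---|
| KugelAudio | 7B | GPU (19GB) / 4‑bit (8GB) | No | Yes |
| Pocket TTS | 100M | CPU | | |
| VOICEVOX | - | CPU | Yes | |
| Style‑Bert‑VITS2 | - | GPU recommended | Yes | |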

KugelAudio is a large 7B model focused on quality. If you need Japanese, you’ll likely use VOICEVOX or Style‑Bert‑VITS2 instead.

Using With ComfyUI

Install via ComfyUI Manager by searching for “KugelAudio,” or clone manually.

cd ComfyUI/custom_nodes
git clone https://github.com/Saganaki22/ComfyUI-KugelAudio.git

On first launch, the model (~14GB) is downloaded automatically.

Core Nodes

  • KugelAudio TTS: text → speech
  • KugelAudio Voice Clone: reference audio + text → speech
  • KugelAudio Multi‑Speaker: multi‑speaker conversation generation
  • KugelAudio Watermark Check: detect watermark in generated audio

Parameters

  • cfg_scale: guidance scale (1.0–10.0; default 3.0)
  • max_new_tokens: max generation length (512–4096; default 2048)
  • use_4bit: 4‑bit quantization (CUDA only)
  • attention_type: auto / sage_attn / flash_attn / sdpa / eager
  • keep_loaded: keep the model in VRAM (faster for consecutive generations)
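The ranges and restrictions listed above can be checked before queuing a job. The sketch below is not the node's own validation code: `KugelAudioParams` and its `device` field are hypothetical, and the `False` defaults for `use_4bit` and `keep_loaded` are assumptions, since the source gives defaults only for `cfg_scale` and `max_new_tokens`.

```python
from dataclasses import dataclass

ATTENTION_TYPES = ("auto", "sage_attn", "flash_attn", "sdpa", "eager")


@dataclass
class KugelAudioParams:
    cfg_scale: float = 3.0        # documented range 1.0-10.0
    max_new_tokens: int = 2048    # documented range 512-4096
    use_4bit: bool = False        # assumed default
    attention_type: str = "auto"
    keep_loaded: bool = False     # assumed default
    device: str = "cuda"          # hypothetical field for this sketch

    def validate(self):
        """Return a list of human-readable problems (empty if none)."""
        errors = []
        if not 1.0 <= self.cfg_scale <= 10.0:
            errors.append("cfg_scale must be between 1.0 and 10.0")
        if not 512 <= self.max_new_tokens <= 4096:
            errors.append("max_new_tokens must be between 512 and 4096")
        if self.attention_type not in ATTENTION_TYPES:
            errors.append(f"attention_type must be one of {ATTENTION_TYPES}")
        if self.use_4bit and self.device != "cuda":
            errors.append("use_4bit requires a CUDA device")
        if self.use_4bit and self.attention_type in ("sage_attn", "flash_attn"):
            errors.append("4-bit mode supports only SDPA or Eager attention")
        return errors


print(KugelAudioParams().validate())  # []
print(KugelAudioParams(use_4bit=True, device="mps").validate())
```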