Khala open-source song generator: 24GB VRAM, 64-layer RVQ, quality flag live
Khala is an open-source implementation for generating full-length songs from lyrics plus text conditions. It ships a Web UI, a FastAPI backend, a GPU inference worker, and the model weights on Hugging Face — so it’s not just a paper, but a complete kit for actually running the system.
The catch: as a local music-generation setup, it sits much closer to the GPU-server side of the spectrum than a desktop-oriented tool like ACE-Step.
The README recommends 24GB+ NVIDIA GPUs and assumes Docker + NVIDIA Container Toolkit.
This is less “can I squeeze it onto a Mac or a low-VRAM GPU” and more “let’s run a research implementation on an RTX 4090+ box.”
What you get out: one mixed audio file with vocals and accompaniment
“Music generation” covers several different output shapes depending on the model.
| Output type | What you get |
|---|---|
| MIDI / symbolic | A sequence of notes. Editable in a DAW — swap instruments, change chords |
| Stem-separated | Separate tracks for vocals, drums, bass, etc. Remix freely afterwards |
| Accompaniment-only | BGM without vocals |
| Full-song | One finished file with vocals, accompaniment, and spatial effects all mixed in |
Khala is the last type, the “full-song” kind, in the same family as Suno, Udio, and ACE-Step.
The output is one mixed WAV and one mixed MP3 (plus a JSON metadata file) written into backend/generated_audio/.
You don’t get separated stems for vocals or accompaniment. It isn’t MIDI either, so you can’t edit it later in a DAW by saying “change the chord here.”
The behavior with lyrics is easier to read once you accept this framing.
The goal spec is that the vocal part inside that one mixed file actually sings the lyrics passed in lyrics.
The paper’s claim is that lyric-to-vocal time alignment emerges from pure acoustic-token language modeling.
That said, perfect word-for-word alignment isn’t guaranteed.
Suno and Udio both produce lyric drift and pronunciation glitches regularly, and Khala still has its 2026-05-07 quality warning (numerical-precision suspicion) unresolved — so lyric alignment isn’t a useful evaluation axis yet.
If you pick mode="instrumental", only the accompaniment side comes out.
In that case the lyrics field is ignored, and the desired accompaniment character is described via tags or a natural-language prompt.
Conversely, if you use mode="vocal" with empty lyrics, generation runs without any anchor for the vocal line and the output is unstable.
Length is clamped to the 1–10 minute range.
Each request writes one WAV and one MP3 as different encodings of the same audio, plus a JSON file with the generation settings and the prompt used.
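As a quick check of what a run leaves behind, you can list the output directory and pretty-print the newest metadata JSON. A small sketch, run from the repo root, with jq assumed installed; the field names inside the JSON aren't covered here.

```bash
# Newest outputs first: each request leaves a .wav, .mp3, and .json behind
ls -lt backend/generated_audio/ | head

# Pretty-print the most recent metadata file (generation settings and prompt)
jq . "$(ls -t backend/generated_audio/*.json | head -n 1)"
```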
What parts are actually mixed in
Saying “one mixed file” doesn’t tell you what’s mixed inside it. That depends on the song.
Cross-referencing the tag catalog (backend/tags.json, especially Vocals/Instruments and Rhythm/Production) with typical pop track stacks, the elements Khala can put into its output line up roughly like this.
| Part | Role / Example tags |
|---|---|
| Lead vocal (main melody) | The voice actually singing the lyrics. Only when mode="vocal". Use Male Vocals / Female Vocals to bias gender |
| Backing vocals / harmonies | Added via tags like Harmonized Vocals / Choir / Duet / Male and Female Vocals / Vocal Chops |
| Drums / percussion | Electronic Drums / Drum Machine / Driving Drums / Percussion / Four-on-the-floor |
| Bass | Synth Bass / 808 Bass / Sub-bass / Heavy Bass / Distorted Bass |
| Lead and rhythm instruments | Piano / Solo Piano / Electric Guitar / Distorted Guitar / Acoustic Guitar / Synth Lead / Brass / Cello / Guzheng |
| Pads / harmonic layer | Synth Pads / Atmospheric Pads / Strings / Synth Strings / Synth Arpeggios |
| Spatial / mix processing | Reverb / Lush Reverb / Wide Stereo / Sidechain Compression / Vinyl Crackle / Tape Hiss |
The thing to keep in mind: the “parts” above are how a human ear classifies the result, not separate tracks inside Khala.
What the model emits is a sequence of 64-layer RVQ acoustic tokens, and at that level vocals, drums, piano, and pads are all decided together as parts of the same intertwined signal.
Passing the tag Piano doesn’t “add a piano track” — it biases the whole signal generation toward audio that sounds like a piano is playing.
The same applies to the vocal/accompaniment relationship: at the signal level they’re entangled from the start.
A common misconception is to picture a karaoke-style or DAW-style flow where you build the backing first and then overlay vocals on top. Khala (and the Suno/Udio family of full-song generators) does not work that way.
From the very first step, a vocal-bearing waveform is generated as a whole — there is no stage where “the BGM is ready and the vocals get laid on top.”
Lyric-to-vocal time alignment also happens inside the acoustic-token language model itself, as “put the right sounds in the position where it should sound like the words are being sung.” That’s the paper’s framing.
The reason stem separation is fundamentally hard and partial edits are limited comes directly from this joint-generation structure.
If you don’t like part of it, you’re basically stuck regenerating
The “one mixed file as the only output” shape becomes a hard operational constraint in practice.
“The A-section is fine but I don’t like the vocals in the chorus”, “only the third line of the lyrics is mispronounced”, “I want a different drum pattern only” — none of these partial fixes are possible through Khala’s official API.
POST /generate is an endpoint that produces a song from scratch; the request body has no slot for inpainting or continuation (re-synthesizing a specific section, or extending an existing track).
And since vocals-only and accompaniment-only stems are not available, you can't say “keep the backing, just replace the vocal” either.
The realistic options come down to roughly these:
- Regenerate with the same prompt but a different temperature or top_k_bb (roll the gacha)
- Rewrite the prompt, tags, or lyrics and regenerate
- Do waveform-level post-processing afterwards in a DAW (cut, EQ, time-stretch, pitch correction)
- Use stem-separation tools like Demucs to pseudo-separate vocals/accompaniment (lossy)
The first two dominate, i.e. a gacha-style “keep regenerating until one of them is the keeper” workflow.
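If regeneration rolls are the main workflow, it's easy to script them against the API instead of clicking through the UI. A minimal sketch, assuming the API address and request fields described later in this article, plus jq for pulling the job id out of the response; the specific temperature sweep is only an illustration.

```bash
# Script the "roll" by sweeping the sampling temperature over the same prompt.
# Assumes the API on 127.0.0.1:8889 (as in the curl examples later) and jq installed.
# Jobs are queued and dispatched to idle workers, so on one GPU they run back to back.
for temp in 0.8 1.0 1.2; do
  curl -s -X POST http://127.0.0.1:8889/generate \
    -H "Content-Type: application/json" \
    -d "{
      \"mode\": \"vocal\",
      \"prompt_mode\": \"tags\",
      \"tags\": \"J-Pop, Female Vocals, Piano, Melancholic, Slow Tempo\",
      \"lyrics\": \"(paste the full lyrics here)\",
      \"duration\": 3,
      \"top_k_bb\": 80,
      \"temperature\": ${temp}
    }" | jq -r '.job_id'
done
```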
Suno and Udio share the same basic shape: they're built around betting on the per-song roll.
That said, when rolling alone can’t get you to acceptable quality, putting external stem separation in between and doing partial edits is the realistic escape hatch.
The “fake stem-based editing” route via post-hoc separation
Khala itself doesn’t have separation features, but if you feed the output WAV into an existing stem-separation tool, you can recover some level of per-part editing.
This is the route you take if you want to swap a vocal take, EQ an instrument, or otherwise pull Khala’s output into a normal DAW production flow.
Representative tools:
| Tool | Notes |
|---|---|
| Demucs (Meta, OSS) | The standard for 4-stem separation (vocals / drums / bass / other). Currently among the highest-quality options |
| MDX-Net | A newer separation model. Often used alongside Demucs |
| Spleeter (Deezer, OSS) | The original. Lets you pick 2-stem / 4-stem / 5-stem |
| UVR (Ultimate Vocal Remover, OSS) | A GUI front-end that wraps the above model families. On Windows, this or Google Colab is the easiest path |
All of these do post-hoc estimation. They aren’t pulling out individual tracks that Khala held internally — Khala never had them.
You’ll get bleed of the backing in the vocal stem, leftover cymbal tails in the drumless mix, and at 4-stem granularity the other bucket contains piano, guitar, and pads all mashed together.
Per-instrument precision separation, like “extract only the piano so I can reharmonize,” is hard with current 4-stem systems.
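As one concrete example, a 4-stem pass with Demucs over a Khala output looks roughly like the sketch below. It assumes a Python environment with the demucs package installed and uses a placeholder file name; the flags are Demucs' own standard CLI options, nothing Khala-specific.

```bash
# Install Demucs into its own Python environment
pip install demucs

# Full 4-stem separation: vocals / drums / bass / other
# ("song.wav" is a placeholder for an actual Khala output file)
demucs backend/generated_audio/song.wav

# Or just a vocal / accompaniment split
demucs --two-stems=vocals backend/generated_audio/song.wav

# By default, results land under separated/<model name>/<track name>/ as one WAV per stem
```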
A typical real-world workflow ends up looking like this:
- Generate a song with Khala (gacha included)
- Run the output WAV through Demucs / UVR for 4-stem separation
- Load into a DAW and edit: swap the vocal for a different take, change the drum pattern, EQ or sidechain only the accompaniment side, etc.
- Remix and bounce again
The more Khala’s output actually “sounds like a real performance,” the better stem separation tends to work; the more AI-style distortion and artifacts ride along, the noisier the separation gets.
Until the 2026-05-07 quality warning is gone, even this post-separation editing route is hard to plan around.
When you’re folding full-song generators into a production flow, the shape is: use Khala as a rough-draft generator, run Demucs (or similar) to get stems, then refine in the DAW. Using it solo as a “final-asset generator” is hard while both the quality warning and the non-commercial license are still in place.
64-layer RVQ is the generation target directly
The core of Khala is a design that discretizes audio with 64-layer RVQ (Residual Vector Quantization) and uses that hierarchy across both coarse musical structure and fine acoustic detail.
The paper frames this as handling structure and high-fidelity audio not in separate representation spaces, but in stages within the same deep acoustic-token hierarchy.
The pipeline splits in two.
A Backbone produces the coarse acoustic tokens of the full song, and a Super-resolution model fills in the remaining fine RVQ layers.
A DAC RVQ decoder finally converts that back to a waveform.
```mermaid
flowchart TD
  A[Prompt and lyrics] --> B[Backbone<br/>coarse acoustic tokens]
  B --> C[Super-resolution<br/>q0 to q63]
  C --> D[DAC RVQ decoder]
  D --> E[WAV and MP3]
```
The “Super-resolution” here isn’t image upscaling — it’s a process that fills in the finer layers of the acoustic-token hierarchy.
The arXiv abstract notes that the super-resolution stage operates at full-song scale and processes the time dimension in parallel while filling layers sequentially, resulting in a fixed 62-step inference.
Handling lyric alignment without semantic tokens
Speech and music generation often split tokens into a semantic-leaning side (HuBERT or w2v-BERT family) and a quality-leaning side (DAC or EnCodec family).
The two-stage idea is: decide “what is being said” on the semantic side and “how it sounds” on the quality side, so the text-to-vocal correspondence becomes tractable.
Khala doesn’t split there. It claims the text-vocal correspondence emerges purely from acoustic-token language modeling.
To make this work, training uses a hybrid attention pattern.
The Backbone’s lyric-alignment side uses causal attention (the natural auto-regressive ordering), while the Super-resolution’s per-layer refinement uses full attention (bidirectional).
That way the lyrics-to-vocal time ordering is preserved, while the layer-direction detail filling can use full surrounding context.
The paper also reports that initializing the Super-resolution model from a Backbone checkpoint, rather than from scratch, improves both convergence and final quality.
Since both stages live in the same acoustic-token space, the coarse-side representation makes a useful starting point for the fine side.
Inference is fixed at 62 steps.
The super-resolution stage runs in parallel over time and sequentially over the RVQ layer dimension (q0 to q63), so step count doesn’t grow with song length.
The flip side: even short clips pay the full 62-step cost, so this design has comparatively high overhead for short-clip use cases like loops.
This is a different family from the Flow-matching approach covered in the ACE-Step V1.0 background notes.
| Aspect | Khala | ACE-Step V1.0 |
|---|---|---|
| Generation method | 64-layer RVQ acoustic-token LM | Flow-matching diffusion |
| Two-stage? | Backbone → Super-resolution | Single stage |
| Inference steps | Fixed 62 | Variable by diffusion steps |
| Recommended GPU | NVIDIA 24GB+ | Runs on Mac / mid-tier GPUs |
| Distribution | Docker + FastAPI + Vite | Desktop-app oriented |
| License | CC BY-NC 4.0 (non-commercial) | Apache-2.0 family |
ACE-Step is an implementation that puts local execution speed and production-style features front and center, while Khala takes the approach of pushing musical structure and acoustic detail through the same token hierarchy.
The implementation is a researcher-oriented server setup
The README aims the current release at researchers and developers who are already comfortable on GPU servers.
The prebuilt image is ghcr.io/davidliujiafeng/khala-env:ngc25.02-node24; docker run --gpus all puts you in an NGC PyTorch-based container.
Host → container port mappings open 30869 (frontend) and 8889 (API).
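Put together, a first launch looks roughly like the sketch below. The image name and port numbers are as summarized above; the working-directory mount is an assumption for illustration, so follow the README for the exact invocation.

```bash
# Pull and enter the prebuilt NGC PyTorch-based container.
# Port 30869 = frontend, 8889 = API; the repo mount below is an assumption.
docker run --gpus all -it \
  -p 30869:30869 \
  -p 8889:8889 \
  -v "$(pwd)":/workspace \
  ghcr.io/davidliujiafeng/khala-env:ngc25.02-node24
```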
The system has three layers.
The frontend is Vite + React and collects the prompt, lyrics, and generation settings.
An API dispatcher (backend_api.py) turns requests into jobs, queues them, and routes them to an idle inference worker.
The inference worker (backend_worker.py) runs one process per GPU, walking the tokenizer, Megatron Backbone, Super-resolution, and decoder in sequence to produce audio.
```mermaid
flowchart TD
  A[Frontend UI] --> B[backend_api.py]
  B --> C[backend_worker.py]
  C --> D[Backbone]
  D --> E[Super-resolution]
  E --> F[Decoder]
  F --> G[generated_audio/]
  G --> B
  B --> A
```
run_backend.sh got a 2026-05-11 update making it easy to launch on a single GPU with safe defaults.
GPU selection is --gpus 0 (single GPU), --gpus 0,1, or --gpus 6,7 (multi-GPU); one worker spawns per GPU id.
Runtime mode is --runtime-mode one_shot vs --runtime-mode keep_loaded.
one_shot frees parts of the model after each request, so it’s the safer side for a single 24GB-class GPU.
keep_loaded keeps weights resident in VRAM, which suits higher-memory environments doing bursts of inference.
Pick by whether you prioritize throughput or want to avoid VRAM pressure.
Ports can be changed via API_PORT, WORKER_BASE_PORT, and BASE_MASTER_PORT inside run_backend.sh.
API_PORT is the frontend-facing entry, WORKER_BASE_PORT is what each worker binds, and BASE_MASTER_PORT is used by Megatron’s distributed init.
If you spawn multiple workers and collide with an existing service, shift these three.
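In concrete terms, that gives launch lines like these (flag spellings as summarized above; invocation from the repo root and the exact defaults are assumptions):

```bash
# Single 24GB-class GPU, weights freed after each request (the safer default)
./run_backend.sh --gpus 0 --runtime-mode one_shot

# Two GPUs with weights kept resident, for back-to-back inference on larger boxes
./run_backend.sh --gpus 0,1 --runtime-mode keep_loaded

# Port clashes: edit API_PORT / WORKER_BASE_PORT / BASE_MASTER_PORT inside run_backend.sh
```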
Logs split into backend/logs/api.log and backend/logs/worker_0.log (per worker id).
The API log shows job queueing and dispatch; the worker log shows where things stop within Backbone → Super-resolution → Decoder.
Outputs end up under backend/generated_audio/ as a .wav, .mp3, and .json metadata set.
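On a single-GPU setup (worker id 0), tailing both logs side by side is the quickest way to see where a job currently is:

```bash
# api.log: job queueing and dispatch; worker_0.log: Backbone → Super-resolution → Decoder progress
tail -f backend/logs/api.log backend/logs/worker_0.log
```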
What you actually throw at it to generate a song
The UI (or POST /generate) takes 8 fields.
They map directly to GenerateRequest in backend_api.py.
| Field | Type / default | Description |
|---|---|---|
| mode | "vocal" / "instrumental" | With vocals or instrumental only |
| prompt_mode | "tags" / "natural" | Specify with tags, or with a natural-language prompt |
| prompt | string | Free text used when prompt_mode="natural" |
| tags | string (comma-separated) | Tag list used when prompt_mode="tags" |
| lyrics | string | Lyrics. Only used when mode="vocal" |
| duration | int, default 3 | Song length in minutes. Clamped to the 1–10 range |
| top_k_bb | int, default 80 | Top-k sampling for the Backbone |
| temperature | float, default 1.0 | Sampling temperature for the Backbone |
The tag catalog is exposed at GET /tags (backed by backend/tags.json).
It splits into four categories, shown as per-category chips in the UI.
| Category | Example tags |
|---|---|
| Genre/Style | J-Pop / Lo-fi / Cinematic / EDM / Hip-hop / Synthwave / Ballad / Anime |
| Vocals/Instruments | Male Vocals / Female Vocals / Piano / Synth Bass / Distorted Guitar / Strings / 808 Bass / Instrumental |
| Emotion/Mood | Melancholic / Nostalgic / Uplifting / Atmospheric / Anthemic / Dreamy |
| Rhythm/Production | Slow Tempo / 128 BPM / Reverb / Wide Stereo / Lo-fi Hiss / Sidechain Compression |
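The same catalog the UI renders as chips can be pulled directly (jq assumed installed; the response structure isn't detailed here, so the command just pretty-prints it):

```bash
# Dump the tag catalog backed by backend/tags.json
curl -s http://127.0.0.1:8889/tags | jq .
```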
Lyrics go through clean_lyrics() internally.
That’s a simple preprocessor that just converts Japanese/Chinese full-width punctuation (、 。 ? 「」 () etc.) and full-width alphanumerics to ASCII. There’s no strict format requirement like demanding structural tags ([Verse], [Chorus], etc.).
You can paste the raw text and pass it as-is.
Tag mode, minimal example
```bash
curl -X POST http://127.0.0.1:8889/generate \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "vocal",
    "prompt_mode": "tags",
    "tags": "J-Pop, Female Vocals, Piano, Melancholic, Slow Tempo",
    "lyrics": "夜風が窓を叩く\n眠れないまま朝を待つ",
    "duration": 3,
    "top_k_bb": 80,
    "temperature": 1.0
  }'
```
The response returns a job_id. Progress is at GET /job/{job_id}, and the audio at GET /job/{job_id}/track/0/mp3 (or /wav).
Natural-language mode, minimal example
```bash
curl -X POST http://127.0.0.1:8889/generate \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "instrumental",
    "prompt_mode": "natural",
    "prompt": "A dreamy lo-fi hip-hop instrumental with vinyl crackle, warm piano chords, and a slow boom-bap beat around 80 BPM.",
    "duration": 2,
    "top_k_bb": 80,
    "temperature": 1.0
  }'
```
When mode="instrumental", lyrics is ignored.
Conversely, calling mode="vocal" with empty lyrics runs generation without any anchor for the vocal line and the output is unstable.
Job progression and phases
Workers move internal state through backbone → superres → decoding → idle.
The GET /job/{job_id} response includes the current phase, so a UI or external tool can show “what’s currently being processed.”
If processing feels long but worker_0.log is still moving through phases, the job is alive. When it actually gets stuck, the usual culprits are the first Backbone load or a CUDA OOM.
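Putting the endpoints together, a crude status check looks like the sketch below. The /job/{job_id} route, the phase field, and the track download paths are from the descriptions above; the exact JSON layout of the status response isn't documented here, so treat the jq pretty-print as a placeholder.

```bash
JOB_ID="put-the-job_id-from-POST-/generate-here"   # placeholder

# Check the job status (includes the current phase); re-run until it reports completion
curl -s "http://127.0.0.1:8889/job/${JOB_ID}" | jq .

# Then fetch both encodings of the finished track
curl -o out.mp3 "http://127.0.0.1:8889/job/${JOB_ID}/track/0/mp3"
curl -o out.wav "http://127.0.0.1:8889/job/${JOB_ID}/track/0/wav"
```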
~52GB of weights, non-commercial license
Checked via Hugging Face API: liujiafeng/Khala-MusicGeneration-v1.0 uses about 52GB of storage.
The contents split into Backbone, Super-resolution, and DAC RVQ decoder checkpoints.
The README’s procedure has you create checkpoints/ at the repo root and pull with hf download liujiafeng/Khala-MusicGeneration-v1.0 --local-dir checkpoints.
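Spelled out from the README's description, the weight pull looks like this (the hf CLI ships with recent huggingface_hub releases; budget roughly 52GB of disk):

```bash
# From the repo root: download all checkpoint sets (~52GB) into checkpoints/
mkdir -p checkpoints
hf download liujiafeng/Khala-MusicGeneration-v1.0 --local-dir checkpoints
```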
The model card license is cc-by-nc-4.0.
The GitHub README also states the model weights are intended to be released under Creative Commons Attribution-NonCommercial 4.0 International.
If your plan is to use the output as commercial production material, it's not usable at this point.
The code-side GitHub repository as of 2026-05-17 is at 113 stars, 9 forks, with no release tagged.
About 30 of those stars arrived in the last 24 hours, coinciding with the 2026-05-16 demo page release.
On the Hugging Face side, the model card reports 13 likes and 0 downloads.
Numbers alone don’t decide quality, but this isn’t yet at the level of a mature daily-use tool.
The 2026-05-07 quality warning is still up
The README’s News section still carries a 2026-05-07 entry: there is an issue under investigation that may significantly affect inference quality, with numerical precision suspected as a possible cause.
The 2026-05-16 online audio demo went live, but this warning has not been removed.
In this state, when the same prompt produces variable quality depending on runtime mode or environment, you can’t separate whether it’s a model-design issue or a numerical-precision issue.
For the time being, the useful verification targets are less about audio-quality judgement and more about environment setup, memory usage, worker separation, job processing, and weight placement.
If you want to try it, start from a 4090+ Docker environment
To run it locally, prepare a CUDA environment around RTX 4090 / A5000 / A6000 / L40S / A100, and start by running it through Docker per the README.
Given the combination of 24GB+ VRAM, ~52GB of weights, and the NGC PyTorch container, an M1 Max or an RTX 4060 Laptop GPU won't meet the recommended conditions.
For initial verification, audio-quality judgment matters less than getting one song’s WAV and MP3 to actually land in backend/generated_audio/ while watching backend/logs/api.log and backend/logs/worker_0.log.
With both the non-commercial license (CC BY-NC 4.0) and the 2026-05-07 quality warning still in place, the path from output to commercial production material is closed off.