ACE-Step: an open-source music generation model that runs locally
What is ACE-Step?
ACE-Step is an open-source foundation model for music generation. It bills itself as the “Stable Diffusion of music” and can generate songs from text prompts.
The standout feature is speed. On an A100, it can generate up to 4 minutes of music in about 20 seconds. Compared to LLM-based music generation (roughly what Suno and Udio are believed to use internally), that is about 15x faster.
Licensed under Apache 2.0, it has over 3,800 stars on GitHub.
Architecture
The technical stack is a combination of:
- Deep Compression AutoEncoder (DCAE): Audio compression derived from Sana
- Lightweight Linear Transformer: Enables fast inference
- Flow-matching: A variant of diffusion models
- MERT / m-hubert: Semantic alignment during training
Think of it as analogous to the VAE + U-Net + CLIP architecture in image generation AI.
Key features
Basics
- Up to 4 minutes of music per generation
- 19 languages supported (strong performance in 10 major languages including Japanese)
- Both vocal and instrumental output
- Wide range of genres and styles
Control features
| Feature | Description |
|---|---|
| Variations | Generate variations via noise mixing |
| Repainting | Regenerate only specific sections |
| Lyric editing | Change lyrics while preserving melody and accompaniment |
| Audio extension | Extend existing tracks |
LoRA support
- Lyric2Vocal: Specialized for vocal generation
- Text2Samples: Instrument sample generation
- RapMachine: Rap-specialized
LoRA training code is also available, so you can fine-tune it yourself.
Installation
Prerequisites
- Python 3.10 or later
- NVIDIA GPU (CUDA) or Apple Silicon
Steps
```bash
git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
pip install -e .
```
Models are automatically downloaded to `~/.cache/ace-step/checkpoints` on first launch.
macOS
bfloat16 is not reliably supported on Apple Silicon (MPS), so disable it with a flag at startup:

```bash
acestep --port 7865 --bf16 false
```
Usage
Launching the Gradio UI
```bash
acestep --port 7865
```
Open http://localhost:7865 in a browser to access the UI.
Main options
| Option | Description |
|---|---|
| `--checkpoint_path` | Path to a custom model |
| `--torch_compile` | Enable `torch.compile` optimization (requires triton) |
| `--cpu_offload` | Offload weights to CPU to save VRAM |
| `--bf16` | Use bfloat16 (set to `false` on macOS) |
| `--share` | Generate a public Gradio share link |
Prompt format
Lyrics and style are specified separately. The lyrics field uses section tags:
```text
[verse]
Write lyrics here
La la la
[chorus]
Chorus lyrics
```
Style is specified in a separate field, e.g. “J-pop, female vocal, energetic”.
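To make the two-field format concrete, here is a minimal sketch that assembles the tagged lyrics string. This is purely illustrative: the `build_lyrics` helper is hypothetical and not part of ACE-Step's codebase; only the `[verse]`/`[chorus]` tag convention and the separate style field come from the UI.

```python
# Hypothetical helper: joins named sections into the tagged lyrics format
# that ACE-Step's lyrics field expects. Not an ACE-Step API.
def build_lyrics(sections: dict[str, str]) -> str:
    """Prefix each section's text with its [tag] marker."""
    return "\n".join(f"[{tag}]\n{text}" for tag, text in sections.items())

lyrics = build_lyrics({
    "verse": "Write lyrics here\nLa la la",
    "chorus": "Chorus lyrics",
})
style = "J-pop, female vocal, energetic"  # entered in the separate style field

print(lyrics)
```

The point is that lyrics carry the structure tags while the style description stays free-form text in its own field.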
Performance by hardware
Official benchmark (27 steps, batch size 1):
| GPU | Real-time multiplier | Time per 1 min of audio |
|---|---|---|
| RTX 4090 | 34.48x | 1.74s |
| A100 | 27.27x | 2.20s |
| RTX 3090 | 12.76x | 4.70s |
| M2 Max | 2.27x | 26.43s |
Expected performance on M1 Max 64GB
Given that the M2 Max takes about 26 seconds per minute of audio, the M1 Max should land around 30 seconds per minute. 64GB of memory is more than enough.
At roughly 2-3x real-time speed, a 4-minute song would take just under 2 minutes to generate. Practical enough.
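The table and the estimate above both follow from one formula: generation time = audio length ÷ real-time multiplier. A quick sanity check (note the 2x figure for M1 Max is this article's assumption, not an official benchmark):

```python
# generation time = audio length / real-time multiplier
def gen_time(audio_seconds: float, rt_multiplier: float) -> float:
    return audio_seconds / rt_multiplier

# M2 Max at 2.27x, one minute of audio:
print(round(gen_time(60, 2.27), 2))  # 26.43 -- matches the benchmark table

# Assumed ~2x for M1 Max: a 4-minute (240 s) song
print(round(gen_time(240, 2.0), 1))  # 120.0 -- about two minutes
```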
Comparison with Suno and Udio
| Aspect | ACE-Step | Suno/Udio |
|---|---|---|
| Runtime | Local | Cloud |
| Cost | Electricity only | Subscription |
| Commercial use | Apache 2.0 | Check terms |
| Customization | LoRA support | Not available |
| Quality | Still developing | High quality |
Running locally is a significant advantage, especially when commercial use or customizability matters.
Quality-wise, ACE-Step has not yet caught up with Suno or Udio by most accounts. V1.5 is reportedly in development.
Roadmap
According to the official roadmap:
- Done: Training code, LoRA training, RapMachine, technical report
- Planned: ACE-Step V1.5, ControlNet training, Singing2Accompaniment
Once ControlNet support arrives, it should enable workflows like specifying a melody and generating accompaniment around it.