
ACE-Step: an open-source music generation model that runs locally

Ikesan

What is ACE-Step?

ACE-Step is an open-source foundation model for music generation. It bills itself as the “Stable Diffusion of music” and can generate songs from text prompts.

The standout feature is speed. On an A100, it can generate up to 4 minutes of music in about 20 seconds. Compared to LLM-based music generation (roughly what Suno and Udio are believed to use internally), that is about 15x faster.

Licensed under Apache 2.0, it has over 3,800 stars on GitHub.

Architecture

The technical stack is a combination of:

  • Deep Compression AutoEncoder (DCAE): Audio compression derived from Sana
  • Lightweight Linear Transformer: Enables fast inference
  • Flow-matching: A variant of diffusion models
  • MERT / m-hubert: Semantic alignment during training

Think of it as analogous to the VAE + U-Net + CLIP architecture in image generation AI.
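To make the flow-matching piece concrete, here is a toy sketch of the idea: the model learns a velocity field along a straight path between noise and data, then integrates that field at inference time. This is a generic illustration of rectified-flow-style training targets, not ACE-Step's actual code; the array shapes and values are made up.

```python
import numpy as np

def flow_matching_step(x0, x1, t):
    """Linear interpolation path used in flow matching.

    x0: a noise sample, x1: a data sample, t in [0, 1].
    Returns the point on the path and the target velocity the
    network would be trained to predict (here simply x1 - x0).
    """
    xt = (1.0 - t) * x0 + t * x1   # point along the straight noise-to-data path
    v_target = x1 - x0             # constant target velocity along that path
    return xt, v_target

# Toy example: a length-4 vector standing in for an audio latent
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)              # pure noise
x1 = np.array([1.0, 0.5, -0.5, 0.0])     # "data"
xt, v = flow_matching_step(x0, x1, 0.5)  # halfway along the path
```

Because the target velocity is constant along the path, sampling needs far fewer integration steps than classic diffusion, which is part of why generation is so fast.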

Key features

Basics

  • Up to 4 minutes of music per generation
  • 19 languages supported (strong performance in 10 major languages including Japanese)
  • Both vocal and instrumental output
  • Wide range of genres and styles

Control features

  • Variations: Generate variations via noise mixing
  • Repainting: Regenerate only specific sections
  • Lyric editing: Change lyrics while preserving melody and accompaniment
  • Audio extension: Extend existing tracks

LoRA support

  • Lyric2Vocal: Specialized for vocal generation
  • Text2Samples: Instrument sample generation
  • RapMachine: Rap-specialized

LoRA training code is also available, so you can fine-tune it yourself.
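For readers unfamiliar with LoRA, the mechanism is small: instead of updating a frozen weight matrix W, you train two low-rank factors B and A and merge the scaled product back in. The sketch below shows that arithmetic in NumPy; it is a generic illustration of the LoRA update rule, not code from the ACE-Step trainer, and the shapes and `alpha` value are arbitrary.

```python
import numpy as np

def apply_lora(W, A, B, alpha):
    """Merge a LoRA update into a frozen weight matrix.

    W: (out, in) base weights, frozen during fine-tuning.
    A: (r, in) and B: (out, r) are the small trainable factors;
    alpha / r scales the low-rank update B @ A.
    """
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
A = rng.standard_normal((2, 4))
B = np.zeros((4, 2))  # B starts at zero, so the update is a no-op until trained
W_merged = apply_lora(W, A, B, alpha=16)
```

With rank r much smaller than the weight dimensions, only A and B need to be trained and shipped, which is why specialized adapters like RapMachine stay small.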

Installation

Prerequisites

  • Python 3.10 or later
  • NVIDIA GPU (CUDA) or Apple Silicon

Steps

git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
pip install -e .

Models are automatically downloaded to ~/.cache/ace-step/checkpoints on first launch.

macOS

Apple Silicon does not support bfloat16, so you need a flag at startup:

acestep --port 7865 --bf16 false

Usage

Launching the Gradio UI

acestep --port 7865

Open http://localhost:7865 in a browser to access the UI.

Main options

  • --checkpoint_path: Path to a custom model
  • --torch_compile: Enable optimization (requires triton)
  • --cpu_offload: VRAM-saving mode
  • --bf16: Use bfloat16 (must be false on macOS)
  • --share: Generate a public Gradio link

Prompt format

Lyrics and style are specified separately. The lyrics field uses section tags:

[verse]
Write lyrics here
La la la

[chorus]
Chorus lyrics

Style is specified in a separate field, e.g. “J-pop, female vocal, energetic”.

Performance by hardware

Official benchmark (27 steps, batch size 1):

  • RTX 4090: 34.48x real time (1.74 s per minute of audio)
  • A100: 27.27x (2.20 s per minute)
  • RTX 3090: 12.76x (4.70 s per minute)
  • M2 Max: 2.27x (26.43 s per minute)

Expected performance on M1 Max 64GB

Given that M2 Max takes 26 seconds per minute of audio, M1 Max should land around 30 seconds. 64GB of memory is more than enough.

At roughly 2-3x real-time speed, a 4-minute song would take just under 2 minutes to generate. Practical enough.
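The estimate above is simple division: wall-clock time is audio length divided by the real-time multiplier. A quick sanity check against the benchmark numbers:

```python
def generation_seconds(audio_seconds, realtime_multiplier):
    """Estimated wall-clock time to generate a clip, given a
    benchmark real-time multiplier (audio seconds per wall second)."""
    return audio_seconds / realtime_multiplier

# M2 Max from the official benchmark: 2.27x real time
one_min = generation_seconds(60, 2.27)        # ~26.4 s, matching the table
four_min_song = generation_seconds(240, 2.27) # ~106 s for a 4-minute track
```

So even on the slowest benchmarked hardware, a full-length song lands well under two minutes of wait time.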

Comparison with Suno and Udio

  • Runtime: local (ACE-Step) vs. cloud (Suno/Udio)
  • Cost: electricity only vs. a subscription
  • Commercial use: Apache 2.0 license vs. check each service's terms
  • Customization: LoRA fine-tuning vs. not available
  • Quality: still developing vs. high quality

Running locally is a significant advantage, especially when commercial use or customizability matters.

Quality-wise, ACE-Step has not yet caught up with Suno or Udio by most accounts. V1.5 is reportedly in development.

Roadmap

According to the official roadmap:

  • Done: Training code, LoRA training, RapMachine, technical report
  • Planned: ACE-Step V1.5, ControlNet training, Singing2Accompaniment

Once ControlNet support arrives, it should enable workflows like specifying a melody and generating accompaniment around it.