
ACE-Step: an open-source music generation model that runs locally

Ikesan

What is ACE-Step?

ACE-Step is an open-source foundation model for music generation. It bills itself as the “Stable Diffusion of music” and can generate songs from text prompts.

The standout feature is speed. On an A100, it can generate up to 4 minutes of music in about 20 seconds. Compared to LLM-based music generation (roughly what Suno and Udio are believed to use internally), that is about 15x faster.

Licensed under Apache 2.0, it has over 3,800 stars on GitHub.

Architecture

The technical stack is a combination of:

  • Deep Compression AutoEncoder (DCAE): Audio compression derived from Sana
  • Lightweight Linear Transformer: Enables fast inference
  • Flow-matching: A variant of diffusion models
  • MERT / m-hubert: Semantic alignment during training

Think of it as analogous to the VAE + U-Net + CLIP architecture in image generation AI.
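To make the flow-matching piece concrete, here is a toy sketch of the idea: the model learns a velocity field along a straight path between noise and data, then integrates that field at inference time. This is a generic illustration of rectified-flow-style training targets, not ACE-Step's actual code; the array shapes and values are made up.

```python
import numpy as np

def flow_matching_step(x0, x1, t):
    """Linear interpolation path used in flow matching.

    x0: a noise sample, x1: a data sample, t in [0, 1].
    Returns the point on the path and the target velocity the
    network would be trained to predict (here simply x1 - x0).
    """
    xt = (1.0 - t) * x0 + t * x1   # point along the straight noise-to-data path
    v_target = x1 - x0             # constant target velocity along that path
    return xt, v_target

# Toy example: a length-4 vector standing in for an audio latent
rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)              # pure noise
x1 = np.array([1.0, 0.5, -0.5, 0.0])     # "data"
xt, v = flow_matching_step(x0, x1, 0.5)  # halfway along the path
```

Because the target velocity is constant along the path, sampling needs far fewer integration steps than classic diffusion, which is part of why generation is so fast.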

Key features

Basics

  • Up to 4 minutes of music per generation
  • 19 languages supported (strong performance in 10 major languages including Japanese)
  • Both vocal and instrumental output
  • Wide range of genres and styles

Control features

  • Variations: Generate variations via noise mixing
  • Repainting: Regenerate only specific sections
  • Lyric editing: Change lyrics while preserving melody and accompaniment
  • Audio extension: Extend existing tracks

LoRA support

  • Lyric2Vocal: Specialized for vocal generation
  • Text2Samples: Instrument sample generation
  • RapMachine: Rap-specialized

LoRA training code is also available, so you can fine-tune it yourself.
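For readers unfamiliar with LoRA, the mechanism is small: instead of updating a frozen weight matrix W, you train two low-rank factors B and A and merge the scaled product back in. The sketch below shows that arithmetic in NumPy; it is a generic illustration of the LoRA update rule, not code from the ACE-Step trainer, and the shapes and `alpha` value are arbitrary.

```python
import numpy as np

def apply_lora(W, A, B, alpha):
    """Merge a LoRA update into a frozen weight matrix.

    W: (out, in) base weights, frozen during fine-tuning.
    A: (r, in) and B: (out, r) are the small trainable factors;
    alpha / r scales the low-rank update B @ A.
    """
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
A = rng.standard_normal((2, 4))
B = np.zeros((4, 2))  # B starts at zero, so the update is a no-op until trained
W_merged = apply_lora(W, A, B, alpha=16)
```

With rank r much smaller than the weight dimensions, only A and B need to be trained and shipped, which is why specialized adapters like RapMachine stay small.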

Installation

Prerequisites

  • Python 3.10 or later
  • NVIDIA GPU (CUDA) or Apple Silicon

Steps

git clone https://github.com/ace-step/ACE-Step.git
cd ACE-Step
pip install -e .

Models are automatically downloaded to ~/.cache/ace-step/checkpoints on first launch.

macOS

Apple Silicon does not support bfloat16, so you need a flag at startup:

acestep --port 7865 --bf16 false

Usage

Launching the Gradio UI

acestep --port 7865

Open http://localhost:7865 in a browser to access the UI.

Main options

  • --checkpoint_path: Path to a custom model
  • --torch_compile: Enable optimization (requires triton)
  • --cpu_offload: VRAM-saving mode
  • --bf16: Use bfloat16 (must be false on macOS)
  • --share: Generate a public Gradio link

Prompt format

Lyrics and style are specified separately. The lyrics field uses section tags:

[verse]
Write lyrics here
La la la

[chorus]
Chorus lyrics

Style is specified in a separate field, e.g. “J-pop, female vocal, energetic”.

Performance by hardware

Official benchmark (27 steps, batch size 1):

  • RTX 4090: 34.48x real time (1.74 s per minute of audio)
  • A100: 27.27x (2.20 s per minute)
  • RTX 3090: 12.76x (4.70 s per minute)
  • M2 Max: 2.27x (26.43 s per minute)

Expected performance on M1 Max 64GB

Given that M2 Max takes 26 seconds per minute of audio, M1 Max should land around 30 seconds. 64GB of memory is more than enough.

At roughly 2-3x real-time speed, a 4-minute song would take just under 2 minutes to generate. Practical enough.
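The estimate above is simple division: wall-clock time is audio length divided by the real-time multiplier. A quick sanity check against the benchmark numbers:

```python
def generation_seconds(audio_seconds, realtime_multiplier):
    """Estimated wall-clock time to generate a clip, given a
    benchmark real-time multiplier (audio seconds per wall second)."""
    return audio_seconds / realtime_multiplier

# M2 Max from the official benchmark: 2.27x real time
one_min = generation_seconds(60, 2.27)        # ~26.4 s, matching the table
four_min_song = generation_seconds(240, 2.27) # ~106 s for a 4-minute track
```

So even on the slowest benchmarked hardware, a full-length song lands well under two minutes of wait time.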

Comparison with Suno and Udio

  • Runtime: local (ACE-Step) vs. cloud (Suno/Udio)
  • Cost: electricity only vs. a subscription
  • Commercial use: Apache 2.0 license vs. check each service's terms
  • Customization: LoRA fine-tuning vs. not available
  • Quality: still developing vs. high quality

Running locally is a significant advantage, especially when commercial use or customizability matters.

Quality-wise, ACE-Step has not yet caught up with Suno or Udio by most accounts. V1.5 is reportedly in development.

Roadmap

According to the official roadmap:

  • Done: Training code, LoRA training, RapMachine, technical report
  • Planned: ACE-Step V1.5, ControlNet training, Singing2Accompaniment

Once ControlNet support arrives, it should enable workflows like specifying a melody and generating accompaniment around it.