
ACE-Step 1.5: AI Music Generation Gets a Full Architecture Overhaul

Ikesan

V1.5 Is Out

Right after I looked into ACE-Step in the previous article, V1.5 was released on January 31, 2026. The architecture has been fundamentally redesigned, so I investigated again.

The license changed from Apache 2.0 to MIT. It explicitly supports commercial use, and the training data consists solely of legally compliant sources (licensed tracks, royalty-free content, and synthetic data).

Architecture Overhaul

V1.0 and V1.5 have fundamentally different architectures.

| Item | V1.0 | V1.5 |
| --- | --- | --- |
| Architecture | DCAE + Linear Transformer + Flow-matching | LM + DiT |
| Language Model | None | Qwen3-based (0.6B/1.7B/4B) |

In V1.5, the LM (Language Model) functions as a planner. It converts the user’s prompt into a “song blueprint” via Chain-of-Thought reasoning, then the DiT (Diffusion Transformer) synthesizes the actual audio — a two-stage pipeline.
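The two-stage flow can be sketched as follows. Every function and field name here is hypothetical, for illustration only — this is not the actual ACE-Step API:

```python
# Conceptual sketch of the V1.5 two-stage pipeline: the LM plans, the DiT renders.

def lm_plan(prompt: str) -> dict:
    """Stage 1: the Qwen3-based LM turns the prompt into a 'song blueprint'
    via Chain-of-Thought reasoning (structure, tempo, and so on)."""
    return {
        "prompt": prompt,
        "sections": ["intro", "verse", "chorus", "outro"],  # planned structure
        "bpm": 120,                                         # planned tempo
    }

def dit_synthesize(blueprint: dict) -> bytes:
    """Stage 2: the Diffusion Transformer renders audio from the blueprint.
    Placeholder bytes stand in for generated audio here."""
    return b"\x00" * 16

def generate(prompt: str) -> bytes:
    blueprint = lm_plan(prompt)       # plan first...
    return dit_synthesize(blueprint)  # ...then synthesize

audio = generate("upbeat synthpop with female vocals")
```

The point of the split is that prompt understanding and audio synthesis are handled by separate, specialized models rather than one monolithic network.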

Performance Comparison

| Item | V1.0 | V1.5 |
| --- | --- | --- |
| A100 speed | 2.2s for 1-min song | Under 2s for full track |
| RTX 3090 speed | 4.7s for 1-min song | Under 10s for full track |
| Language support | 19 languages | 50+ languages |
| Minimum VRAM | Not specified | Under 4GB |
| Max song length | 4 min | 10 min |
| Batch generation | Not specified | Up to 8 songs simultaneously |

Speed depends on the number of steps, so direct comparison is tricky, but overall performance has improved. The turbo model (8 steps) is especially fast.

Model Variants

V1.5 offers multiple models for different use cases.

DiT (Diffusion Transformer)

| Model | Steps | CFG | Quality | Diversity |
| --- | --- | --- | --- | --- |
| acestep-v15-base | 50 | Yes | Medium | High |
| acestep-v15-sft | 50 | Yes | High | Medium |
| acestep-v15-turbo | 8 | No | Very High | Medium |
| acestep-v15-turbo-rl | 8 | No | Very High | Medium |

LM (Language Model)

Choose based on available VRAM.

| VRAM | Recommended LM |
| --- | --- |
| 6GB or less | No LM (DiT only) |
| 6-12GB | acestep-5Hz-lm-0.6B |
| 12-16GB | acestep-5Hz-lm-1.7B |
| 16GB+ | acestep-5Hz-lm-4B |

It works without an LM, but prompt interpretation accuracy decreases.

New Features

Key features added since V1.0:

  • REST API server: Start with uv run acestep-api
  • Japanese UI: --language ja option
  • Automatic quality scoring: Auto-evaluates generation quality
  • Lyrics timestamp generation: Output in LRC format
  • Audio analysis: BPM, key extraction, caption generation
  • Track separation: Vocal and BGM separation
  • 1000+ instruments and styles supported
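On the lyrics timestamps: LRC is a simple text format that prefixes each lyric line with an [mm:ss.xx] timestamp. A minimal formatter showing the shape of that output (illustrative only — the exact output of ACE-Step's exporter may differ):

```python
# Format (seconds, lyric) pairs as LRC lines: [mm:ss.xx]lyric

def to_lrc(lines: list[tuple[float, str]]) -> str:
    out = []
    for seconds, text in lines:
        m, s = divmod(seconds, 60)
        out.append(f"[{int(m):02d}:{s:05.2f}]{text}")
    return "\n".join(out)

print(to_lrc([(12.5, "first line"), (75.0, "second line")]))
```

Most music players and karaoke apps can consume this format directly, which is why it is a convenient export target.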

Installation

The installation procedure has changed slightly from V1.0.

Prerequisites

  • Python 3.11 (V1.0 used 3.10)
  • uv package manager

Steps

```shell
git clone https://github.com/ACE-Step/ACE-Step-1.5.git
cd ACE-Step-1.5
uv sync
```

Launch

```shell
# Gradio UI (default port 7860)
uv run acestep

# Launch with Japanese UI
uv run acestep --language ja

# REST API server
uv run acestep-api
```

Models are automatically downloaded on first launch. To download manually:

```shell
uv run acestep-download --all
```

Mac (Apple Silicon) Notes

V1.0 had a “static noise” problem in MPS (Metal Performance Shaders) environments. Using data types other than float32 caused generation failures, outputting noise instead of music. This was addressed in PR #21 but wasn’t completely fixed.
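The V1.0-era workaround amounted to pinning everything to float32 whenever the MPS backend was in use. A minimal sketch of that rule (illustrative pure-Python logic, not ACE-Step's actual code; the float16-on-CUDA branch is my assumption):

```python
# Choose a safe dtype per device, per the V1.0 MPS noise issue.

def pick_dtype(device: str) -> str:
    if device == "mps":
        return "float32"  # anything else risked static-noise output on MPS
    if device == "cuda":
        return "float16"  # assumption: half precision is typically safe on CUDA
    return "float32"      # conservative CPU default
```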

V1.5’s README explicitly states “CPU/MPS supported,” but since it was just released (January 31), there’s little feedback from Mac users yet. If problems are frequent in testing, I plan to file an issue.

Migrating from V1.0

Since it’s a separate repository, V1.5 can coexist with an existing V1.0 installation. V1.0 LoRAs most likely won’t work as-is (due to the different architecture).

  • V1.0 install directory: ACE-Step/
  • V1.5 install directory: ACE-Step-1.5/

Model cache locations are also different, so keep an eye on storage capacity.