ACE-Step 1.5: AI Music Generation Gets a Full Architecture Overhaul
V1.5 Is Out
Right after I looked into ACE-Step in the previous article, V1.5 was released on January 31, 2026. The architecture has been fundamentally redesigned, so I investigated again.
- GitHub: ACE-Step-1.5 (separate repo from V1.0)
- arXiv paper
- Hugging Face
The license changed from Apache 2.0 to MIT. It explicitly supports commercial use, and the training data consists solely of legally compliant sources (licensed tracks, royalty-free content, and synthetic data).
Architecture Overhaul
V1.0 and V1.5 have fundamentally different architectures.
| Item | V1.0 | V1.5 |
|---|---|---|
| Architecture | DCAE + Linear Transformer + Flow-matching | LM + DiT |
| Language Model | None | Qwen3-based (0.6B/1.7B/4B) |
In V1.5, the LM (Language Model) functions as a planner. It converts the user’s prompt into a “song blueprint” via Chain-of-Thought reasoning, then the DiT (Diffusion Transformer) synthesizes the actual audio — a two-stage pipeline.
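The two-stage flow can be sketched in Python as follows. This is purely illustrative: the class and function names (`SongBlueprint`, `lm_plan`, `dit_synthesize`) are stand-ins of ours, not the actual ACE-Step API, and the bodies are stubs that only show how data moves between the stages.

```python
from dataclasses import dataclass, field

@dataclass
class SongBlueprint:
    """Structured plan the LM stage produces from the prompt (illustrative)."""
    style: str
    bpm: int
    sections: list = field(default_factory=list)  # e.g. ["intro", "verse", "chorus"]

def lm_plan(prompt: str) -> SongBlueprint:
    # Stage 1 (LM): the Qwen3-based planner reasons over the prompt via
    # Chain-of-Thought and emits a blueprint. Faked here as a fixed plan.
    return SongBlueprint(style="synthpop", bpm=120,
                         sections=["intro", "verse", "chorus"])

def dit_synthesize(blueprint: SongBlueprint, steps: int = 8) -> bytes:
    # Stage 2 (DiT): denoise audio latents conditioned on the blueprint.
    # Placeholder: return dummy bytes sized by sections * steps.
    return b"\x00" * (len(blueprint.sections) * steps)

plan = lm_plan("upbeat synthpop, female vocal")
audio = dit_synthesize(plan, steps=8)  # the turbo models use 8 steps
```

The key point is the separation of concerns: the LM never touches audio, and the DiT never sees the raw prompt, only the blueprint.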
Performance Comparison
| Item | V1.0 | V1.5 |
|---|---|---|
| A100 speed | 2.2s for 1-min song | Under 2s for full track |
| RTX 3090 speed | 4.7s for 1-min song | Under 10s for full track |
| Language support | 19 languages | 50+ languages |
| Minimum VRAM | Not specified | Under 4GB |
| Max song length | 4 min | 10 min |
| Batch generation | Not specified | Up to 8 songs simultaneously |
Speed depends on the number of steps, so direct comparison is tricky, but overall performance has improved. The turbo model (8 steps) is especially fast.
Model Variants
V1.5 offers multiple models for different use cases.
DiT (Diffusion Transformer)
| Model | Steps | CFG | Quality | Diversity |
|---|---|---|---|---|
| acestep-v15-base | 50 | Yes | Medium | High |
| acestep-v15-sft | 50 | Yes | High | Medium |
| acestep-v15-turbo | 8 | No | Very High | Medium |
| acestep-v15-turbo-rl | 8 | No | Very High | Medium |
LM (Language Model)
Choose based on available VRAM.
| VRAM | Recommended LM |
|---|---|
| 6GB or less | No LM (DiT only) |
| 6-12GB | acestep-5Hz-lm-0.6B |
| 12-16GB | acestep-5Hz-lm-1.7B |
| 16GB+ | acestep-5Hz-lm-4B |
It works without an LM, but prompt interpretation accuracy decreases.
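The table above amounts to a simple threshold lookup. The helper below is our own sketch of that selection logic (the function name is hypothetical; the thresholds and model names follow the table):

```python
from typing import Optional

def recommend_lm(vram_gb: float) -> Optional[str]:
    """Map available VRAM (GB) to the recommended LM variant.

    Returns None when the DiT should run without an LM.
    """
    if vram_gb <= 6:
        return None  # DiT only; prompt interpretation is less accurate
    if vram_gb <= 12:
        return "acestep-5Hz-lm-0.6B"
    if vram_gb <= 16:
        return "acestep-5Hz-lm-1.7B"
    return "acestep-5Hz-lm-4B"
```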
New Features
Key features added since V1.0:
- REST API server: start with `uv run acestep-api`
- Japanese UI: `--language ja` option
- Automatic quality scoring: auto-evaluates generation quality
- Lyrics timestamp generation: Output in LRC format
- Audio analysis: BPM, key extraction, caption generation
- Track separation: Vocal and BGM separation
- 1000+ instruments and styles supported
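The LRC output mentioned in the list above is a plain-text format that pairs a `[mm:ss.xx]` timestamp with each lyric line. As a quick illustration of the format (this formatter is our own sketch, not ACE-Step code):

```python
def to_lrc(lines):
    """Format (seconds, lyric) pairs as LRC '[mm:ss.xx]lyric' lines."""
    out = []
    for t, text in lines:
        minutes, seconds = divmod(t, 60)
        out.append(f"[{int(minutes):02d}:{seconds:05.2f}]{text}")
    return "\n".join(out)

print(to_lrc([(12.5, "First line"), (75.0, "Second line")]))
# [00:12.50]First line
# [01:15.00]Second line
```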
Installation
Some changes from V1.0.
Prerequisites
- Python 3.11 (V1.0 used 3.10)
- uv package manager
Steps
```shell
git clone https://github.com/ACE-Step/ACE-Step-1.5.git
cd ACE-Step-1.5
uv sync
```
Launch
```shell
# Gradio UI (default port 7860)
uv run acestep

# Launch with Japanese UI
uv run acestep --language ja

# REST API server
uv run acestep-api
```
Models are automatically downloaded on first launch. To download manually:
```shell
uv run acestep-download --all
```
Mac (Apple Silicon) Notes
V1.0 had a “static noise” problem in MPS (Metal Performance Shaders) environments. Using data types other than float32 caused generation failures, outputting noise instead of music. This was addressed in PR #21 but wasn’t completely fixed.
V1.5’s README explicitly states “CPU/MPS supported,” but since it was just released (January 31), there’s little feedback from Mac users yet. If problems are frequent in testing, I plan to file an issue.
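The practical workaround for the V1.0 issue was to pin float32 whenever the compute device is MPS. A minimal device-to-dtype helper in that spirit (our own sketch, not ACE-Step code; dtypes are returned as strings to keep the example dependency-free):

```python
def safe_dtype(device: str) -> str:
    """Pick a compute dtype per device.

    On MPS, half-precision dtypes caused V1.0 to emit static noise
    instead of music, so float32 is forced there.
    """
    if device == "mps":
        return "float32"  # anything else risked noise output in V1.0
    if device == "cuda":
        return "float16"  # half precision is typically safe on CUDA
    return "float32"      # CPU default
```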
Migrating from V1.0
Since it’s a separate repository, V1.5 can coexist with an existing V1.0 installation. V1.0 LoRAs most likely won’t work as-is (due to the different architecture).
- V1.0 install directory: `ACE-Step/`
- V1.5 install directory: `ACE-Step-1.5/`
Model cache locations are also different, so keep an eye on storage capacity.