ACE-Step 1.5: AI Music Generation Gets a Full Architecture Overhaul
V1.5 Is Out
Right after I looked into ACE-Step in the previous article, V1.5 was released on January 31, 2026. The architecture has been fundamentally redesigned, so I investigated again.
- GitHub: ACE-Step-1.5 (separate repo from V1.0)
- arXiv paper
- Hugging Face
The license changed from Apache 2.0 to MIT. It explicitly supports commercial use, and the training data consists solely of legally compliant sources (licensed tracks, royalty-free content, and synthetic data).
Architecture Overhaul
V1.0 and V1.5 have fundamentally different architectures.
| Item | V1.0 | V1.5 |
|---|---|---|
| Architecture | DCAE + Linear Transformer + Flow-matching | LM + DiT |
| Language Model | None | Qwen3-based (0.6B/1.7B/4B) |
In V1.5, the LM (Language Model) functions as a planner. It converts the user’s prompt into a “song blueprint” via Chain-of-Thought reasoning, then the DiT (Diffusion Transformer) synthesizes the actual audio — a two-stage pipeline.
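The two-stage flow can be sketched in Python as follows. This is purely illustrative: the class and function names (`SongBlueprint`, `lm_plan`, `dit_synthesize`) are stand-ins of ours, not the actual ACE-Step API, and the bodies are stubs that only show how data moves between the stages.

```python
from dataclasses import dataclass, field

@dataclass
class SongBlueprint:
    """Structured plan the LM stage produces from the prompt (illustrative)."""
    style: str
    bpm: int
    sections: list = field(default_factory=list)  # e.g. ["intro", "verse", "chorus"]

def lm_plan(prompt: str) -> SongBlueprint:
    # Stage 1 (LM): the Qwen3-based planner reasons over the prompt via
    # Chain-of-Thought and emits a blueprint. Faked here as a fixed plan.
    return SongBlueprint(style="synthpop", bpm=120,
                         sections=["intro", "verse", "chorus"])

def dit_synthesize(blueprint: SongBlueprint, steps: int = 8) -> bytes:
    # Stage 2 (DiT): denoise audio latents conditioned on the blueprint.
    # Placeholder: return dummy bytes sized by sections * steps.
    return b"\x00" * (len(blueprint.sections) * steps)

plan = lm_plan("upbeat synthpop, female vocal")
audio = dit_synthesize(plan, steps=8)  # the turbo models use 8 steps
```

The key point is the separation of concerns: the LM never touches audio, and the DiT never sees the raw prompt, only the blueprint.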
Performance Comparison
| Item | V1.0 | V1.5 |
|---|---|---|
| A100 speed | 2.2s for 1-min song | Under 2s for full track |
| RTX 3090 speed | 4.7s for 1-min song | Under 10s for full track |
| Language support | 19 languages | 50+ languages |
| Minimum VRAM | Not specified | Under 4GB |
| Max song length | 4 min | 10 min |
| Batch generation | Not specified | Up to 8 songs simultaneously |
Speed depends on the number of steps, so direct comparison is tricky, but overall performance has improved. The turbo model (8 steps) is especially fast.
Model Variants
V1.5 offers multiple models for different use cases.
DiT (Diffusion Transformer)
| Model | Steps | CFG | Quality | Diversity |
|---|---|---|---|---|
| acestep-v15-base | 50 | Yes | Medium | High |
| acestep-v15-sft | 50 | Yes | High | Medium |
| acestep-v15-turbo | 8 | No | Very High | Medium |
| acestep-v15-turbo-rl | 8 | No | Very High | Medium |
LM (Language Model)
Choose based on available VRAM.
| VRAM | Recommended LM |
|---|---|
| 6GB or less | No LM (DiT only) |
| 6-12GB | acestep-5Hz-lm-0.6B |
| 12-16GB | acestep-5Hz-lm-1.7B |
| 16GB+ | acestep-5Hz-lm-4B |
It works without an LM, but prompt interpretation accuracy decreases.
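The table above amounts to a simple threshold lookup. The helper below is our own sketch of that selection logic (the function name is hypothetical; the thresholds and model names follow the table):

```python
from typing import Optional

def recommend_lm(vram_gb: float) -> Optional[str]:
    """Map available VRAM (GB) to the recommended LM variant.

    Returns None when the DiT should run without an LM.
    """
    if vram_gb <= 6:
        return None  # DiT only; prompt interpretation is less accurate
    if vram_gb <= 12:
        return "acestep-5Hz-lm-0.6B"
    if vram_gb <= 16:
        return "acestep-5Hz-lm-1.7B"
    return "acestep-5Hz-lm-4B"
```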
New Features
Key features added since V1.0:
- REST API server: start with `uv run acestep-api`
- Japanese UI: `--language ja` option
- Automatic quality scoring: auto-evaluates generation quality
- Lyrics timestamp generation: Output in LRC format
- Audio analysis: BPM, key extraction, caption generation
- Track separation: Vocal and BGM separation
- 1000+ instruments and styles supported
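The LRC output mentioned in the list above is a plain-text format that pairs a `[mm:ss.xx]` timestamp with each lyric line. As a quick illustration of the format (this formatter is our own sketch, not ACE-Step code):

```python
def to_lrc(lines):
    """Format (seconds, lyric) pairs as LRC '[mm:ss.xx]lyric' lines."""
    out = []
    for t, text in lines:
        minutes, seconds = divmod(t, 60)
        out.append(f"[{int(minutes):02d}:{seconds:05.2f}]{text}")
    return "\n".join(out)

print(to_lrc([(12.5, "First line"), (75.0, "Second line")]))
# [00:12.50]First line
# [01:15.00]Second line
```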
Installation
Some changes from V1.0.
Prerequisites
- Python 3.11 (V1.0 used 3.10)
- uv package manager
Steps
```shell
git clone https://github.com/ACE-Step/ACE-Step-1.5.git
cd ACE-Step-1.5
uv sync
```
Launch
```shell
# Gradio UI (default port 7860)
uv run acestep

# Launch with Japanese UI
uv run acestep --language ja

# REST API server
uv run acestep-api
```
Models are automatically downloaded on first launch. To download manually:
```shell
uv run acestep-download --all
```
Mac (Apple Silicon) Notes
V1.0 had a “static noise” problem in MPS (Metal Performance Shaders) environments. Using data types other than float32 caused generation failures, outputting noise instead of music. This was addressed in PR #21 but wasn’t completely fixed.
V1.5’s README explicitly states “CPU/MPS supported,” but since it was just released (January 31), there’s little feedback from Mac users yet. If problems are frequent in testing, I plan to file an issue.
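The practical workaround for the V1.0 issue was to pin float32 whenever the compute device is MPS. A minimal device-to-dtype helper in that spirit (our own sketch, not ACE-Step code; dtypes are returned as strings to keep the example dependency-free):

```python
def safe_dtype(device: str) -> str:
    """Pick a compute dtype per device.

    On MPS, half-precision dtypes caused V1.0 to emit static noise
    instead of music, so float32 is forced there.
    """
    if device == "mps":
        return "float32"  # anything else risked noise output in V1.0
    if device == "cuda":
        return "float16"  # half precision is typically safe on CUDA
    return "float32"      # CPU default
```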
Migrating from V1.0
Since it’s a separate repository, V1.5 can coexist with an existing V1.0 installation. V1.0 LoRAs most likely won’t work as-is (due to the different architecture).
- V1.0 install directory: `ACE-Step/`
- V1.5 install directory: `ACE-Step-1.5/`
Model cache locations are also different, so keep an eye on storage capacity.