
Z-Image — Alibaba’s image generator that reportedly surpasses FLUX

On January 28, 2026, Alibaba’s Tongyi-MAI team released the foundation model for image generation, Z-Image. It is the base model behind Z-Image-Turbo, which had been released earlier in November 2025. The weights are available on Hugging Face under the Apache 2.0 license.

Comments like “it knocked FLUX off the throne” have been circulating in overseas communities, so I looked into it.

Core specs of Z-Image

  • Architecture: Single-Stream Diffusion Transformer (S3-DiT)
  • Parameter count: 6 billion (6B)
  • Supported resolution: 512×512 to 2048×2048 (arbitrary aspect ratio)
  • Inference steps: 28–50
  • Guidance scale: 3.0–5.0
  • Minimum VRAM: 6 GB with quantization (runs even on RTX 2060-class GPUs)
  • License: Apache 2.0 (commercial use permitted)
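Since arbitrary aspect ratios are supported within the 512×512–2048×2048 range, you usually want to pick a width and height near a fixed pixel budget. The sketch below assumes dimensions must be multiples of 16 (a common patchification constraint; check the model card for the actual value):

```python
def snap_resolution(aspect_ratio: float, target_pixels: int = 1024 * 1024,
                    multiple: int = 16, lo: int = 512, hi: int = 2048):
    """Pick (width, height) near a pixel budget for a given aspect ratio,
    snapped to a multiple and clamped to the supported 512-2048 range.

    The multiple-of-16 constraint is an assumption, not from the model card.
    """
    height = (target_pixels / aspect_ratio) ** 0.5
    width = height * aspect_ratio
    snap = lambda v: max(lo, min(hi, round(v / multiple) * multiple))
    return snap(width), snap(height)

print(snap_resolution(16 / 9))  # widescreen at ~1 MP → (1360, 768)
print(snap_resolution(1.0))     # square → (1024, 1024)
```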

Whereas FLUX adopts a Hybrid-Stream DiT (processing text and image separately before fusing), Z-Image uses a Single-Stream approach that processes text embeddings and the noisy image together from the start. This contributes to its strong parameter efficiency.
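The difference can be sketched at the shape level: a single-stream design concatenates text and image tokens into one sequence and runs a single set of weights over both, while a hybrid/dual-stream design applies per-modality weights before fusing. This is a toy illustration (NumPy, made-up dimensions), not the actual S3-DiT code:

```python
import numpy as np

d_model = 64                                   # toy embedding width
text_tokens = np.random.randn(77, d_model)     # text embeddings
image_tokens = np.random.randn(256, d_model)   # noisy latent patches (16x16)

# Single-stream (Z-Image style): one joint sequence, one weight matrix,
# so every layer attends over text and image together from the start.
joint = np.concatenate([text_tokens, image_tokens], axis=0)
W_joint = np.random.randn(d_model, d_model) / np.sqrt(d_model)
out_single = joint @ W_joint

# Hybrid-stream (FLUX style): separate weights per modality, fused later,
# which roughly doubles the per-layer parameter count.
W_text = np.random.randn(d_model, d_model) / np.sqrt(d_model)
W_img = np.random.randn(d_model, d_model) / np.sqrt(d_model)
fused = np.concatenate([text_tokens @ W_text, image_tokens @ W_img], axis=0)

print(out_single.shape, fused.shape)  # both (333, 64)
```

Sharing one set of weights across modalities is one reason a 6B single-stream model can compete with larger dual-stream ones.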

The Z-Image lineup

Z-Image comes in four models:

| Model | Use case |
| --- | --- |
| Z-Image | Base model. Suitable for fine-tuning and training LoRAs. |
| Z-Image-Turbo | Z-Image distilled and RLHF-tuned for speed. Can generate in 8 steps. |
| Z-Image-Omni-Base | Foundation model with multimodal support. |
| Z-Image-Edit | Instruction-based image editing model. |

Z-Image vs Z-Image-Turbo

They are in the same family but quite different in character.

|  | Z-Image | Z-Image-Turbo |
| --- | --- | --- |
| Inference steps | 28–50 | 8 |
| Negative prompts | Supported | Not supported |
| Diversity | High | Slightly lower |
| Fine-tuning suitability | High (LoRA, ControlNet) | Low |
| Image quality | High | Very high |
| Primary use | Custom model development, research | Fast image generation |

Turbo is a distilled model: it's fast, but that speed comes at the cost of customizability. If you want to train LoRAs or use ControlNet, Z-Image is the clear choice.

Z-Image vs FLUX vs Stable Diffusion 3.5

Now for the main comparison: the major open-source image generators as of January 2026.

|  | Z-Image | FLUX.1 | SD 3.5 |
| --- | --- | --- | --- |
| Developer | Alibaba (Tongyi-MAI) | Black Forest Labs | Stability AI |
| Parameters | 6B | 12B | 8B |
| Architecture | Single-Stream DiT | Hybrid-Stream DiT | MM-DiT |
| Minimum VRAM | 6 GB (quantized) | 24 GB+ | 12 GB+ |
| License | Apache 2.0 | Dev: noncommercial / Pro: commercial | Stability AI Community |
| CFG | Fully supported | Dev: not supported | Supported |
| Negative prompts | Supported | Dev: not supported / Pro: supported | Supported |
| Elo ranking (AI Arena) | #1 among open source | Lower | Not listed |

Dramatically lighter VRAM requirements

This is Z-Image’s biggest strength. FLUX needs 24 GB or more for the full model—and around 12 GB even when quantized—so it’s tough on an RTX 3060 or 4060. Z-Image can be quantized down to about 6 GB, enabling image generation in roughly 30 seconds even on RTX 2060-class GPUs.
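The back-of-the-envelope arithmetic, assuming weights dominate memory: 6B parameters at bf16 (2 bytes each) is roughly 11 GB, and 4-bit quantization cuts that to under 3 GB, leaving headroom for activations and the text encoder on low-VRAM cards:

```python
def weight_footprint_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 2**30 bytes).

    Ignores activations, KV caches, and the text encoder, so real
    usage is somewhat higher than these figures.
    """
    return params_b * 1e9 * bits_per_param / 8 / 2**30

for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"6B @ {name}: {weight_footprint_gb(6, bits):.1f} GB")
# 6B @ bf16: 11.2 GB
# 6B @ int8: 5.6 GB
# 6B @ int4: 2.8 GB
```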

Benchmarks

In Alibaba’s AI Arena Elo-based evaluation, Z-Image-Turbo ranks first among open-source models. It outperforms GPT Image 1 (OpenAI), FLUX.1 Kontext Pro, and Ideogram 3.0, placing fourth globally behind Google Imagen 4 and ByteDance Seedream.
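For context on what an Elo ranking means here: a rating gap of R points between two models corresponds to an expected head-to-head win rate of 1/(1 + 10^(−R/400)) in pairwise preference votes, the same logistic formula used in chess. A quick sketch (the rating gaps are illustrative, not Alibaba's published numbers):

```python
def expected_win_rate(rating_diff: float) -> float:
    """Expected win probability for the higher-rated model,
    given an Elo rating gap (standard logistic formula)."""
    return 1 / (1 + 10 ** (-rating_diff / 400))

print(f"{expected_win_rate(0):.2f}")    # 0.50: equal ratings
print(f"{expected_win_rate(100):.2f}")  # 0.64: a 100-point lead
```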

The base Z-Image is also claimed by the authors to offer “performance comparable to models 10× larger,” and it does appear outstanding in parameter efficiency.

Ecosystem still immature

There are weaknesses too. Compared with Stable Diffusion and FLUX, there are far fewer third-party tools, community models, and tutorials. That said, there have been reports that the pace of LoRA creation has surpassed FLUX since release, so this gap may close quickly.

Getting started

diffusers (Python)

import torch
from diffusers import ZImagePipeline

# Load the base model in bfloat16 (full-precision weights need ~12 GB VRAM)
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# The base model supports CFG: 28-50 steps and guidance 3.0-5.0 are typical
image = pipe(
    prompt="a cat sitting on a windowsill at sunset",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(42),  # reproducible seed
).images[0]

image.save("output.png")

ComfyUI

ComfyUI has native support from day one. You can install it via ComfyUI Manager.
