
Z-Image — Alibaba’s image generator that reportedly surpasses FLUX

On January 28, 2026, Alibaba’s Tongyi-MAI team released the foundation model for image generation, Z-Image. It is the base model behind Z-Image-Turbo, which had been released earlier in November 2025. The weights are available on Hugging Face under the Apache 2.0 license.

Comments like “it knocked FLUX off the throne” have been circulating in overseas communities, so I looked into it.

Core specs of Z-Image

  • Architecture: Single-Stream Diffusion Transformer (S3-DiT)
  • Parameter count: 6 billion (6B)
  • Supported resolution: 512×512 to 2048×2048 (arbitrary aspect ratio)
  • Inference steps: 28–50
  • Guidance scale: 3.0–5.0
  • Minimum VRAM: 6 GB with quantization (runs even on RTX 2060-class GPUs)
  • License: Apache 2.0 (commercial use permitted)
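Since arbitrary aspect ratios are supported within the 512×512–2048×2048 range, you usually want to pick a width and height near a fixed pixel budget. The sketch below assumes dimensions must be multiples of 16 (a common patchification constraint; check the model card for the actual value):

```python
def snap_resolution(aspect_ratio: float, target_pixels: int = 1024 * 1024,
                    multiple: int = 16, lo: int = 512, hi: int = 2048):
    """Pick (width, height) near a pixel budget for a given aspect ratio,
    snapped to a multiple and clamped to the supported 512-2048 range.

    The multiple-of-16 constraint is an assumption, not from the model card.
    """
    height = (target_pixels / aspect_ratio) ** 0.5
    width = height * aspect_ratio
    snap = lambda v: max(lo, min(hi, round(v / multiple) * multiple))
    return snap(width), snap(height)

print(snap_resolution(16 / 9))  # widescreen at ~1 MP → (1360, 768)
print(snap_resolution(1.0))     # square → (1024, 1024)
```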

Whereas FLUX adopts a Hybrid-Stream DiT (processing text and image separately before fusing), Z-Image uses a Single-Stream approach that processes text embeddings and the noisy image together from the start. This contributes to its strong parameter efficiency.
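The difference can be sketched at the shape level: a single-stream design concatenates text and image tokens into one sequence and runs a single set of weights over both, while a hybrid/dual-stream design applies per-modality weights before fusing. This is a toy illustration (NumPy, made-up dimensions), not the actual S3-DiT code:

```python
import numpy as np

d_model = 64                                   # toy embedding width
text_tokens = np.random.randn(77, d_model)     # text embeddings
image_tokens = np.random.randn(256, d_model)   # noisy latent patches (16x16)

# Single-stream (Z-Image style): one joint sequence, one weight matrix,
# so every layer attends over text and image together from the start.
joint = np.concatenate([text_tokens, image_tokens], axis=0)
W_joint = np.random.randn(d_model, d_model) / np.sqrt(d_model)
out_single = joint @ W_joint

# Hybrid-stream (FLUX style): separate weights per modality, fused later,
# which roughly doubles the per-layer parameter count.
W_text = np.random.randn(d_model, d_model) / np.sqrt(d_model)
W_img = np.random.randn(d_model, d_model) / np.sqrt(d_model)
fused = np.concatenate([text_tokens @ W_text, image_tokens @ W_img], axis=0)

print(out_single.shape, fused.shape)  # both (333, 64)
```

Sharing one set of weights across modalities is one reason a 6B single-stream model can compete with larger dual-stream ones.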

The Z-Image lineup

Z-Image comes in four models:

| Model | Use case |
| --- | --- |
| Z-Image | Base model. Suitable for fine-tuning and training LoRAs. |
| Z-Image-Turbo | Z-Image distilled and RLHF-tuned for speed. Can generate in 8 steps. |
| Z-Image-Omni-Base | Foundation model with multimodal support. |
| Z-Image-Edit | Instruction-based image editing model. |

Z-Image vs Z-Image-Turbo

They are in the same family but quite different in character.

|  | Z-Image | Z-Image-Turbo |
| --- | --- | --- |
| Inference steps | 28–50 | 8 |
| Negative prompts | Supported | Not supported |
| Diversity | High | Slightly lower |
| Fine-tuning suitability | High (LoRA, ControlNet) | Low |
| Image quality | High | Very high |
| Primary use | Custom model development, research | Fast image generation |

Turbo is a distilled model: it's fast, but that speed comes at the cost of customizability. If you want to train LoRAs or use ControlNet, Z-Image is the clear choice.

Z-Image vs FLUX vs Stable Diffusion 3.5

Now for the main comparison: the major open-source image generators as of January 2026.

|  | Z-Image | FLUX.1 | SD 3.5 |
| --- | --- | --- | --- |
| Developer | Alibaba (Tongyi-MAI) | Black Forest Labs | Stability AI |
| Parameters | 6B | 12B | 8B |
| Architecture | Single-Stream DiT | Hybrid-Stream DiT | MM-DiT |
| Minimum VRAM | 6 GB (quantized) | 24 GB+ | 12 GB+ |
| License | Apache 2.0 | Dev: noncommercial / Pro: commercial | Stability AI Community |
| CFG | Fully supported | Dev: not supported | Supported |
| Negative prompts | Supported | Dev: not supported / Pro: supported | Supported |
| Elo ranking (AI Arena) | #1 among open source | Lower | Not listed |

Dramatically lighter VRAM requirements

This is Z-Image’s biggest strength. FLUX needs 24 GB or more for the full model—and around 12 GB even when quantized—so it’s tough on an RTX 3060 or 4060. Z-Image can be quantized down to about 6 GB, enabling image generation in roughly 30 seconds even on RTX 2060-class GPUs.
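The back-of-the-envelope arithmetic, assuming weights dominate memory: 6B parameters at bf16 (2 bytes each) is roughly 11 GB, and 4-bit quantization cuts that to under 3 GB, leaving headroom for activations and the text encoder on low-VRAM cards:

```python
def weight_footprint_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 2**30 bytes).

    Ignores activations, KV caches, and the text encoder, so real
    usage is somewhat higher than these figures.
    """
    return params_b * 1e9 * bits_per_param / 8 / 2**30

for name, bits in [("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"6B @ {name}: {weight_footprint_gb(6, bits):.1f} GB")
# 6B @ bf16: 11.2 GB
# 6B @ int8: 5.6 GB
# 6B @ int4: 2.8 GB
```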

Benchmarks

In Alibaba’s AI Arena Elo-based evaluation, Z-Image-Turbo ranks first among open-source models. It outperforms GPT Image 1 (OpenAI), FLUX.1 Kontext Pro, and Ideogram 3.0, placing fourth globally behind Google Imagen 4 and ByteDance Seedream.
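For context on what an Elo ranking means here: a rating gap of R points between two models corresponds to an expected head-to-head win rate of 1/(1 + 10^(−R/400)) in pairwise preference votes, the same logistic formula used in chess. A quick sketch (the rating gaps are illustrative, not Alibaba's published numbers):

```python
def expected_win_rate(rating_diff: float) -> float:
    """Expected win probability for the higher-rated model,
    given an Elo rating gap (standard logistic formula)."""
    return 1 / (1 + 10 ** (-rating_diff / 400))

print(f"{expected_win_rate(0):.2f}")    # 0.50: equal ratings
print(f"{expected_win_rate(100):.2f}")  # 0.64: a 100-point lead
```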

The base Z-Image is also claimed by the authors to offer “performance comparable to models 10× larger,” and it does appear outstanding in parameter efficiency.

Ecosystem still immature

There are weaknesses too. Compared with Stable Diffusion and FLUX, there are far fewer third-party tools, community models, and tutorials. That said, there have been reports that the pace of LoRA creation has surpassed FLUX since release, so this gap may close quickly.

Getting started

diffusers (Python)

import torch
from diffusers import ZImagePipeline

# Load the base model in bfloat16 (full-precision weights need ~12 GB VRAM)
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# The base model supports CFG: 28-50 steps and guidance 3.0-5.0 are typical
image = pipe(
    prompt="a cat sitting on a windowsill at sunset",
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=4.0,
    generator=torch.Generator("cuda").manual_seed(42),  # reproducible seed
).images[0]

image.save("output.png")

ComfyUI

ComfyUI has native support from day one. You can install it via ComfyUI Manager.
