
Can Z-Image run on RunPod? I checked it for character consistency

Ikesan

Why I looked into this

I had just run Qwen-Image-Edit on RunPod when, on January 28, 2026, the full version of Z-Image was released by the same Tongyi lab at Alibaba.

What stood out in the samples was how stable the character shapes looked. When I generate illustrations with Qwen-Image-Edit, faces and body shapes sometimes drift. Z-Image is said to support negative prompts, a wider CFG range, and strong LoRA compatibility, so I wondered if it could produce stable manga-style art while keeping the character consistent.

So I checked whether it would run as-is in my current RunPod + ComfyUI setup.

What Z-Image is

Z-Image is a 6B parameter image-generation foundation model developed by Alibaba Tongyi-MAI. It is released under Apache 2.0, so it is commercially usable.

Model lineup

| Model | Purpose | Notes |
| --- | --- | --- |
| Z-Image | Base model | High quality, high diversity, negative prompts, LoRA support |
| Z-Image-Turbo | Fast generation | 8 steps, sub-second inference |
| Z-Image-Omni-Base | Unified base | Generation + editing, fine-tuning target |
| Z-Image-Edit | Image editing | Editing-focused |

Architecture

  • S3-DiT (Scalable Single-Stream Diffusion Transformer)
  • Flow Matching-based
  • Text encoder: Qwen 3 4B
  • Achieves quality comparable to much larger models such as FLUX with only 6B parameters
  • Ranked #1 among open-source models on the Artificial Analysis leaderboard

Comparison with Qwen-Image-Edit

Here are the differences that matter for character consistency:

| | Z-Image | Qwen-Image-Edit |
| --- | --- | --- |
| Parameters | 6B | 7B (Qwen2.5-VL-7B) |
| Negative prompts | Supported | Supported |
| CFG | 3.0-5.0 | 1.0 |
| LoRA support | Officially supported | Possible, but the setup is more complex |
| ControlNet | Supported (Union version available) | Limited |
| Use case | txt2img / img2img | img2img / image editing |
| Style control | Switch with cfg_normalization | Prompt-driven |

CFG and negative prompt differences

Qwen-Image-Edit is fixed at CFG 1.0, so negative prompts have little effect. Z-Image lets you tune CFG between 3.0 and 5.0, and negative prompts work properly. Whether prompts like bad anatomy, deformed actually take effect matters a lot for character drift, so this is a big deal.
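To see why CFG 1.0 neutralizes negative prompts, here is a minimal numeric sketch of the standard classifier-free guidance formula (illustrative only, not Z-Image's actual code):

```python
import numpy as np

def cfg_combine(cond, uncond, cfg_scale):
    # Classifier-free guidance: push the prediction away from the
    # negative-prompt (unconditional) branch by cfg_scale.
    return uncond + cfg_scale * (cond - uncond)

cond = np.array([1.0, 2.0])    # prediction with the positive prompt
uncond = np.array([0.5, 0.5])  # prediction with the negative prompt

# At CFG 1.0 the negative branch cancels out entirely:
print(cfg_combine(cond, uncond, 1.0))  # [1. 2.] == cond
# At CFG 4.0 the negative prompt actively repels the output:
print(cfg_combine(cond, uncond, 4.0))  # [2.5 6.5]
```

At scale 1.0 the formula collapses to the conditional prediction alone, which is exactly why a negative prompt does almost nothing there.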

cfg_normalization

Z-Image has a model-specific cfg_normalization setting: False gives a more stylized, illustration/manga-like output, while True trends more realistic. If you want manga-style art, set it to False.

LoRA support for character training

Z-Image is said to have strong LoRA compatibility. If you train your own character as a LoRA, you can keep the face and body stable across many poses and compositions. Qwen-Image-Edit can do it too, but the setup is more complex and Z-Image’s ecosystem looks better organized.
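For reference, a LoRA in ComfyUI is wired in as a node between the model loader and the sampler. The fragment below is a hypothetical ComfyUI API-format snippet; the node IDs and the lora filename are placeholders, not part of any official Z-Image workflow:

```python
# Hypothetical ComfyUI API-format fragment: a LoraLoader node patched
# into the graph. "my_character_lora.safetensors" is a placeholder name.
workflow = {
    "10": {
        "class_type": "LoraLoader",
        "inputs": {
            "lora_name": "my_character_lora.safetensors",  # placeholder
            "strength_model": 0.8,  # how strongly the LoRA biases the UNet
            "strength_clip": 0.8,   # how strongly it biases the text encoder
            "model": ["1", 0],      # output of the diffusion-model loader
            "clip": ["2", 0],       # output of the text-encoder loader
        },
    },
}
print(workflow["10"]["class_type"])
```

The sampler then takes its model/clip inputs from node 10 instead of the loaders directly.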

Can it run on RunPod + ComfyUI?

Short answer: yes, with the same workflow as the Qwen NSFW setup. In fact, Z-Image is simpler.

Hardware requirements

| GPU | Z-Image | Qwen NSFW (Phr00t AIO) |
| --- | --- | --- |
| RTX 4090 (24GB) | Works (about 12GB in bf16) | Does not work (needs 28GB) |
| RTX 5090 (32GB) | Plenty of room | Barely works |

Because Z-Image is only 6B parameters, it runs comfortably on an RTX 4090. I had to move Qwen NSFW over to a 5090, but Z-Image should be fine on a 4090. That also means the cost stays around $0.59/hour.
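The "about 12GB" figure lines up with simple back-of-envelope math for bf16 weights:

```python
# bf16 stores each parameter in 2 bytes, so the weights of a 6B model
# come to roughly 11-12 GB; activations and the VAE add a few GB on top.
params = 6e9
bytes_per_param = 2  # bf16
vram_gb = params * bytes_per_param / 1024**3
print(round(vram_gb, 1))  # ~11.2 GB for the weights alone
```

That leaves comfortable headroom on a 24GB card, unlike a single 28GB AIO file that cannot fit at all.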

ComfyUI support

ComfyUI supports it natively from day one. Unlike the Qwen NSFW setup, there is no need to install custom nodes like TextEncodeQwenImageEditPlus or swap in Phr00t’s modified nodes_qwen.py.

Model file layout

ComfyUI/models/
├── text_encoders/
│   └── qwen_3_4b.safetensors
├── diffusion_models/
│   └── z_image_bf16.safetensors
└── vae/
    └── ae.safetensors

Qwen NSFW’s Phr00t AIO was a single 28GB file, while Z-Image is split into separate files. Still, it is clean and simple with only three files.
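With only three files, a quick sanity check is easy. This is a small helper of my own (the root path matches the RunPod layout above, not anything ComfyUI ships):

```python
from pathlib import Path

def missing_models(root="/workspace/ComfyUI/models"):
    """Return the expected model files that are not present under root."""
    expected = [
        "text_encoders/qwen_3_4b.safetensors",
        "diffusion_models/z_image_bf16.safetensors",
        "vae/ae.safetensors",
    ]
    return [p for p in expected if not (Path(root) / p).is_file()]

if __name__ == "__main__":
    print(missing_models() or "all models in place")
```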

Expected setup flow

The flow is basically the same as the Qwen NSFW setup.

1. Create a Pod

  • GPU: RTX 4090 is enough
  • Template: runpod/comfyui:latest is fine

Unlike the Qwen NSFW version, which needed a 5090+ specific template, this runs on the standard setup.

2. Download the models

pip install huggingface_hub

cd /workspace/ComfyUI/models

# Diffusion model
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('Tongyi-MAI/Z-Image', 'z_image_bf16.safetensors', local_dir='./diffusion_models/')
"

# Text encoder (Qwen 3 4B)
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('Tongyi-MAI/Z-Image', 'qwen_3_4b.safetensors', local_dir='./text_encoders/')
"

# VAE
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('Tongyi-MAI/Z-Image', 'ae.safetensors', local_dir='./vae/')
"
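The three one-liners above can also be collapsed into a single loop. This assumes huggingface_hub is installed and you run it from /workspace/ComfyUI/models, and that the filenames match the repo exactly as in the commands above:

```python
# Map each file in the Tongyi-MAI/Z-Image repo to its ComfyUI subfolder.
FILES = {
    "z_image_bf16.safetensors": "diffusion_models",
    "qwen_3_4b.safetensors": "text_encoders",
    "ae.safetensors": "vae",
}

def download_all(repo="Tongyi-MAI/Z-Image"):
    # Import inside the function so the mapping is usable without the package.
    from huggingface_hub import hf_hub_download
    for filename, subdir in FILES.items():
        hf_hub_download(repo, filename, local_dir=f"./{subdir}/")
```

Call download_all() from the models directory and all three files land in the right subfolders.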

3. Configure ComfyUI

  • Sampler: euler / dpmpp_2m
  • Scheduler: AuraFlow
  • Steps: 28 to 50
  • CFG: 3.0 to 5.0
  • Resolution: 1024x1024 recommended, but 512x512 to 2048x2048 is supported

No custom nodes are needed. The standard ComfyUI nodes are enough.
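For reference, the settings above translate into KSampler inputs roughly like this (values straight from the list above; the scheduler name may be spelled differently in your ComfyUI build, so treat this as a sketch):

```python
# Starting-point KSampler settings for Z-Image, per the list above.
ksampler_inputs = {
    "sampler_name": "euler",   # or "dpmpp_2m"
    "scheduler": "AuraFlow",   # name may differ across ComfyUI versions
    "steps": 28,               # raise toward 50 for more detail
    "cfg": 4.0,                # midpoint of the 3.0-5.0 range
}
print(ksampler_inputs)
```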

How the img2img approach differs

Both can do img2img, but the mechanism is different.

  • Z-Image: a normal diffusion img2img pipeline. It adds noise to the input image and denoises it. denoise strength controls how much of the original image is kept
  • Qwen-Image-Edit: an image-editing model. It recognizes the character in the input image and edits it according to the prompt
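The denoise knob in the first approach maps onto the sampler in a simple way. This sketch shows the standard diffusion img2img behavior, not Z-Image-specific code:

```python
def img2img_schedule(total_steps, denoise):
    # denoise=1.0 regenerates from pure noise (plain txt2img);
    # denoise=0.3 keeps most of the input image and only runs
    # the last 30% of the denoising steps.
    steps_to_run = round(total_steps * denoise)
    start_step = total_steps - steps_to_run
    return start_step, total_steps

print(img2img_schedule(28, 0.5))  # (14, 28): start at step 14 of 28
print(img2img_schedule(28, 1.0))  # (0, 28): full regeneration
```

Lower denoise values are what keep the original character's silhouette intact while still letting the prompt restyle the image.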

If you want stable manga-style art while keeping the character consistent, the Z-Image img2img + LoRA combination looks more stable. Once the character is trained as a LoRA, both txt2img and img2img should drift less.

Qwen-Image-Edit’s editing approach is intuitive for “change this image into X”, but if you want many consistent variations, it tends to drift more.

There is also an image-editing model in the Z-Image family, Z-Image-Edit, so that is another option when you specifically want editing.

ControlNet can lock the composition

Z-Image-Turbo-Fun-Controlnet-Union is already out. With ControlNet, you can specify a pose and still keep the character stable, which is strong if you want the same character in many poses.
