
Can Z-Image run on RunPod? I checked it for character consistency

Ikesan

Why I looked into this

I had just run Qwen-Image-Edit on RunPod when, on January 28, 2026, the full version of Z-Image was released by the same Tongyi lab at Alibaba.

What stood out in the samples was how stable the character shapes looked. When I generate illustrations with Qwen-Image-Edit, faces and body shapes sometimes drift. Z-Image is said to support negative prompts, a wider CFG range, and strong LoRA compatibility, so I wondered if it could produce stable manga-style art while keeping the character consistent.

So I checked whether it would run as-is in my current RunPod + ComfyUI setup.

What Z-Image is

Z-Image is a 6B parameter image-generation foundation model developed by Alibaba Tongyi-MAI. It is released under Apache 2.0, so it is commercially usable.

Model lineup

| Model | Purpose | Notes |
| --- | --- | --- |
| Z-Image | Base model | High quality, high diversity, negative prompts, LoRA support |
| Z-Image-Turbo | Fast generation | 8 steps, sub-second inference |
| Z-Image-Omni-Base | Unified base | Generation + editing, fine-tuning target |
| Z-Image-Edit | Image editing | Editing-focused |

Architecture

  • S3-DiT (Scalable Single-Stream Diffusion Transformer)
  • Flow Matching-based
  • Text encoder: Qwen 3 4B
  • Achieves quality comparable to much larger models such as FLUX with only 6B parameters
  • Ranked #1 among open-source models on the Artificial Analysis leaderboard

Comparison with Qwen-Image-Edit

Here are the differences that matter for character consistency:

| | Z-Image | Qwen-Image-Edit |
| --- | --- | --- |
| Parameters | 6B | 7B (Qwen2.5-VL-7B) |
| Negative prompts | Supported | Supported |
| CFG | 3.0-5.0 | 1.0 |
| LoRA support | Officially supported | Possible, but the setup is more complex |
| ControlNet | Supported (Union version available) | Limited |
| Use case | txt2img / img2img | img2img / image editing |
| Style control | Switch with cfg_normalization | Prompt-driven |

CFG and negative prompt differences

Qwen-Image-Edit is fixed at CFG 1.0, so negative prompts have little effect. Z-Image lets you tune CFG between 3.0 and 5.0, and negative prompts work properly. Whether prompts like bad anatomy, deformed actually take effect matters a lot for character drift, so this is a big deal.
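To see why CFG 1.0 neutralizes negative prompts, here is a minimal numeric sketch of the standard classifier-free guidance formula (illustrative only, not Z-Image's actual code):

```python
import numpy as np

def cfg_combine(cond, uncond, cfg_scale):
    # Classifier-free guidance: push the prediction away from the
    # negative-prompt (unconditional) branch by cfg_scale.
    return uncond + cfg_scale * (cond - uncond)

cond = np.array([1.0, 2.0])    # prediction with the positive prompt
uncond = np.array([0.5, 0.5])  # prediction with the negative prompt

# At CFG 1.0 the negative branch cancels out entirely:
print(cfg_combine(cond, uncond, 1.0))  # [1. 2.] == cond
# At CFG 4.0 the negative prompt actively repels the output:
print(cfg_combine(cond, uncond, 4.0))  # [2.5 6.5]
```

At scale 1.0 the formula collapses to the conditional prediction alone, which is exactly why a negative prompt does almost nothing there.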

cfg_normalization

Z-Image has a model-specific cfg_normalization setting: False gives a more stylized, illustration/manga-like output, while True trends more realistic. If you want manga-style art, set it to False.

LoRA support for character training

Z-Image is said to have strong LoRA compatibility. If you train your own character as a LoRA, you can keep the face and body stable across many poses and compositions. Qwen-Image-Edit can do it too, but the setup is more complex and Z-Image’s ecosystem looks better organized.
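For reference, a LoRA in ComfyUI is wired in as a node between the model loader and the sampler. The fragment below is a hypothetical ComfyUI API-format snippet; the node IDs and the lora filename are placeholders, not part of any official Z-Image workflow:

```python
# Hypothetical ComfyUI API-format fragment: a LoraLoader node patched
# into the graph. "my_character_lora.safetensors" is a placeholder name.
workflow = {
    "10": {
        "class_type": "LoraLoader",
        "inputs": {
            "lora_name": "my_character_lora.safetensors",  # placeholder
            "strength_model": 0.8,  # how strongly the LoRA biases the UNet
            "strength_clip": 0.8,   # how strongly it biases the text encoder
            "model": ["1", 0],      # output of the diffusion-model loader
            "clip": ["2", 0],       # output of the text-encoder loader
        },
    },
}
print(workflow["10"]["class_type"])
```

The sampler then takes its model/clip inputs from node 10 instead of the loaders directly.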

Can it run on RunPod + ComfyUI?

Short answer: yes, with the same workflow as the Qwen NSFW setup. In fact, Z-Image is simpler.

Hardware requirements

| GPU | Z-Image | Qwen NSFW (Phr00t AIO) |
| --- | --- | --- |
| RTX 4090 (24GB) | Works (about 12GB in bf16) | Does not work (needs 28GB) |
| RTX 5090 (32GB) | Plenty of room | Barely works |

Because Z-Image is only 6B parameters, it runs comfortably on an RTX 4090. I had to move Qwen NSFW over to a 5090, but Z-Image should be fine on a 4090. That also means the cost stays around $0.59/hour.
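The "about 12GB" figure lines up with simple back-of-envelope math for bf16 weights:

```python
# bf16 stores each parameter in 2 bytes, so the weights of a 6B model
# come to roughly 11-12 GB; activations and the VAE add a few GB on top.
params = 6e9
bytes_per_param = 2  # bf16
vram_gb = params * bytes_per_param / 1024**3
print(round(vram_gb, 1))  # ~11.2 GB for the weights alone
```

That leaves comfortable headroom on a 24GB card, unlike a single 28GB AIO file that cannot fit at all.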

ComfyUI support

ComfyUI supports it natively from day one. Unlike the Qwen NSFW setup, there is no need to install custom nodes like TextEncodeQwenImageEditPlus or swap in Phr00t’s modified nodes_qwen.py.

Model file layout

ComfyUI/models/
├── text_encoders/
│   └── qwen_3_4b.safetensors
├── diffusion_models/
│   └── z_image_bf16.safetensors
└── vae/
    └── ae.safetensors

Qwen NSFW’s Phr00t AIO was a single 28GB file, while Z-Image is split into separate files. Still, it is clean and simple with only three files.
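With only three files, a quick sanity check is easy. This is a small helper of my own (the root path matches the RunPod layout above, not anything ComfyUI ships):

```python
from pathlib import Path

def missing_models(root="/workspace/ComfyUI/models"):
    """Return the expected model files that are not present under root."""
    expected = [
        "text_encoders/qwen_3_4b.safetensors",
        "diffusion_models/z_image_bf16.safetensors",
        "vae/ae.safetensors",
    ]
    return [p for p in expected if not (Path(root) / p).is_file()]

if __name__ == "__main__":
    print(missing_models() or "all models in place")
```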

Expected setup flow

The flow is basically the same as the Qwen NSFW setup.

1. Create a Pod

  • GPU: RTX 4090 is enough
  • Template: runpod/comfyui:latest is fine

Unlike the Qwen NSFW version, which needed a 5090+ specific template, this runs on the standard setup.

2. Download the models

pip install huggingface_hub

cd /workspace/ComfyUI/models

# Diffusion model
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('Tongyi-MAI/Z-Image', 'z_image_bf16.safetensors', local_dir='./diffusion_models/')
"

# Text encoder (Qwen 3 4B)
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('Tongyi-MAI/Z-Image', 'qwen_3_4b.safetensors', local_dir='./text_encoders/')
"

# VAE
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download('Tongyi-MAI/Z-Image', 'ae.safetensors', local_dir='./vae/')
"
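The three one-liners above can also be collapsed into a single loop. This assumes huggingface_hub is installed and you run it from /workspace/ComfyUI/models, and that the filenames match the repo exactly as in the commands above:

```python
# Map each file in the Tongyi-MAI/Z-Image repo to its ComfyUI subfolder.
FILES = {
    "z_image_bf16.safetensors": "diffusion_models",
    "qwen_3_4b.safetensors": "text_encoders",
    "ae.safetensors": "vae",
}

def download_all(repo="Tongyi-MAI/Z-Image"):
    # Import inside the function so the mapping is usable without the package.
    from huggingface_hub import hf_hub_download
    for filename, subdir in FILES.items():
        hf_hub_download(repo, filename, local_dir=f"./{subdir}/")
```

Call download_all() from the models directory and all three files land in the right subfolders.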

3. Configure ComfyUI

  • Sampler: euler / dpmpp_2m
  • Scheduler: AuraFlow
  • Steps: 28 to 50
  • CFG: 3.0 to 5.0
  • Resolution: 1024x1024 recommended, but 512x512 to 2048x2048 is supported

No custom nodes are needed. The standard ComfyUI nodes are enough.
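For reference, the settings above translate into KSampler inputs roughly like this (values straight from the list above; the scheduler name may be spelled differently in your ComfyUI build, so treat this as a sketch):

```python
# Starting-point KSampler settings for Z-Image, per the list above.
ksampler_inputs = {
    "sampler_name": "euler",   # or "dpmpp_2m"
    "scheduler": "AuraFlow",   # name may differ across ComfyUI versions
    "steps": 28,               # raise toward 50 for more detail
    "cfg": 4.0,                # midpoint of the 3.0-5.0 range
}
print(ksampler_inputs)
```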

How the img2img approach differs

Both can do img2img, but the mechanism is different.

  • Z-Image: a normal diffusion img2img pipeline. It adds noise to the input image and denoises it. denoise strength controls how much of the original image is kept
  • Qwen-Image-Edit: an image-editing model. It recognizes the character in the input image and edits it according to the prompt
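The denoise knob in the first approach maps onto the sampler in a simple way. This sketch shows the standard diffusion img2img behavior, not Z-Image-specific code:

```python
def img2img_schedule(total_steps, denoise):
    # denoise=1.0 regenerates from pure noise (plain txt2img);
    # denoise=0.3 keeps most of the input image and only runs
    # the last 30% of the denoising steps.
    steps_to_run = round(total_steps * denoise)
    start_step = total_steps - steps_to_run
    return start_step, total_steps

print(img2img_schedule(28, 0.5))  # (14, 28): start at step 14 of 28
print(img2img_schedule(28, 1.0))  # (0, 28): full regeneration
```

Lower denoise values are what keep the original character's silhouette intact while still letting the prompt restyle the image.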

If you want stable manga-style art while keeping the character consistent, the Z-Image img2img + LoRA combination looks more stable. Once the character is trained as a LoRA, both txt2img and img2img should drift less.

Qwen-Image-Edit’s editing approach is intuitive for “change this image into X”, but if you want many consistent variations, it tends to drift more.

There is also an image-editing model in the Z-Image family, Z-Image-Edit, so that is another option when you specifically want editing.

ControlNet can lock the composition

Z-Image-Turbo-Fun-Controlnet-Union is already out. With ControlNet, you can specify a pose and still keep the character stable, which is strong if you want the same character in many poses.
