
Why are image-generation VAEs so heavy? Comparing the Qwen-Image and HunyuanImage architectures

Ikesan

When I run Qwen-Image-Edit locally, the VAE inference phase takes an annoyingly long time. Even on an M1 Max with 64GB, the model pauses in a way you can really feel. The main 20B model is obviously heavy, but I wanted to understand why the VAE itself is also so slow.

Around the same time, Kohya, the developer behind sd-scripts and musubi-tuner, tweeted that he had reduced VAE memory use from 32GB to 6GB, and Tencent’s HunyuanImage 2.1 was taking a very different approach to VAE design. So I put the pieces together.

what the VAE does

The basic pipeline for image generation is:

text -> text encoder -> DiT (denoising) -> VAE decode -> image

The VAE translates between images and latent space. The encoder compresses images into latent representations, and the decoder reconstructs images from those latents.
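The shape contract is easy to sketch. This is a toy illustration, not a real model: the downsampling factor and latent channel count vary per architecture (an SD-style VAE with 8x spatial compression and 4 latent channels is assumed here):

```python
# Toy sketch of a VAE's shape contract (no real model involved).
# Assumes an SD-style VAE: 8x spatial downsampling, 4 latent channels.

def encode_shape(h, w, factor=8, latent_channels=4):
    """Image (3, h, w) -> latent (latent_channels, h//factor, w//factor)."""
    assert h % factor == 0 and w % factor == 0
    return (latent_channels, h // factor, w // factor)

def decode_shape(c, h, w, factor=8):
    """Latent (c, h, w) -> image (3, h*factor, w*factor)."""
    return (3, h * factor, w * factor)

latent = encode_shape(1024, 1024)   # (4, 128, 128)
image = decode_shape(*latent)       # (3, 1024, 1024)
print(latent, image)
```

The DiT only ever sees the small latent tensor; the VAE is the component that pays the cost of crossing back into pixel space.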

What consumes VRAM at inference time is mainly the DiT and the VAE. DiT optimizations such as fp8 quantization, block swap, and SageAttention have progressed, but the VAE side has received much less attention.

why Qwen-Image-Edit’s VAE is heavy

a large Wan-2.1-VAE base

Qwen-Image’s VAE is based on Wan-2.1-VAE. It uses a single-encoder / dual-decoder setup, with the encoder frozen from Wan-2.1 and the decoder fine-tuned on text-rich data such as PDFs, posters, and synthetic text.

That improved text rendering, but it also increased the parameter count and compute cost of the VAE itself.

the cost of double encoding

Qwen-Image-Edit encodes the input image twice: once through the vision-language model to capture what is in the image, and once through the VAE encoder to capture how it looks at the pixel level. Because it encodes twice, it can preserve the character while changing only the pose, but it also means the VAE-side computation is paid on every edit, on top of the final decode.
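In rough pseudocode (the function names here are illustrative, not the actual Qwen-Image-Edit API), the edit pipeline conditions the DiT on both passes over the input image:

```python
# Illustrative sketch of double encoding (all names are hypothetical).

def vl_encode(image):
    """Semantic pass: a vision-language encoder -> 'what is in the image'."""
    return {"kind": "semantic", "tokens": 256}

def vae_encode(image):
    """Appearance pass: the VAE encoder -> pixel-faithful latents."""
    return {"kind": "appearance", "latent": (16, 128, 128)}

def build_condition(image):
    # Both passes run on every edit, so the VAE encoder's cost is paid
    # up front, in addition to the VAE decode at the end.
    return [vl_encode(image), vae_encode(image)]

cond = build_condition("input.png")
print([c["kind"] for c in cond])
```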

Apple Silicon compatibility

As I wrote in the previous article, the default VAE runs in bfloat16, and that can produce NaNs on Apple's MPS backend. --fp16-vae works around it, but a model this sensitive to numerical precision is also a model doing a lot of heavy computation.

HunyuanImage 2.1 takes the opposite approach: a 32x compressed VAE

Tencent’s HunyuanImage 2.1 took the opposite direction for VAE design.

compression ratios

| model | VAE compression | note |
| --- | --- | --- |
| Stable Diffusion | 8x | the classic choice |
| FLUX | 16x | halves the latent size vs SD |
| HunyuanImage 2.1 | 32x | very high compression |

With 32x spatial compression, a 2048x2048 image becomes a 64x64 latent. FLUX at 16x gives 128x128, and SD at 8x gives 256x256. Smaller latents mean fewer tokens for the DiT, so inference on the DiT side becomes much lighter.
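The arithmetic works out as follows. The DiT patch size of 2 is a typical value for DiT-style models, assumed here rather than taken from either model's published specs:

```python
# Latent side length and rough DiT token count per VAE compression factor.
# patch=2 is a common DiT patchify size (an assumption, not model-specific).

def latent_side(image_side, vae_factor):
    return image_side // vae_factor

def dit_tokens(image_side, vae_factor, patch=2):
    side = latent_side(image_side, vae_factor) // patch
    return side * side

for name, factor in [("SD (8x)", 8), ("FLUX (16x)", 16), ("HunyuanImage 2.1 (32x)", 32)]:
    side = latent_side(2048, factor)
    print(f"{name}: latent {side}x{side}, ~{dit_tokens(2048, factor)} tokens")
```

Since self-attention cost grows quadratically with token count, halving the latent side (16x -> 32x compression) cuts the attention cost by roughly 16x at the same resolution.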

preserving quality with DINOv2

Normally, higher compression means lower quality. HunyuanImage 2.1 aligns the VAE feature space with DINOv2, a self-supervised model from Meta, so that semantic information survives even with the high compression.
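One common way to implement this kind of alignment (a generic sketch; HunyuanImage 2.1's actual loss may differ) is an auxiliary objective that pulls the VAE's features toward the frozen DINOv2 features, for example via cosine similarity:

```python
import numpy as np

def cosine_align_loss(vae_feat, dino_feat):
    """1 - mean cosine similarity between per-patch feature vectors."""
    a = vae_feat / np.linalg.norm(vae_feat, axis=-1, keepdims=True)
    b = dino_feat / np.linalg.norm(dino_feat, axis=-1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(a * b, axis=-1)))

# (num_patches, dim) features; in practice a learned linear projection
# would map the VAE's channel count onto DINOv2's embedding dim first.
rng = np.random.default_rng(0)
vae_feat = rng.standard_normal((64, 768))
loss_random = cosine_align_loss(vae_feat, rng.standard_normal((64, 768)))
loss_same = cosine_align_loss(vae_feat, vae_feat)
print(loss_random, loss_same)  # near 1.0 for unrelated features, 0.0 for identical
```

Minimizing a term like this during VAE training encourages the latent space to keep the semantics DINOv2 sees, even when 32x compression discards most pixel-level detail.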

That lets it process a 2K image with roughly the token budget of a 1K image in other systems. Kohya also commented that because the VAE is so highly compressed, inference is light for its size.

the 17B DiT

The DiT is a 17B dual-stream design. It is in the same class as Qwen-Image’s 20B model, but because the VAE compresses so aggressively, the effective inference cost is much lower.

comparing the design philosophy

|  | Qwen-Image | HunyuanImage 2.1 |
| --- | --- | --- |
| VAE compression | low to medium, based on Wan-2.1 | 32x, extremely compressed |
| VAE weight | heavy | relatively light |
| DiT load | high | low |
| editing ability | strong, because of double encoding | generation-oriented |
| text rendering | very high quality | high quality |
| VRAM requirement | high | relatively lower |

Qwen-Image tries to improve edit accuracy by stuffing more capability into the VAE. HunyuanImage 2.1 tries to lighten the VAE and make the whole system more efficient. They are different goals, but for VRAM-constrained machines HunyuanImage’s design is clearly attractive.

Kohya’s VAE memory optimization

musubi-tuner, Kohya’s training and inference framework, is also moving toward better VAE memory efficiency.

main techniques

  • VAE tiling: split the image and feed tiles through the VAE to lower peak VRAM
  • CPU offload: use --vae_cache_cpu to keep the VAE cache in system memory
  • FramePack support: always enable VAE tiling for video models
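The tiling idea can be sketched with a dummy decoder. Real implementations (including musubi-tuner's) also overlap tiles and blend the seams, which this sketch omits; the stand-in decoder is just a nearest-neighbor upsample:

```python
import numpy as np

def dummy_decode(latent_tile, factor=8):
    """Stand-in for a VAE decoder: nearest-neighbor upsample by `factor`."""
    return latent_tile.repeat(factor, axis=-2).repeat(factor, axis=-1)

def tiled_decode(latent, tile=32, factor=8):
    """Decode tile by tile so only one tile's activations are alive
    at a time, lowering peak memory versus decoding the whole latent."""
    c, h, w = latent.shape
    out = np.zeros((3, h * factor, w * factor), dtype=latent.dtype)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            dec = dummy_decode(latent[:, y:y+tile, x:x+tile], factor)
            out[:, y*factor:(y+tile)*factor, x*factor:(x+tile)*factor] = dec
    return out

# 3 latent channels only because the stand-in decoder keeps channels as-is.
latent = np.random.rand(3, 64, 64).astype(np.float32)
image = tiled_decode(latent)
print(image.shape)  # (3, 512, 512)
```

With a purely local decoder like this one, the tiled result is identical to decoding in one pass; with a real convolutional decoder, the receptive field crosses tile borders, which is why overlap-and-blend is needed.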

the 32GB -> 6GB story

As of February 2026, Kohya says he reduced VAE memory use from 32GB to 6GB, but the change had not yet been merged into musubi-tuner at the time of writing. This looks like the usual pattern of tweeting a result first and shipping it later.

where Qwen-Image 2.0 is headed

Released on February 10, 2026, Qwen-Image 2.0 takes yet another approach to the heavy-VAE problem.

  • parameter count reduced from 20B to 7B
  • uses Qwen3-VL (8B) as the encoder and a 7B DiT as the decoder
  • generation and editing are unified into a single model
  • native 2K resolution support

The idea is to reduce the load indirectly by shrinking the whole model. At the time of writing, though, only API access is available and local weights are not public yet.

the reality for local users

What you can do locally right now:

  1. RTX 4090 or better: use fp8 quantization, SageAttention, and VAE tiling
  2. Apple Silicon with 64GB+: --fp16-vae is required; slow but workable
  3. 24GB VRAM or less: cloud GPU services such as RunPod are the realistic option

If Kohya’s VAE optimization gets merged into musubi-tuner, even 24GB setups may stop treating the VAE as the bottleneck. And if high-compression VAEs like HunyuanImage’s spread to other models, things should improve further.

I am also waiting for the Qwen-Image 2.0 weights. If 7B can really handle editing too, maybe even 16GB setups will become viable.