
Tracking Down Why Qwen Image Edit Started Taking 10 Minutes After a ComfyUI Update

In a previous article, I was generating images with Qwen Image Edit in about 80 seconds on an M1 Max 64GB.

Then I updated ComfyUI, and the same environment started taking nearly 10 minutes. I thought changing startup options might fix it, but that wasn’t the issue.

Environment

  • Mac Studio M1 Max 64GB
  • macOS 26.3 (Tahoe)
  • Python 3.12.12
  • PyTorch 2.10.0
  • ComfyUI v0.16.4

Initial Suspicion: --gpu-only

When I wrote the Qwen article, the startup command was just python main.py --fp16-vae. After that, I’d added --gpu-only to an alias for Illustrious speed improvements.

python main.py --gpu-only --fp16-vae

--gpu-only is unnecessary on Apple Silicon (CPU and GPU share the same physical memory with unified memory), so I removed it, but sampling time barely changed. Not a startup option issue.

Benchmark Reveals: BF16 Is 2x Slower on MPS

Nothing was improving despite trying various things, so I benchmarked matmul performance directly on MPS.

import torch, time
device = torch.device('mps')
a = torch.randn(4096, 4096, dtype=torch.bfloat16, device=device)
b = torch.randn(4096, 4096, dtype=torch.bfloat16, device=device)
_ = a @ b; torch.mps.synchronize()  # warmup
t0 = time.time()
for _ in range(10): _ = a @ b
torch.mps.synchronize()  # flush queued kernels before stopping the clock
print(f"{(time.time() - t0) / 10 * 1000:.1f} ms (average of 10 runs)")
| dtype | 4096x4096 matmul | Relative speed |
|---|---|---|
| FP16 | 14.5 ms | 1.0x (fastest) |
| FP32 | 15.9 ms | 1.1x |
| BF16 | 30.3 ms | 2.1x (slowest) |

BF16 is 2x slower than FP16. Even slower than FP32. Same result on PyTorch 2.4.1, so it’s not a version issue — MPS’s BF16 implementation itself is slow.

M1/M2/M3 chips lack hardware acceleration for BF16 (native support starts with M4), so it’s running through software emulation.

Qwen Image Edit’s diffusion model runs with model weight dtype torch.bfloat16. Meaning all operations run through this 2x-slower BF16.
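You can verify this yourself by inspecting a loaded checkpoint's state dict. Here's a small helper (`dtype_histogram` is my name, not a ComfyUI function) that tallies tensor dtypes:

```python
from collections import Counter

def dtype_histogram(state_dict):
    """Count how many tensors in a state dict use each dtype."""
    return Counter(str(t.dtype) for t in state_dict.values())

# With a loaded checkpoint (sd is a dict of name -> tensor):
# print(dtype_histogram(sd))
# A BF16 Qwen checkpoint shows nearly all weights as torch.bfloat16.
```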

Forcing FP16 and the Results

I tried --fp16-unet to force FP16.

Result generated with FP16 (completely black)

Completely black output. Since macOS 14.5, PyTorch’s MPS backend has a bug where FP16 Attention outputs NaN. ComfyUI handles this:

# comfy/model_management.py (simplified)
def force_upcast_attention_dtype():
    upcast = False
    macos_version = mac_version()
    if macos_version is not None and (14, 5) <= macos_version:
        upcast = True
    if upcast:
        return {torch.float16: torch.float32}  # upcast FP16 Attention to FP32
    return None

FP16 model → upcast only Attention to FP32 → correct image should appear… but for some models (especially AIO models loaded with CheckpointLoaderSimple), upcast doesn’t work and you get black images.
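To make the mapping concrete: the caller looks up the model's dtype in the returned dict and, if it's present, runs attention math in the mapped dtype. A simplified sketch of that lookup (illustrative, not ComfyUI's actual code):

```python
def attention_compute_dtype(model_dtype, upcast_map):
    """Pick the dtype attention math runs in, given an upcast mapping."""
    if upcast_map and model_dtype in upcast_map:
        return upcast_map[model_dtype]
    return model_dtype

# With the {torch.float16: torch.float32} mapping:
# FP16 attention is upcast to FP32, BF16 is left as-is.
```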

Root Cause: Investigating ComfyUI Commits

Digging through git log, I found two relevant commits.

Commit 1: 96d891cb (2025-02-24)

# Before: upcast all Attention dtypes to FP32
return torch.float32

# After: only upcast FP16 to FP32, leave BF16 as-is
return {torch.float16: torch.float32}

Commit 2: 6e28a464 (2025-06-10)

# Before: only upcast for macOS 14.5–15.x (exclude 16+)
if (14, 5) <= macos_version < (16,):

# After: upcast for all macOS 14.5+
if (14, 5) <= macos_version:

My initial hypothesis was “BF16 Attention’s FP32 upcast was removed, causing slowdown,” but when I actually re-enabled BF16→FP32 upcast, it got 1.6x slower (5:29 vs 3:59). The Attention intermediate tensors are massive, so the conversion cost and memory bandwidth increase become the bottleneck. Benchmark numbers don’t map directly.
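The bandwidth argument can be made concrete with a rough sizing sketch: the attention score matrix grows with the square of the sequence length, so upcasting it from 2-byte to 4-byte elements doubles an already huge memory footprint. The dimensions below are illustrative, not Qwen's actual ones:

```python
def attn_scores_mib(seq_len, heads, bytes_per_elem):
    """Size of one attention score matrix (heads x seq_len x seq_len) in MiB."""
    return heads * seq_len * seq_len * bytes_per_elem / 2**20

# Illustrative numbers: 4096 tokens, 24 heads.
# 2-byte (FP16/BF16) scores: 768 MiB; 4-byte (FP32) scores: 1536 MiB per attention call.
print(attn_scores_mib(4096, 24, 2), attn_scores_mib(4096, 24, 4))
```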

So the speed in the old setup didn't come from the FP32 upcast. The model was most likely running in FP16 back then; the upcast existed to prevent black images, not to make things fast.

At some point during ComfyUI updates, Qwen Image Edit’s default inference dtype changed from FP16 to BF16, which combined with M1–M3’s lack of BF16 hardware support to cause the slowdown — that’s the most coherent explanation.

Full Test Results

Split model (qwen_image_edit_2511_bf16.safetensors / 38GB)

| Setting | Attention | 1 step | Sampling (4 steps) | Image |
|---|---|---|---|---|
| --fp16-vae | sub-quadratic (BF16) | ~51s | 3:59 | OK |
| --fp16-vae --fp16-unet | sub-quadratic (FP32 upcast) | ~38s | 2:33 | OK |
| --fp16-vae --fp16-unet --use-split-cross-attention | split (FP32 upcast) | ~128s | gave up | - |
| --fp16-vae + upcast disabled | sub-quadratic (FP16) | ~25s | 1:40 | black |

Image generated successfully with BF16

AIO model (Qwen-Rapid-AIO-NSFW-v16 / 26GB FP8)

FP8 models aren’t supported on MPS, so a patch to convert to BF16 on load is required.

| Conversion target | 1 step | Sampling (4 steps) | Image |
|---|---|---|---|
| FP16 | ~25s | 1:59 | black |
| BF16 | ~55s | 3:40 | OK |

AIO BF16 generated image

GGUF (qwen-image-edit-2511-Q8_0.gguf / 20GB)

| Setting | 1 step | Sampling (4 steps) | Image |
|---|---|---|---|
| --fp16-vae | ~48s | 3:12 | blurry |

GGUF Q8_0 generated image (blurry)

Current Best Settings

python main.py --fp16-vae --fp16-unet

Use with the split model (qwen_image_edit_2511_bf16.safetensors). FP16 linear + FP32 Attention upcast combination gives 2:33 sampling.

Doesn’t get back to 80 seconds, but a big improvement from 10 minutes.

Patch for AIO Models

FP8 models can’t run directly on MPS, so they need to be converted to BF16 on load. Modify these three files:

comfy/sd.py

Add after weight_dtype in the load_state_dict_guess_config function:

# MPS does not support float8 types - convert FP8 weights to BF16 upfront
if model_management.is_device_mps(load_device):
    fp8_types = model_management.FLOAT8_TYPES
    converted = False
    for k in sd:
        if sd[k].dtype in fp8_types:
            sd[k] = sd[k].to(torch.bfloat16)
            converted = True
    if converted:
        weight_dtype = comfy.utils.weight_dtype(sd, diffusion_model_prefix)
        logging.info("Converted FP8 weights to BF16 for MPS compatibility")

comfy/model_management.py

Add at the start of the cast_to function:

# MPS does not support float8 types - cast to bf16 on CPU first
if device is not None and is_device_mps(device) and weight.dtype in FLOAT8_TYPES:
    if dtype is None or dtype in FLOAT8_TYPES:
        dtype = torch.bfloat16
    weight = weight.to(dtype=dtype)
    copy = False

Note: Converting to FP16 produces black images. Always use BF16.

comfy/supported_models.py

Add FP16 to supported_inference_dtypes in the QwenImage class (required when using --fp16-unet):

supported_inference_dtypes = [torch.bfloat16, torch.float16, torch.float32]
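Why this matters: ComfyUI only honors a forced dtype if the model class lists it as supported; otherwise it falls back to the first supported entry. A simplified sketch of that selection logic (illustrative, not the exact code in comfy.model_management):

```python
def pick_unet_dtype(forced_dtype, supported_dtypes, fallback):
    """Honor a forced dtype (e.g. from --fp16-unet) only if the model supports it."""
    if forced_dtype is not None and forced_dtype in supported_dtypes:
        return forced_dtype
    return supported_dtypes[0] if supported_dtypes else fallback

# Without the patch, float16 is absent from QwenImage's list,
# so --fp16-unet is silently ignored and BF16 wins.
```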

What the AI Assistants Suggested

I also asked Gemini and ChatGPT about this issue. I verified every suggestion on real hardware:

Gemini’s Suggestions

| Suggestion | Actual result |
|---|---|
| --force-fp16 | Black images (MPS FP16 Attention bug) |
| --use-split-cross-attention | 2.5x slower (128s/step) |
| --fp32-vae | Unnecessary; --fp16-vae works fine |
| --disable-smart-memory | No effect in Mac SHARED mode |
| GGUF Q4 model | Same speed, quality degraded |
| TORCH_MATH_DISABLE_SDPA=1 | Non-existent environment variable (hallucination) |

Three exchanges, zero reproducible improvements from real testing.

ChatGPT’s Suggestions

ChatGPT’s analysis more accurately reflected MPS/BF16 constraints. Particularly “M1–M3 lack BF16 hardware support” and “ComfyUI commits are the cause” were on target.

The only concrete suggestion was “upcasting BF16 Attention to FP32 should speed it up,” but as measured, it was actually 1.6x slower. A prediction based on raw matmul benchmarks falls apart once memory bandwidth for the massive Attention intermediate tensors dominates. Theory and measurement don’t always agree.

“It’s Slow Because It’s Flux-Based” — Not Quite

ComfyUI recognizes Qwen Image Edit as model_type FLUX. From that, it’s tempting to conclude “it’s slow because it’s Flux-based,” but that’s not accurate.

  • FLUX.1 Kontext: 12B parameter rectified flow transformer
  • Qwen Image Edit: 20B parameter MMDiT

Same “flow-based edit model” category but completely different scales. Qwen Image Edit isn’t “slow because it’s flow-based” — it’s slow because a 20B-class model has poor compatibility with MPS/BF16. The bottleneck is the dtype constraint, not the model family.

Summary

| Problem | Cause |
|---|---|
| Generation takes 10 minutes | MPS BF16 is 2x slower than FP16 (M1–M3 lack BF16 hardware) |
| FP16 produces black images | MPS FP16 Attention bug since macOS 14.5 |
| It was fast before | ComfyUI update changed Qwen Image Edit inference dtype from FP16 to BF16 |

With ComfyUI + PyTorch MPS at this point, --fp16-vae --fp16-unet at 2:33 is the ceiling.

Next: MLX

Many of today’s problems stem from the PyTorch MPS path rather than Qwen Image Edit itself. Switching to MLX, Apple Silicon’s native framework, could potentially avoid both BF16 emulation slowness and the FP16 Attention bug.

The MLX community has published a Qwen-Image 8bit quantized version, with reported speeds of ~8.5 seconds/step on M-series Macs. mflux also has an open issue for Qwen Image Edit support.

That said, “MLX will definitely be faster” isn’t guaranteed — whether it produces the same image quality and editing precision as the ComfyUI workflow is a separate question. A realistic approach would be keeping ComfyUI as the overall hub while running separate MLX benchmarks just for Qwen Image Edit.

Either way, MLX is near the top of viable options left for Mac. So I actually tried it.

mflux Real-World Results

Running Qwen Image Edit 2509 with mflux v0.17.2. Same M1 Max 64GB environment, 4 steps.

# Install
uv tool install mflux

# Run (8-bit quantized)
mflux-generate-qwen-edit \
  --image-paths input.png \
  --prompt "Change to summer clothes" \
  --steps 4 --guidance 1.0 \
  --quantize 8 \
  --output output.png

| Setting | Steps | Sampling | Image |
|---|---|---|---|
| BF16 full (58GB) | 4 | 4:08 | blurry |
| 8-bit quantized | 4 | 2:18 | blurry |
| 4-bit quantized | 4 | 2:12 | collapsed |
| 8-bit quantized | 20 | 10:44 | OK |

4-step results had unusable quality. ComfyUI uses a Lightning LoRA (4-step optimized) + ModelSamplingAuraFlow + CFGNorm nodes setup, none of which have equivalents in mflux’s CLI. Getting quality from 4 steps without LoRA isn’t possible.

20-step Q8 hits usable quality, but takes 10:44.

mflux 4-step Q8 (blurry)

mflux 4-step Q8 hair color change (blurry)

mflux 4-bit (completely collapsed)

ComfyUI Custom Node Integration

Made mflux usable as ComfyUI nodes rather than CLI. A thin wrapper calling mflux via subprocess.

LoadImage → MfluxQwenImageEdit → SaveImage
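The core of such a wrapper is just assembling the CLI invocation shown earlier and shelling out. A minimal sketch (function names are mine, not from a published node):

```python
import subprocess

def build_mflux_cmd(image_path, prompt, output_path,
                    steps=4, quantize=8, lora_path=None, lora_scale=1.0):
    """Assemble the mflux-generate-qwen-edit invocation used in this article."""
    cmd = ["mflux-generate-qwen-edit",
           "--image-paths", image_path,
           "--prompt", prompt,
           "--steps", str(steps),
           "--guidance", "1.0",
           "--quantize", str(quantize),
           "--output", output_path]
    if lora_path is not None:
        cmd += ["--lora-paths", lora_path, "--lora-scales", str(lora_scale)]
    return cmd

def run_edit(image_path, prompt, output_path, **kwargs):
    """The node's execute method boils down to this subprocess call."""
    subprocess.run(build_mflux_cmd(image_path, prompt, output_path, **kwargs),
                   check=True)
```

Keeping mflux in a subprocess (rather than importing it) isolates its MLX runtime from ComfyUI's PyTorch process entirely.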

mflux 20-step Q8 via ComfyUI node (winter→summer outfit)

20-step Q8 converted from winter to summer outfit while preserving character appearance. PyTorch MPS path is completely bypassed — no black images, no BF16 slowdown.

Solving the 4-Step Quality Issue with Lightning LoRA

mflux supports external LoRA application. Applying the Lightning LoRA that Rapid-AIO bakes in at inference time should enable 4-step quality.

mflux-generate-qwen-edit \
  --image-paths input.png \
  --prompt "Change to summer clothes" \
  --steps 4 --guidance 1.0 \
  --quantize 8 \
  --lora-paths "Qwen-Image-Edit-Lightning-4steps-V1.0-bf16.safetensors" \
  --lora-scales 1.0 \
  --output output.png

mflux Lightning LoRA + 4-step Q8 (winter→summer)

LoRA application dramatically improved quality. 720 layers matched, sampling 2:28. Completely different from the blurry 4-step without LoRA — character maintained while properly converting from winter to summer outfit.

Full Comparison

| Runtime | Model | Setting | Time | Image |
|---|---|---|---|---|
| ComfyUI (PyTorch MPS) | Edit 2511 BF16 | --fp16-vae / 4-step + LoRA | 3:59 | OK |
| ComfyUI (PyTorch MPS) | Edit 2511 BF16 | --fp16-vae --fp16-unet / 4-step + LoRA | 2:33 | OK |
| mflux (MLX) | Edit 2509 Q8 | 4-step, no LoRA | 2:18 | blurry |
| mflux (MLX) | Edit 2509 Q8 | 20-step, no LoRA | 10:44 | OK |
| mflux (MLX) | Edit 2509 Q8 | 4-step + Lightning LoRA | 2:28 | OK |

mflux + Lightning LoRA + 8-bit quantization achieves nearly identical speed to ComfyUI (2:28 vs 2:33) while completely bypassing the PyTorch MPS path. No black images, no BF16 slowdown.

Currently only Edit 2509 is supported, but when Edit 2511 support arrives for mflux, the 2511 Lightning LoRA should become usable too. Integrating via custom node means no need to rebuild the UI.

Final Verdict

| | ComfyUI (best) | mflux + LoRA |
|---|---|---|
| Time | 2:33 | 2:28 |
| Quality | OK | OK |
| Black image risk | Occurs with FP16 | None |
| BF16 performance hit | Yes | None |
| Patch required | 3 files | None |
| Model | Edit 2511 | Edit 2509 |

Speed is nearly identical, but mflux avoids PyTorch MPS path issues. ComfyUI also carries the risk of patches needing to be reapplied after every update. Whether Edit 2509 vs 2511 has meaningful editing precision differences needs real-world comparison.