Tracking Down Why Qwen Image Edit Started Taking 10 Minutes After a ComfyUI Update
In a previous article, I was generating images with Qwen Image Edit in about 80 seconds on an M1 Max 64GB.
Then I updated ComfyUI, and the same environment started taking nearly 10 minutes. I thought changing startup options might fix it, but that wasn’t the issue.
Environment
- Mac Studio M1 Max 64GB
- macOS 26.3 (Tahoe)
- Python 3.12.12
- PyTorch 2.10.0
- ComfyUI v0.16.4
Initial Suspicion: --gpu-only
When I wrote the Qwen article, the startup command was just python main.py --fp16-vae. After that, I’d added --gpu-only to an alias for Illustrious speed improvements.
```shell
python main.py --gpu-only --fp16-vae
```
--gpu-only is unnecessary on Apple Silicon (CPU and GPU share the same physical memory with unified memory), so I removed it, but sampling time barely changed. Not a startup option issue.
Benchmark Reveals: BF16 Is 2x Slower on MPS
Nothing was improving despite trying various things, so I benchmarked matmul performance directly on MPS.
```python
import torch, time

device = torch.device('mps')
a = torch.randn(4096, 4096, dtype=torch.bfloat16, device=device)
b = torch.randn(4096, 4096, dtype=torch.bfloat16, device=device)
torch.mm(a, b); torch.mps.synchronize()  # warm-up
start = time.perf_counter()
for _ in range(10):
    torch.mm(a, b)
torch.mps.synchronize()  # wait for the GPU before reading the clock
print(f"{(time.perf_counter() - start) / 10 * 1000:.1f} ms per matmul")
```
| dtype | 4096x4096 matmul | Relative speed |
|---|---|---|
| FP16 | 14.5ms | 1.0x (fastest) |
| FP32 | 15.9ms | 1.1x |
| BF16 | 30.3ms | 2.1x (slowest) |
BF16 is 2x slower than FP16. Even slower than FP32. Same result on PyTorch 2.4.1, so it’s not a version issue — MPS’s BF16 implementation itself is slow.
M1/M2/M3 chips lack hardware acceleration for BF16 (native support starts with M4), so it’s running through software emulation.
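Why can BF16 be emulated at all? Because a BF16 value is literally the top 16 bits of an FP32: the same 8-bit exponent (so the same range), but only 7 mantissa bits of precision. A pure-Python sketch of the layout (truncation only; real hardware conversion rounds to nearest):

```python
import struct

# BF16 = top 16 bits of an IEEE-754 float32: same exponent, 7 mantissa bits.
def to_bf16_bits(x: float) -> int:
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16  # drop the low 16 mantissa bits

def from_bf16_bits(b: int) -> float:
    (x,) = struct.unpack(">f", struct.pack(">I", b << 16))
    return x

print(from_bf16_bits(to_bf16_bits(3.141592653589793)))  # prints 3.140625
```

This is why an FP32 unit can fake BF16 cheaply in terms of correctness, but on M1–M3 every BF16 op still pays the software conversion overhead.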
Qwen Image Edit’s diffusion model runs with model weight dtype torch.bfloat16. Meaning all operations run through this 2x-slower BF16.
Forcing FP16 and the Results
I tried --fp16-unet to force FP16.

Completely black output. Since macOS 14.5, PyTorch’s MPS backend has a bug where FP16 Attention outputs NaN. ComfyUI handles this:
```python
# comfy/model_management.py (excerpt)
def force_upcast_attention_dtype():
    upcast = False
    macos_version = mac_version()
    if macos_version is not None and ((14, 5) <= macos_version):
        upcast = True
    if upcast:
        return {torch.float16: torch.float32}  # upcast FP16 Attention to FP32
    return None
```
FP16 model → upcast only Attention to FP32 → correct image should appear… but for some models (especially AIO models loaded with CheckpointLoaderSimple), upcast doesn’t work and you get black images.
Root Cause: Investigating ComfyUI Commits
Digging through git log, I found two relevant commits.
Commit 1: 96d891cb (2025-02-24)
```python
# Before: upcast all Attention dtypes to FP32
return torch.float32
# After: only upcast FP16 to FP32, leave BF16 as-is
return {torch.float16: torch.float32}
```
Commit 2: 6e28a464 (2025-06-10)
```python
# Before: only upcast for macOS 14.5–15.x (exclude 16+)
if (14, 5) <= macos_version < (16,):
# After: upcast for all macOS 14.5+
if (14, 5) <= macos_version:
```
My initial hypothesis was “BF16 Attention’s FP32 upcast was removed, causing slowdown,” but when I actually re-enabled BF16→FP32 upcast, it got 1.6x slower (5:29 vs 3:59). The Attention intermediate tensors are massive, so the conversion cost and memory bandwidth increase become the bottleneck. Benchmark numbers don’t map directly.
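Some rough arithmetic shows why the upcast hurts despite the favorable matmul benchmark: attention scores grow as N² in sequence length, so upcasting them to FP32 doubles a very large amount of memory traffic. The head count and sequence length below are illustrative round numbers, not Qwen Image Edit's actual dimensions:

```python
# Size of a single attention score tensor of shape [heads, N, N] per dtype.
# heads and N are illustrative placeholders, not the model's real dims.
heads, N = 24, 8192
for name, bytes_per in [("FP16", 2), ("FP32", 4)]:
    gib = heads * N * N * bytes_per / 2**30
    print(f"{name}: {gib:.1f} GiB per score tensor")  # 3.0 vs 6.0 GiB
```

At these sizes, doubling every intermediate tensor easily swamps whatever the faster FP32 matmul kernel saves.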
So the reason it was fast before wasn’t FP32 upcast. The model was probably running in FP16 before. FP32 upcast was for preventing black images, not the speed source.
At some point during ComfyUI updates, Qwen Image Edit’s default inference dtype changed from FP16 to BF16, which combined with M1–M3’s lack of BF16 hardware support to cause the slowdown — that’s the most coherent explanation.
Full Test Results
Split model (qwen_image_edit_2511_bf16.safetensors / 38GB)
| Setting | Attention | 1 step | Sampling (4 steps) | Image |
|---|---|---|---|---|
| --fp16-vae | sub-quadratic (BF16) | ~51s | 3:59 | OK |
| --fp16-vae --fp16-unet | sub-quadratic (FP32 upcast) | ~38s | 2:33 | OK |
| --fp16-vae --fp16-unet --use-split-cross-attention | split (FP32 upcast) | ~128s | gave up | - |
| --fp16-vae + upcast disabled | sub-quadratic (FP16) | ~25s | 1:40 | black |

AIO model (Qwen-Rapid-AIO-NSFW-v16 / 26GB FP8)
FP8 models aren’t supported on MPS, so a patch to convert to BF16 on load is required.
| Conversion target | 1 step | Sampling (4 steps) | Image |
|---|---|---|---|
| FP16 | ~25s | 1:59 | black |
| BF16 | ~55s | 3:40 | OK |

GGUF (qwen-image-edit-2511-Q8_0.gguf / 20GB)
| Setting | 1 step | Sampling (4 steps) | Image |
|---|---|---|---|
| --fp16-vae | ~48s | 3:12 | blurry |

Current Best Settings
```shell
python main.py --fp16-vae --fp16-unet
```
Use with the split model (qwen_image_edit_2511_bf16.safetensors). FP16 linear + FP32 Attention upcast combination gives 2:33 sampling.
Doesn’t get back to 80 seconds, but a big improvement from 10 minutes.
Patch for AIO Models
FP8 models can’t run directly on MPS, so they need to be converted to BF16 on load. Modify these 2 files:
comfy/sd.py
Add after weight_dtype in the load_state_dict_guess_config function:
```python
# MPS does not support float8 types - convert FP8 weights to BF16 upfront
if model_management.is_device_mps(load_device):
    fp8_types = model_management.FLOAT8_TYPES
    converted = False
    for k in sd:
        if sd[k].dtype in fp8_types:
            sd[k] = sd[k].to(torch.bfloat16)
            converted = True
    if converted:
        weight_dtype = comfy.utils.weight_dtype(sd, diffusion_model_prefix)
        logging.info("Converted FP8 weights to BF16 for MPS compatibility")
```
comfy/model_management.py
Add at the start of the cast_to function:
```python
# MPS does not support float8 types - cast to bf16 on CPU first
if device is not None and is_device_mps(device) and weight.dtype in FLOAT8_TYPES:
    if dtype is None or dtype in FLOAT8_TYPES:
        dtype = torch.bfloat16
    weight = weight.to(dtype=dtype)
    copy = False
```
Note: Converting to FP16 produces black images. Always use BF16.
comfy/supported_models.py
Add FP16 to supported_inference_dtypes in the QwenImage class (required when using --fp16-unet):
```python
supported_inference_dtypes = [torch.bfloat16, torch.float16, torch.float32]
```
What the AI Assistants Suggested
I also asked Gemini and ChatGPT about this issue. I verified every suggestion on real hardware:
Gemini’s Suggestions
| Suggestion | Actual result |
|---|---|
| --force-fp16 | Black images (MPS FP16 Attention bug) |
| --use-split-cross-attention | 2.5x slower (128s/step) |
| --fp32-vae | Unnecessary. --fp16-vae works fine |
| --disable-smart-memory | No effect in Mac SHARED mode |
| GGUF Q4 model | Same speed, quality degraded |
| TORCH_MATH_DISABLE_SDPA=1 | Non-existent environment variable (hallucination) |
Three exchanges, zero reproducible improvements from real testing.
ChatGPT’s Suggestions
ChatGPT’s analysis more accurately reflected MPS/BF16 constraints. Particularly “M1–M3 lack BF16 hardware support” and “ComfyUI commits are the cause” were on target.
The only concrete suggestion was "upcasting BF16 Attention to FP32 should speed it up," but as measured, it actually got 1.6x slower. The matmul benchmark does favor FP32 over BF16, yet once the massive Attention intermediate tensors dominate, the conversion cost and extra memory bandwidth outweigh the compute gain. Theory and measurement don't always agree.
"It's Slow Because It's Flux-Based": Not Quite
ComfyUI recognizes Qwen Image Edit as model_type FLUX. From that, it’s tempting to conclude “it’s slow because it’s Flux-based,” but that’s not accurate.
- FLUX.1 Kontext: 12B parameter rectified flow transformer
- Qwen Image Edit: 20B parameter MMDiT
Same “flow-based edit model” category but completely different scales. Qwen Image Edit isn’t “slow because it’s flow-based” — it’s slow because a 20B-class model has poor compatibility with MPS/BF16. The bottleneck is the dtype constraint, not the model family.
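The scale difference is easy to put in bytes. Illustrative arithmetic only (real checkpoints also carry the text encoder, VAE, and extra tensors, which is why the 20B BF16 figure lands just under the 38GB split checkpoint):

```python
# Approximate diffusion-weight footprint per parameter count and dtype.
for label, params_b in [("FLUX.1 Kontext", 12), ("Qwen Image Edit", 20)]:
    params = params_b * 1e9
    for dtype, bytes_per in [("BF16/FP16", 2), ("FP8", 1)]:
        print(f"{label} ({params_b}B) @ {dtype}: "
              f"{params * bytes_per / 2**30:.0f} GiB")
```

On a 64GB unified-memory machine, the 20B model at 2 bytes/param leaves little headroom, which is exactly where dtype choice starts to dominate.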
Summary
| Problem | Cause |
|---|---|
| Generation takes 10 minutes | MPS BF16 is 2x slower than FP16 (M1–M3 lack BF16 hardware) |
| FP16 produces black images | MPS FP16 Attention bug since macOS 14.5 |
| It was fast before | ComfyUI update changed Qwen Image Edit inference dtype from FP16 to BF16 |
With ComfyUI + PyTorch MPS at this point, --fp16-vae --fp16-unet at 2:33 is the ceiling.
Next: MLX
Many of today’s problems stem from the PyTorch MPS path rather than Qwen Image Edit itself. Switching to MLX, Apple Silicon’s native framework, could potentially avoid both BF16 emulation slowness and the FP16 Attention bug.
The MLX community has published a Qwen-Image 8bit quantized version, with reported speeds of ~8.5 seconds/step on M-series Macs. mflux also has an open issue for Qwen Image Edit support.
That said, “MLX will definitely be faster” isn’t guaranteed — whether it produces the same image quality and editing precision as the ComfyUI workflow is a separate question. A realistic approach would be keeping ComfyUI as the overall hub while running separate MLX benchmarks just for Qwen Image Edit.
Either way, MLX is near the top of viable options left for Mac. So I actually tried it.
mflux Real-World Results
Running Qwen Image Edit 2509 with mflux v0.17.2. Same M1 Max 64GB environment, 4 steps.
```shell
# Install
uv tool install mflux

# Run (8-bit quantized)
mflux-generate-qwen-edit \
  --image-paths input.png \
  --prompt "Change to summer clothes" \
  --steps 4 --guidance 1.0 \
  --quantize 8 \
  --output output.png
```
| Setting | Steps | Sampling | Image |
|---|---|---|---|
| BF16 full (58GB) | 4 | 4:08 | blurry |
| 8-bit quantized | 4 | 2:18 | blurry |
| 4-bit quantized | 4 | 2:12 | collapsed |
| 8-bit quantized | 20 | 10:44 | OK |
4-step results had unusable quality. ComfyUI uses a Lightning LoRA (4-step optimized) + ModelSamplingAuraFlow + CFGNorm nodes setup, none of which have equivalents in mflux’s CLI. Getting quality from 4 steps without LoRA isn’t possible.
20-step Q8 hits usable quality, but takes 10:44.



ComfyUI Custom Node Integration
Made mflux usable as ComfyUI nodes rather than CLI. A thin wrapper calling mflux via subprocess.
LoadImage → MfluxQwenImageEdit → SaveImage
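A minimal sketch of the wrapper idea. `build_command` and `run_edit` are hypothetical helper names, and the real ComfyUI node API (INPUT_TYPES, FUNCTION, etc.) is omitted; only the mflux CLI flags match the commands shown above:

```python
import subprocess

def build_command(image_path: str, prompt: str, out_path: str,
                  steps: int = 4, quantize: int = 8) -> list[str]:
    # Assemble the same mflux CLI invocation used earlier, as an argv list.
    return [
        "mflux-generate-qwen-edit",
        "--image-paths", image_path,
        "--prompt", prompt,
        "--steps", str(steps),
        "--guidance", "1.0",
        "--quantize", str(quantize),
        "--output", out_path,
    ]

def run_edit(image_path: str, prompt: str, out_path: str) -> str:
    # check=True surfaces mflux failures as exceptions inside the node.
    subprocess.run(build_command(image_path, prompt, out_path), check=True)
    return out_path
```

Running mflux in a subprocess keeps its MLX runtime fully isolated from ComfyUI's PyTorch process, which is what lets the wrapper bypass the MPS path entirely.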

20-step Q8 converted from winter to summer outfit while preserving character appearance. PyTorch MPS path is completely bypassed — no black images, no BF16 slowdown.
Solving the 4-Step Quality Issue with Lightning LoRA
mflux supports external LoRA application. Applying the Lightning LoRA that Rapid-AIO bakes in at inference time should enable 4-step quality.
```shell
mflux-generate-qwen-edit \
  --image-paths input.png \
  --prompt "Change to summer clothes" \
  --steps 4 --guidance 1.0 \
  --quantize 8 \
  --lora-paths "Qwen-Image-Edit-Lightning-4steps-V1.0-bf16.safetensors" \
  --lora-scales 1.0 \
  --output output.png
```

LoRA application dramatically improved quality. 720 layers matched, sampling 2:28. Completely different from the blurry 4-step without LoRA — character maintained while properly converting from winter to summer outfit.
Full Comparison
| Runtime | Model | Setting | Time | Image |
|---|---|---|---|---|
| ComfyUI (PyTorch MPS) | Edit 2511 BF16 | --fp16-vae / 4-step+LoRA | 3:59 | OK |
| ComfyUI (PyTorch MPS) | Edit 2511 BF16 | --fp16-vae --fp16-unet / 4-step+LoRA | 2:33 | OK |
| mflux (MLX) | Edit 2509 Q8 | 4-step no LoRA | 2:18 | blurry |
| mflux (MLX) | Edit 2509 Q8 | 20-step no LoRA | 10:44 | OK |
| mflux (MLX) | Edit 2509 Q8 | 4-step + Lightning LoRA | 2:28 | OK |
mflux + Lightning LoRA + 8-bit quantization achieves nearly identical speed to ComfyUI (2:28 vs 2:33) while completely bypassing the PyTorch MPS path. No black images, no BF16 slowdown.
Currently only Edit 2509 is supported, but when Edit 2511 support arrives for mflux, the 2511 Lightning LoRA should become usable too. Integrating via custom node means no need to rebuild the UI.
Final Verdict
| | ComfyUI (best) | mflux + LoRA |
|---|---|---|
| Time | 2:33 | 2:28 |
| Quality | OK | OK |
| Black image risk | Occurs with FP16 | None |
| BF16 performance hit | Yes | None |
| Patch required | 3 files | None |
| Model | Edit 2511 | Edit 2509 |
Speed is nearly identical, but mflux avoids PyTorch MPS path issues. ComfyUI also carries the risk of patches needing to be reapplied after every update. Whether Edit 2509 vs 2511 has meaningful editing precision differences needs real-world comparison.
References
- mflux — MLX-native image generation for Apple Silicon
- mflux Qwen Image Edit 2511 support issue #298
- Qwen-Image-Edit-2509 — Official Qwen Edit 2509 model
- mlx-community/Qwen-Image-2512-8bit — MLX quantized version
- ComfyUI commit 96d891cb — BF16 Attention upcast change
- ComfyUI commit 6e28a464 — macOS FP16 Attention workaround expanded