#MPS

8 articles

Tech · Jun 25, 2026 · 13 min

Krea 2 on M1 Max ComfyUI: Turbo runs in 3.5 min, Raw NaNs to black at 47 min

Tested Krea 2 Raw and Turbo in ComfyUI on an M1 Max 64GB. Turbo in bf16 runs at ~3.5 min/image, fp8 is rejected on MPS, and Raw's 52-step run with CFG hits NaNs and outputs a black image after 47 min. Also covers quality, NSFW behavior, and license.

AI Image Generation ComfyUI Apple Silicon MPS Experiment

Tech · Jun 18, 2026 · 8 min

Boogu-Image-0.1 on M1 Max ComfyUI: fp8 fails on MPS, bf16 works at ~70s/image

Tested Boogu-Image-0.1 in ComfyUI on an M1 Max 64GB: the fp8 build is rejected by MPS, so bf16 is mandatory, and Turbo runs at ~70s per 1024px image. Notes on photoreal vs anime, bilingual text, and where NSFW stops.

AI Image Generation ComfyUI Apple Silicon MPS Experiment

Tech · May 2, 2026 · 20 min

Running Qwen-Scope's SAE on M1 Max 64GB to Extract a Japanese-Language Feature

A hands-on log of running Qwen-Scope's Sparse Autoencoder locally on M1 Max 64GB with Qwen3-8B-Base, extracting feature IDs that discriminate between Japanese, English, code, and Chinese from a single middle layer.

AI LLM Qwen Interpretability Experiment Apple Silicon MPS

Tech · Apr 21, 2026 (updated) · 19 min

TRELLIS.2 trellis-mac port tested on M1 Max 64GB: setup, generation time, MPS bottlenecks

Hands-on run of trellis-mac (the CUDA-free port of TRELLIS.2) on M1 Max 64GB. Covers setup via uv with PyTorch 2.11.0 on MPS, the mps_compat.py patches applied, and the measured generation time against the M4 Pro 24GB 3.5-minute reference, plus where the bottlenecks land on Apple Silicon.

AppleSilicon MPS PyTorch 3D ML Experiment

Tech · Apr 20, 2026 (updated) · 9 min

Running TRELLIS.2 on Apple Silicon MPS: a CUDA-free port

A port that replaces TRELLIS.2's CUDA-only libraries (flash_attn, nvdiffrast, sparse 3D convolution) with pure-PyTorch equivalents and runs Microsoft's 4B image-to-3D model on an M4 Pro in about 3.5 minutes without any NVIDIA GPU.

AppleSilicon MPS PyTorch 3D Local LLM ML

Tech · Mar 26, 2026 (updated) · 11 min

Qwen Image Edit on M1 Max went 80s→10min after a ComfyUI update: MPS BF16 is the cause

Diagnosed a 7x speed regression for Qwen Image Edit on M1 Max 64GB ComfyUI after an update. Root cause: MPS BF16 matmul runs ~2x slower than FP16, compounded by an FP16 attention bug. Benchmark numbers and the working fix.

ComfyUI Qwen Apple Silicon MPS PyTorch Experiment

Tech · Mar 23, 2026 · 7 min

Flash-MoE: Running a 397B-parameter model on a 48GB MacBook

Flash-MoE is a C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro M3 Max at 4.36 tokens/s. With expert streaming from SSD and hand-written Metal shaders, it fits the 209GB model into a 48GB memory budget.

Inference MPS LLM Qwen MoE Local LLM

Tech · Feb 13, 2026 (updated) · 5 min

Fixing Corrupted ComfyUI Upscale Output on Mac MPS with contiguous()

Upscaling images loaded via the Load Image node was producing garbled output. The fix is a one-line contiguous() patch to comfy/utils.py that resolves the non-contiguous tensor issue. A 2026-04-29 follow-up covers how a ComfyUI update wiped the patch and brought the bug back, with the upstream PyTorch issue and a recurrence-detection snippet.

ComfyUI Apple Silicon PyTorch MPS Experiment