SwiftLM, an Apple Silicon–only MLX inference server, provides a native Metal implementation of TurboQuant V2+V3 hybrid KV‑cache compression and NVMe SSD expert streaming.
Ollama 0.19 switches the Apple Silicon backend to MLX, achieving 1,810 tokens/s prefill and 112 tokens/s decode. NVFP4 quantization support and cache improvements landed at the same time.
Diagnosed a 7x speed regression for Qwen Image Edit in ComfyUI on an M1 Max 64GB after an update. Root cause: MPS BF16 matmul runs ~2x slower than FP16, compounded by an FP16 attention bug. Includes benchmark numbers and the working fix.
Hypura breaks away from llama.cpp’s mmap design and streams even dense models using three-tier NVMe placement, while TurboQuant eliminates quantization-constant overhead via a polar-coordinate transform. Includes a design comparison with Flash‑MoE and a review of scenarios where KV‑cache compression actually helps.
Local video generation test on M1 Max 64GB: FP8 fails on Metal, GGUF gets Wan 2.2 running at 82 minutes for a 2-second clip, and LTX-2 hits NaN or unusable KSampler output on MPS.
Upscaling images loaded via the Load Image node was producing garbled output. The fix was a one-line patch to comfy/utils.py that addresses a non-contiguous tensor issue. A 2026-04-29 follow-up covers a ComfyUI update that wiped the patch and brought the bug back, with the upstream PyTorch issue and a recurrence-detection snippet.
A derivative checkpoint of Z-Image Turbo released on ModelScope. It is tuned for skin texture and film-photography-like aesthetics, and can run on an M1 Max with 64GB.
A look at ACE-Step, the 'Stable Diffusion of music,' covering its architecture, features, installation, and expected performance on Apple Silicon before trying it on an M1 Max.
An overview of Z-Image-Distilled, the distilled fast-inference variant of Z-Image, including how it compares with FLUX.1 Schnell, how it runs on an M1 Max 64GB machine, and LoRA compatibility.
Overview of Black Forest Labs' FLUX.2 Klein 9B model and how it performs on M1/M2/M3/M4 Macs. Covers the key factors behind the CUDA vs MPS performance gap, including memory bandwidth and FP8 quantization.