A hands-on log of running Qwen-Scope's Sparse Autoencoder locally on M1 Max 64GB with Qwen3-8B-Base, extracting feature IDs that discriminate between Japanese, English, code, and Chinese from a single middle layer.
A hands-on run of the CUDA-free port of TRELLIS.2 (trellis-mac) on an M1 Max 64GB machine, recording setup, generation time, output quality, and where the bottlenecks actually sit.
A port that replaces TRELLIS.2's CUDA-only libraries (flash_attn, nvdiffrast, sparse 3D convolution) with pure-PyTorch equivalents and runs Microsoft's 4B image-to-3D model on an M4 Pro in about 3.5 minutes without any NVIDIA GPU.
Diagnosed a 7x speed regression for Qwen Image Edit on M1 Max 64GB ComfyUI after an update. Root cause: MPS BF16 matmul runs ~2x slower than FP16, compounded by an FP16 attention bug. Benchmark numbers and the working fix.
Flash-MoE is a C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro M3 Max at 4.36 tokens/s. With expert streaming from SSD and hand-written Metal shaders, it fits the 209GB model into a 48GB memory budget.
Upscaling images loaded via the Load Image node was producing garbled output. Fixed it by addressing the non-contiguous tensor issue — a one-line patch to comfy/utils.py. Added a 2026-04-29 follow-up after a ComfyUI update wiped the patch and the bug came back, with the upstream PyTorch issue and a recurrence-detection snippet.