A hands-on run of the CUDA-free port of TRELLIS.2 (trellis-mac) on an M1 Max 64GB machine, recording setup, generation time, output quality, and where the bottlenecks actually sit.
A port that replaces TRELLIS.2's CUDA-only libraries (flash_attn, nvdiffrast, sparse 3D convolution) with pure-PyTorch equivalents and runs Microsoft's 4B image-to-3D model on an M4 Pro in about 3.5 minutes without any NVIDIA GPU.
Based on EE Times' interview with AMD AI Software VP Anush Elangovan, we assess the ROCm vs CUDA ecosystem gap. Includes hands-on experience with ROCm breaking four times on Strix Halo, plus practical guidance on choosing between NVIDIA, AMD, and Apple Silicon.
Diagnosed a 7x speed regression for Qwen Image Edit on M1 Max 64GB ComfyUI after an update. Root cause: MPS BF16 matmul runs ~2x slower than FP16, compounded by an FP16 attention bug. Benchmark numbers and the working fix.
Upscaling images loaded via the Load Image node was producing garbled output. Fixed it by addressing the non-contiguous tensor issue — a one-line patch to comfy/utils.py. Added a 2026-04-29 follow-up after a ComfyUI update wiped the patch and the bug came back, with the upstream PyTorch issue and a recurrence-detection snippet.