Hands-on run of trellis-mac (the CUDA-free port of TRELLIS.2) on M1 Max 64GB. Setup via uv with PyTorch 2.11.0 MPS, applied mps_compat.py patches, and recorded actual generation time vs the M4 Pro 24GB 3.5-minute reference, plus where the bottlenecks land on Apple Silicon.
A port that replaces TRELLIS.2's CUDA-only libraries (flash_attn, nvdiffrast, sparse 3D convolution) with pure-PyTorch equivalents and runs Microsoft's 4B image-to-3D model on an M4 Pro in about 3.5 minutes without any NVIDIA GPU.
A three-link chain of mmap → MTLBuffer(bytesNoCopy) → Wasmtime MemoryCreator that makes a Wasm linear memory share the same physical bytes as a Metal GPU buffer. Llama 3.2 1B runs at 9ms/token on M1.