Tested LFM2.5-1.2B-JP-202606 on M1 Max 64GB. llama.cpp Q4_K_M: 208 tok/s decode, JSON intact, model name hallucinated (LFM→FDM). Q8_0: 157 tok/s, no hallucination. Tool calls broken via GGUF.
oMLX 0.3.9.dev2 release notes from the angle of Codex/Copilot on Mac local LLMs: Gemma 4 VLM MTP, DFlash, omlx launch copilot, SSD KV cache — what each changes for agent workflows.
Tested Klein 9B + 9B NSFW LoRA on M1 Max 64GB via mflux 0.17.5: 1m51s/512, 5m37s/1024 q4, 224/224 LoRA keys match, NSFW prompts uncensored, Japanese subjects work with helper tokens.
Tested Gemma 4 MTP drafter on M1 Max 64GB with mlx-vlm 0.5.0. Only the 26B A4B MoE got +13%; 31B Dense and E4B got slower. Code gen vs short haiku prompts flip the result.
Klein 4B / 9B / Base LoRAs aren't cross-compatible — a 9B NSFW LoRA throws 'lora key not loaded' on mflux's 4B path. The variant map, what mflux runs today, and where the working hands-on test lives.
Hands-on benchmark of FLUX.2 Klein 4B on M1 Max 64GB using mflux (MLX) and iris.c (pure C + Metal). A counter to Pruna AI's H100-only tutorial — measuring how fast Apple Silicon actually gets there.
Hands-on running inclusionAI Ling-flash-2.0 (100B / 6.1B active, MXFP4 quant, 54.7GB) on SwiftLM via mlx-swift-lm on an M1 Max 64GB. Covers bailing_moe + MXFP4 support check in mlx-swift, the startup surprise, and what --stream-experts actually saves.
A hands-on build and run of the Swift-based LLM inference server SwiftLM on an M1 Max 64GB. Covers Qwen3.6-35B-A3B and Qwen3.5-122B-A10B, with the same BST, BBS, and persona tests used in the existing Ollama and MLX-lm write-ups.
Tried Qwen3.6-27B on both Ollama and MLX. Ollama couldn't load the VL-projector-embedded GGUF, MLX ran it at 11 tok/s. On the side, running 35B-A3B under MLX was roughly 2× faster than the Ollama GGUF. Also had both models build a BBS to gauge intent handling.
A three-link chain of mmap → MTLBuffer(bytesNoCopy) → Wasmtime MemoryCreator that makes a Wasm linear memory share the same physical bytes as a Metal GPU buffer. Llama 3.2 1B runs at 9ms/token on M1.
Based on EE Times' interview with AMD AI Software VP Anush Elangovan, we assess the ROCm vs CUDA ecosystem gap. Includes hands-on experience with ROCm breaking four times on Strix Halo, plus practical guidance on choosing between NVIDIA, AMD, and Apple Silicon.