Tested LFM2.5-1.2B-JP-202606 on M1 Max 64GB. llama.cpp Q4_K_M: 208 tok/s decode, JSON intact, model name hallucinated (LFM→FDM). Q8_0: 157 tok/s, no hallucination. Tool calls broken via GGUF.
35M linear projection replaces E4B's 150M 16-layer Vision Encoder. Bidirectional attention in the 48-layer LLM absorbs patch features. Comparison with Fuyu, EVE, EVEv2, and Mono-InternVL.
Hands-on with Tencent Hy-MT2 1.8B Q4_K_M (1.08GB) on M1 Max 64GB via llama-server. JSON, SRT, HTML, glossary, and minority-language prompts with full input-output pairs. The 1.25bit 440MB build does not load on stock llama.cpp 8990, and 30B-A3B (hy_v3) is not yet supported on the Mac route.
oMLX 0.3.9.dev2 release notes, read from the perspective of running Codex/Copilot against local LLMs on a Mac: Gemma 4 VLM MTP, DFlash, omlx launch copilot, and SSD KV cache, and what each changes for agent workflows.
Out-of-bounds read in Ollama's GGUF loader before 0.17.1. If your Ollama API is network-accessible, a crafted model file can exfiltrate env vars, API keys, system prompts, and conversation fragments from process memory.
After Xiaomi MiMo-V2.5's weights went public, I checked whether it runs on Mac/ROCm or on cloud GPU (RunPod/GCE). It's still rough on local hardware, but RunPod's 4x H200 runs it for ~$14/hr and GCE Spot H100 brings it down to ~$1.6/hr.
Hands-on running inclusionAI Ling-flash-2.0 (100B / 6.1B active, MXFP4 quant, 54.7GB) on SwiftLM via mlx-swift-lm on an M1 Max 64GB. Covers bailing_moe + MXFP4 support check in mlx-swift, the startup surprise, and what --stream-experts actually saves.
A hands-on build and run of the Swift-based LLM inference server SwiftLM on an M1 Max 64GB. Covers Qwen3.6-35B-A3B and Qwen3.5-122B-A10B, with the same BST, BBS, and persona tests used in the existing Ollama and MLX-lm write-ups.
Two open-weight Chinese MoEs landed within 24 hours: Ant Ling-2.6-flash (104B/7.4B active, 7x token-efficiency claim) and Tencent Hy3-preview (295B/21B active, frontier-tier open weights). Specs, licenses, and how they line up against DeepSeek-V3 and GLM-4.5.
Tried Qwen3.6-27B on both Ollama and MLX. Ollama couldn't load the VL-projector-embedded GGUF; MLX ran it at 11 tok/s. As a side test, 35B-A3B under MLX ran roughly 2× faster than the Ollama GGUF. I also had both models build a BBS to gauge intent handling.
Hands-on with Qwen3.6-35B-A3B (23GB 4bit GGUF) on M1 Max 64GB via Ollama 0.20.6. Generation speed stays at 27 tok/s, the same as Qwen3.5-35B-A3B, but the same prompt produces 13× more thinking tokens. Multi-turn behavior, persona handling, and a three-tier NSFW probe included.