Tested the Gemma 4 MTP drafter on an M1 Max (64 GB) with mlx-vlm 0.5.0. Only the 26B A4B MoE sped up (+13%); the 31B Dense and the E4B both got slower. Code-gen vs. short haiku prompts flip the result, presumably because the drafter's acceptance rate is much higher on predictable code tokens.
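The measurement boils down to a tokens/sec A/B over the two prompt classes. A minimal harness sketch; the `generate(prompt, max_tokens=...)` callable returning a token count is a stand-in for whatever entry point the runtime exposes, not mlx-vlm 0.5.0's confirmed speculative-decoding API:

```python
import time

# One prompt per workload class; speculative speedup is prompt-dependent.
PROMPTS = {
    "code_gen": "Write a Python function that parses ISO-8601 timestamps.",
    "haiku": "Write a haiku about autumn.",
}

def tokens_per_sec(generate, prompt, max_tokens=256):
    # `generate` is a hypothetical callable that returns the token count.
    start = time.perf_counter()
    n_tokens = generate(prompt, max_tokens=max_tokens)
    return n_tokens / (time.perf_counter() - start)

def compare(generate_baseline, generate_drafted):
    # Baseline decode vs. MTP-drafted decode, per prompt class.
    for name, prompt in PROMPTS.items():
        base = tokens_per_sec(generate_baseline, prompt)
        spec = tokens_per_sec(generate_drafted, prompt)
        print(f"{name}: {base:.1f} -> {spec:.1f} tok/s ({spec / base - 1:+.0%})")
```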
Reading Google's MTP drafter docs, the vLLM recipes, and the AI for Developers guide. The 3x claim holds for the 31B Dense, but the 26B A4B MoE stalls at batch 1: speculative-decoding verification routes each candidate token to its own experts, so every verified draft block pulls in extra expert weights.
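A toy routing simulation makes the mechanism concrete: with uniform top-k routing, the number of distinct experts (and thus weight bytes) loaded per step grows almost linearly with the draft block size. The expert count and top-k below are illustrative, not Gemma's actual config:

```python
import random

def unique_experts_touched(n_tokens, n_experts=64, top_k=2, trials=10_000):
    # Average number of distinct experts activated when `n_tokens`
    # candidate tokens each route to `top_k` of `n_experts` experts.
    total = 0
    for _ in range(trials):
        touched = set()
        for _ in range(n_tokens):
            touched.update(random.sample(range(n_experts), top_k))
        total += len(touched)
    return total / trials

for k in (1, 4, 8):  # 1 = plain decode; 4/8 = typical draft block sizes
    print(f"{k} candidate tokens -> ~{unique_experts_touched(k):.1f} experts loaded")
```

At batch 1 decode is memory-bandwidth bound, so several times more expert-weight traffic per step can wipe out the acceptance-rate win.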