Alibaba's Qwen team released Qwen3.6-35B-A3B as open weights. The 40-layer hybrid of Gated DeltaNet, Gated Attention, and MoE scores 73.4 on SWE-bench Verified, 37.0 on MCPMark, and 1397 on QwenWebBench.
Zhipu AI released GLM-5.1, a 744B MoE (40B active) model achieving a state-of-the-art 58.4% on SWE-Bench Pro. Its standout feature is sustained performance across 8-hour sessions involving 6,000+ tool calls, with no degradation.
A benchmark of NII's LLM-jp-4-32B-A3B-thinking on the EVO-X2 (Ryzen AI Max+ 395) under ROCm: 62.9 t/s, vs. 44.7 t/s for Qwen3.5-35B-A3B. Covers thinking-control issues, KV-cache trade-offs, knowledge cutoff, Japanese output quality comparisons, code generation tests, and training data composition.
Google DeepMind has released Gemma 4: four models—31B dense, 26B MoE (A4B), E4B, and E2B—with a 256K context, multimodal input, tool calling, and support for 140 languages.
SwiftLM, an Apple Silicon–only MLX inference server, provides a native Metal implementation of TurboQuant V2+V3 hybrid KV‑cache compression and NVMe SSD expert streaming.
Flash-MoE is a C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro M3 Max at 4.36 tokens/s. With expert streaming from SSD and hand-written Metal shaders, it fits the 209GB model into a 48GB memory budget.
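The memory math behind the 48GB claim can be sketched with a quick back-of-envelope calculation. The bits-per-parameter figure and the per-token working set below are inferred from the numbers in the blurb (209GB on disk, 397B total, 17B active), not from Flash-MoE's documentation:

```python
# Back-of-envelope: why a 397B-A17B MoE can run in a 48GB budget with
# SSD expert streaming. All figures are derived from the blurb above,
# not from Flash-MoE internals.

TOTAL_PARAMS = 397e9   # total parameters
ACTIVE_PARAMS = 17e9   # parameters activated per token (the "A17B")
MODEL_BYTES = 209e9    # quantized model size on disk

# Effective quantization width implied by the file size.
bits_per_param = MODEL_BYTES * 8 / TOTAL_PARAMS  # roughly 4.2 bits/param

# Working set needed in memory to decode one token: only the routed
# experts (plus shared weights), not all 397B parameters.
active_gb = ACTIVE_PARAMS * (bits_per_param / 8) / 1e9  # roughly 9 GB

print(f"{bits_per_param:.1f} bits/param, ~{active_gb:.1f} GB active per token")
```

Since roughly 9GB of active weights fit comfortably in 48GB, the rest of the budget presumably goes to a resident expert cache and the KV cache; tokens that reuse already-cached experts would avoid the SSD round trip, which is what makes throughput in the tokens-per-second range plausible.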