#MoE

20 articles

Tech · Apr 21, 2026 · updated · 11 min

Qwen3.6-35B-A3B on M1 Max via Ollama 0.20.6: 27 tok/s same as 3.5, but 13× thinking tokens

Hands-on with Qwen3.6-35B-A3B (23GB 4-bit GGUF) on an M1 Max 64GB via Ollama 0.20.6. Generation speed stays at 27 tok/s, the same as Qwen3.5-35B-A3B, but the same prompt produces 13× more thinking tokens. Multi-turn behavior, persona handling, and a three-tier NSFW probe are included.

LLM, Local LLM, Qwen, Ollama, Apple Silicon, MoE, Experiment

Tech · Apr 21, 2026 · updated · 11 min

Qwen3.6-Max-Preview and Kimi K2.6 landed nearly back-to-back — lining up both flagship coding models

Alibaba's Qwen3.6-Max-Preview and Moonshot AI's Kimi K2.6 were released within a 24-hour window on April 20–21, 2026. A side-by-side look at specs, benchmarks, distribution, and agent-side features for the two flagships.

LLM, Qwen, Kimi, Moonshot AI, MoE, Agent, Coding

Tech · Apr 17, 2026 · updated · 10 min

Qwen3.6-35B-A3B pairs Gated DeltaNet with MoE and raises the bar on agentic coding

Alibaba's Qwen team released Qwen3.6-35B-A3B as open weights. A 40-layer hybrid of Gated DeltaNet, Gated Attention, and MoE hits 73.4 on SWE-bench Verified, 37.0 on MCPMark, and 1397 on QwenWebBench.

LLM, Local LLM, Qwen, MoE, Agent, Coding

Tech · Apr 8, 2026 · updated · 8 min

GLM-5.1 (Zhipu, 744B / 40B MoE, MIT): 58.4% SOTA on SWE-Bench Pro, 8h / 6,000+ tool calls without degradation

Zhipu AI's GLM-5.1 is a 744B MoE (40B active, 200K context, MIT license) targeting long-horizon agent tasks. It scores 58.4% on SWE-Bench Pro, a state-of-the-art result that edges out GPT-5.4 and Claude Opus 4.6, and it sustains performance without degradation across 8-hour sessions with 6,000+ tool calls.

AI, LLM, Chinese AI, MoE, Open Model, AI Agent

Tech · Apr 6, 2026 · 11 min

LLM-jp-4-32B-A3B on ROCm + Strix Halo: 41% Faster Than Qwen3.5

Benchmarking NII's LLM-jp-4-32B-A3B-thinking on EVO-X2 (Ryzen AI Max+ 395) with ROCm. 62.9 t/s vs Qwen3.5-35B-A3B's 44.7 t/s. Covers thinking control issues, KV cache trade-offs, knowledge cutoff, Japanese quality comparisons, code generation tests, and training data composition.

AI, LLM, Local LLM, llama.cpp, AMD, ROCm, MoE, Qwen, Experiment

Tech · Apr 3, 2026 · updated · 21 min

Google's Gemma 4 launches in four sizes (E2B–A4B), publishing Gemini 3–derived reasoning under Apache 2.0

Google DeepMind has released Gemma 4: four models—31B dense, 26B MoE (A4B), E4B, and E2B—with a 256K context, multimodal input, tool calling, and support for 140 languages.

AI, LLM, Google, Open Model, MoE, Multimodal, Local LLM

Tech · Apr 2, 2026 · updated · 13 min

SwiftLM is a Swift-based LLM inference server that integrates TurboQuant and SSD streaming into Metal shaders

SwiftLM, an Apple Silicon–only MLX inference server, provides a native Metal implementation of TurboQuant V2+V3 hybrid KV-cache compression and NVMe SSD expert streaming.

Apple Silicon, LLM, MLX, Local LLM, Inference Optimization, KV Cache, MoE, Swift

Tech · Mar 23, 2026 · 7 min

Flash-MoE: Running a 397B-parameter model on a 48GB MacBook

Flash-MoE is a C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro M3 Max at 4.36 tokens/s. With expert streaming from SSD and hand-written Metal shaders, it fits the 209GB model into a 48GB memory budget.

Inference, MPS, LLM, Qwen, MoE, Local LLM