Fujitsu's PHOTON claims up to 475x over Transformers, but that's tokens/s/GiB (multi-query memory throughput), not faster single responses. What the 1.2B paper tables, the quality drop, and 9-query integration really show.
Flash-MoE is a C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro M3 Max at 4.36 tokens/s. With expert streaming from SSD and hand-written Metal shaders, it fits the 209GB model into a 48GB memory budget.
Redesigned with inference latency as the first priority, Mamba‑3 combines exponential trapezoid discretization, complex‑valued states, and a MIMO structure to reach about 6.9× the speed of a Transformer at 16,384 tokens.
How should memory be allocated in reasoning models? This paper explains the trade-offs among quantization, KV cache, and test-time compute, based on 1,700 experiments.