#Inference

4 articles

Tech Mar 23, 2026 7 min

Flash-MoE: Running a 397B-parameter model on a 48GB MacBook

Flash-MoE is a C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro M3 Max at 4.36 tokens/s. With expert streaming from SSD and hand-written Metal shaders, it fits the 209GB model into a 48GB memory budget.

Inference MPS LLM Qwen MoE Local LLM

Tech Mar 22, 2026 13 min

Together AI announces Mamba-3: ~7x faster long-context inference than Transformers, with complex-valued SSM

Redesigned with inference latency as the first priority, Mamba-3 combines exponential trapezoid discretization, complex-valued states, and a MIMO structure to run about 6.9× faster than a Transformer at a sequence length of 16,384 tokens.

SSM LLM Inference Architecture

Tech Feb 2, 2026 4 min

Power Sampling: unlocking LLM reasoning without reinforcement learning

A look at how changing the inference-time sampling strategy can improve LLM reasoning performance without retraining with reinforcement learning.

LLM Inference Reinforcement Learning Sampling AI

Tech Jan 30, 2026 5 min

Not All Bits Are Equal: There is no universal solution for memory allocation in reasoning models

How should memory be allocated in reasoning models? The paper covered here draws on 1,700 experiments to explain the trade-offs among quantization, KV cache, and test-time compute.

LLM Quantization Inference Research