SwiftLM, an Apple Silicon–only MLX inference server, provides a native Metal implementation of TurboQuant V2+V3 hybrid KV‑cache compression and NVMe SSD expert streaming.
Qwen3.5-35B-A3B is an SSM+attention hybrid in which only 10 of its 40 layers use a KV cache. Expanding ctx-size from 4096 to 65536 on llama-server added just 800 MB of VRAM with no speed loss. Includes q8_0 KV-quantization benchmarks and a TurboQuant status update.
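The small VRAM growth follows from the KV cache scaling only with the 10 attention layers, not all 40. A back-of-envelope sketch of that scaling; the head count, head dimension, and q8_0 bytes-per-element here are illustrative assumptions, not figures from the article:

```python
def kv_cache_bytes(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: float) -> float:
    # K and V tensors, one pair per attention layer:
    # 2 * layers * kv_heads * head_dim * ctx * element size
    return 2 * n_attn_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Hypothetical config: 4 KV heads of dim 128, q8_0 at ~1.0625 bytes/element
# (8-bit values plus a per-block fp16 scale). Only the 10 attention layers count.
small = kv_cache_bytes(10, 4, 128, 4096, 1.0625)
large = kv_cache_bytes(10, 4, 128, 65536, 1.0625)
print(f"KV cache growth: {(large - small) / 2**20:.1f} MiB")
```

With these assumed dimensions the 4096-to-65536 expansion costs on the order of hundreds of MiB, consistent in magnitude with the reported 800 MB; a dense 40-layer model would pay roughly four times as much.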
Hypura breaks with llama.cpp’s mmap design and streams even dense models using a three-tier NVMe placement scheme, while TurboQuant eliminates quantization-constant overhead via a polar-coordinate transform. Includes a design comparison with Flash‑MoE and a review of the scenarios where KV‑cache compression actually helps.
Two February 2026 papers on reducing inference cost: Together AI’s Consistency DLM (up to 14.5× faster) and MIT/Harvard’s Attention Matching KV compaction (50× compaction in seconds).