Hypura breaks away from llama.cpp’s mmap design and streams even dense models with a three-tier NVMe placement, while TurboQuant eliminates quantization-constant overhead via a polar-coordinate transform. Includes a design comparison with Flash‑MoE and a review of scenarios where KV‑cache compression actually helps.
How should memory be allocated in reasoning models? This paper explains the trade-offs among quantization, KV cache, and test-time compute, based on 1,700 experiments.