#KV Cache

5 articles

Tech · Jun 29, 2026 · 7 min

Fujitsu's PHOTON 475x is multi-query memory throughput, not single-shot speed

Fujitsu's PHOTON claims up to 475x over Transformers, but that's tokens/s/GiB (multi-query memory throughput), not faster single responses. What the 1.2B paper tables, the quality drop, and 9-query integration really show.

AI LLM Inference Japanese AI KV Cache

Tech · Apr 2, 2026 (updated) · 13 min

SwiftLM is a Swift-based LLM inference server that integrates TurboQuant and SSD streaming into Metal shaders

SwiftLM, an Apple Silicon–only MLX inference server, provides a native Metal implementation of TurboQuant V2+V3 hybrid KV‑cache compression and NVMe SSD expert streaming.

Apple Silicon LLM MLX Local LLM Inference Optimization KV Cache MoE Swift

Tech · Mar 31, 2026 (updated) · 8 min

Qwen3.5-35B-A3B on llama-server (Vulkan + Strix Halo): 4K → 65K context for only 800MB more VRAM

Qwen3.5-35B-A3B is an SSM+Attention hybrid where only 10 of 40 layers consume KV cache. Going from ctx-size 4096 to 65536 on llama-server + Vulkan added just 800MB VRAM with zero throughput loss. Tested on Strix Halo (Ryzen AI Max+ 395), with q8_0 KV quant benchmarks.

LLM Local LLM llama.cpp AMD Vulkan KV Cache Qwen Benchmark

Tech · Mar 25, 2026 · 17 min

Hypura’s NVMe Streaming and TurboQuant’s KV Cache Quantization

Hypura breaks away from llama.cpp’s mmap design and streams even dense models with a three-tier NVMe placement, while TurboQuant eliminates quantization-constant overhead via a polar-coordinate transform. Includes a design comparison with Flash‑MoE and a review of scenarios where KV‑cache compression actually helps.

LLM Local LLM Quantization Apple Silicon Inference Optimization KV Cache Rust

Tech · Feb 20, 2026 (updated) · 13 min

Accelerating LLM Inference: CDLM and Attention Matching KV Compaction

Two February 2026 papers on reducing inference cost: Together AI’s Consistency DLM (up to 14.5× faster) and MIT/Harvard’s Attention Matching KV compaction (50× compaction in seconds).

AI LLM Inference Optimization KV Cache Diffusion models