Hypura’s NVMe Streaming and TurboQuant’s KV Cache Quantization
I just introduced SSD expert streaming in the Flash‑MoE article, but there’s another project tackling the same problem with a different design. Hypura is a Rust-based, Apple Silicon–only LLM inference scheduler that runs llama.cpp GGUF models via NVMe streaming. Unlike Flash‑MoE, which is MoE‑only, Hypura also supports streaming FFN layers for dense models.
The other, TurboQuant, is a 3‑bit quantization method for the KV cache from Google Research (accepted to ICLR 2026). That said, KV‑cache compression becomes a bottleneck mainly in highly parallel server-side inference; if you’re using Ollama or LM Studio alone, it’s rarely an issue. With that premise in mind, let’s look into both algorithms.
Hypura: Breaking past llama.cpp’s memory limits with a three‑tier NVMe scheduler
Structural constraints caused by llama.cpp’s mmap design
llama.cpp passes a GGUF file to Metal as a single mmap. On M1 Max, the recommended maximum GPU working set size (recommendedMaxWorkingSetSize) is 26.8 GB. A 30.9 GB Mixtral 8x7B Q5_K_M immediately exceeds it and crashes with kIOGPUCommandBufferCallbackErrorOutOfMemory.
No amount of tweaking n_gpu_layers avoids this. llama.cpp’s CPU/GPU offload is decided per layer, but because the entire model is reserved as one big mmap, Metal rejects it the moment the total mapped size exceeds the GPU working set limit. Even the per‑tensor offload via --override-tensor I tried in the BERT+Qwen OCR correction tool article doesn’t change the total mapped size.
Hypura abandons this one‑shot mmap and switches to choosing a placement per tensor.
Tensor placement optimization with LP + greedy
At startup, Hypura profiles the hardware. It measures the GPU working set limit via the Metal API, system RAM capacity, and NVMe sequential read bandwidth, and uses these three numbers as constraints for tensor placement.
Placement is decided in two stages using LP (linear programming) and a greedy algorithm.
- Take each tensor’s size and access pattern (per‑token or conditional) as inputs
- Use LP to find the ideal placement that minimizes the sum of access latencies, subject to GPU/RAM/NVMe capacity constraints
- Use greedy rounding to convert the LP solution into a feasible integer assignment (tensors can’t be split, so the continuous LP values must be discretized)
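The two-stage flow can be sketched in a few lines of Python. This toy covers only the greedy stage (the LP relaxation is omitted), with illustrative tensor names and capacities, not Hypura's Rust internals:

```python
# Toy sketch of the placement idea (greedy stage only): assign each tensor to
# the fastest tier with remaining capacity, visiting hot tensors first.
# Hypura's real scheduler solves an LP first and rounds its fractional
# solution; names and capacities here are illustrative.
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size_gb: float
    accesses_per_token: float  # per-token tensors score higher than conditional ones

# Tiers ordered fastest-first, with capacity limits (GB) from hardware profiling
TIERS = [("gpu", 26.8), ("ram", 16.0), ("nvme", float("inf"))]

def place(tensors):
    remaining = {name: cap for name, cap in TIERS}
    placement = {}
    # Hottest (most frequently accessed) tensors grab the fastest tier first
    for t in sorted(tensors, key=lambda t: t.accesses_per_token, reverse=True):
        for tier, _ in TIERS:
            if t.size_gb <= remaining[tier]:
                remaining[tier] -= t.size_gb
                placement[t.name] = tier
                break
    return placement

tensors = [
    Tensor("attention", 4.0, 1.0),   # small, touched every token
    Tensor("router", 0.5, 1.0),      # MoE shared part, per-token
    Tensor("experts", 24.0, 0.25),   # large, conditional (2 of 8 active)
]
print(place(tensors))  # attention/router land on GPU, experts spill to NVMe
```

The real solver additionally weighs access latency per tier, but the capacity-constrained priority ordering is the core of the rounding step.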
Placement priorities are based on access pattern and size.
| Tensor type | Size/frequency | Placement |
|---|---|---|
| Attention / norm / embedding | Small, per‑token | Keep on GPU |
| MoE shared parts (router, etc.) | Medium, per‑token | GPU (if it fits) |
| FFN / MoE experts | Large, conditional | NVMe |
| Overflow | — | RAM |
Whereas llama.cpp’s n_gpu_layers is a binary choice of “entire layer on GPU or CPU,” Hypura can, for example, keep only the attention tensors on GPU and place FFN tensors on NVMe. This finer granularity is one reason it’s faster than llama.cpp in full‑resident mode (1.4× on Qwen 14B).
How its design differs from Flash‑MoE
Both Flash‑MoE and Hypura share the core idea of “read models that don’t fit from SSD/NVMe,” but they start from different design points.
| | Flash‑MoE | Hypura |
|---|---|---|
| Implementation | C + hand‑written Metal shaders | Rust + llama.cpp fork |
| Model format | Directly reads Safetensors | GGUF |
| Target models | MoE‑only | MoE + dense |
| Cache | Leave to OS page cache (71% hit rate) | Own LRU cache (99.5% hit rate) |
| Tensor placement | Manual (keep 5.5 GB of non‑experts resident in RAM) | LP+greedy automatic optimization |
| Prefetch | Tried and backfired (bandwidth contention on unified memory) | co‑activation prefetch worked |
| API compatibility | None | Ollama‑compatible HTTP server |
The caching strategies are noteworthy. Flash‑MoE relies on the OS page cache and reports a 71% hit rate, a deliberate choice to keep code simple by avoiding explicit cache management. Hypura implements its own LRU (Least Recently Used) cache and reports a 99.5% hit rate.
This gap likely stems from differences in MoE architectures. The Qwen3.5 model targeted by Flash‑MoE activates 4 of 512 experts, so the set of possible experts is large relative to cache space and the OS page cache tends to evict useful pages. Mixtral 8x7B activates 2 of 8 experts, so there are far fewer expert types, making an LRU cache more effective.
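A toy simulation makes the intuition concrete. Assuming Mixtral-style routing of 2 of 8 experts with a made-up skewed popularity distribution, even a small LRU holds the hot set (numbers are illustrative, not Hypura's measurements):

```python
# Toy illustration of why a small expert set favors an LRU cache: simulate
# 2-of-8 routing against an LRU that holds only 6 experts. The skewed
# popularity weights are invented for illustration.
import random
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)         # mark as most recently used
            return True                           # hit
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)      # evict least recently used
        self.entries[key] = True
        return False                              # miss

random.seed(0)
cache = LRUCache(capacity=6)
hits = trials = 0
weights = [8, 8, 4, 4, 2, 2, 1, 1]  # a few experts fire far more often
for _ in range(10_000):
    for expert in random.choices(range(8), weights=weights, k=2):
        hits += cache.access(expert)
        trials += 1
print(f"hit rate: {hits / trials:.1%}")
```

With 512 experts and 4 active, the same cache size would cover only a sliver of the routing distribution, which is the asymmetry the article describes.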
Prefetching outcomes also diverge. Flash‑MoE’s predictive expert routing achieved only 31% accuracy and wasn’t practical; prefetching itself worsened performance by 73% due to contention between GPU compute and I/O bandwidth on Apple Silicon’s unified memory bus. Hypura’s co‑activation prefetch likely succeeds because the approach differs: instead of predicting a single next expert, it tracks expert co‑activation patterns (given one expert fires, which experts tend to fire in the next layer) and prefetches multiple likely candidates. Casting a wider candidate set improves prefetch hit rate.
NVMe I/O choices and mmap pitfalls
Hypura uses pread with F_NOCACHE for NVMe access. pread is a thread‑safe system call to read from an arbitrary file offset; F_NOCACHE bypasses the OS page cache (direct I/O).
The reason to choose direct I/O is that tensor load patterns are predictable from model structure, and a bespoke LRU cache hits more accurately than the OS’s general‑purpose page replacement. Going through the OS cache risks cache pollution by unrelated system‑wide I/O.
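In Python terms the access pattern looks roughly like this. F_NOCACHE is macOS-specific (Python's fcntl module exposes it only on Darwin, so a fallback constant is defined here, 48 being its value in Darwin's fcntl.h); on other platforms the call fails harmlessly and reads go through the page cache:

```python
# Sketch of Hypura's I/O pattern: positioned reads with the OS page cache
# bypassed where the platform supports it.
import fcntl
import os

F_NOCACHE = getattr(fcntl, "F_NOCACHE", 48)  # macOS-only; 48 on Darwin

def read_tensor(path, offset, length):
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            fcntl.fcntl(fd, F_NOCACHE, 1)  # direct I/O: bypass the page cache
        except OSError:
            pass  # not macOS; fall back to cached reads
        # pread reads at an absolute offset without moving the file position,
        # so many threads can pull different tensors from one fd safely
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

The thread-safety of pread matters here: expert tensors can be fetched concurrently from a single open GGUF file without per-read locking.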
Avoiding mmap aligns with Flash‑MoE’s findings: mmap caused a 5× slowdown there. On Apple Silicon with unified memory, page‑fault overhead from mmap is heavy. For llama.cpp itself—a mmap‑centric design—this is fine when the whole model fits, but when it doesn’t, page faults occur during every token’s inference and latency becomes unstable. This matches the observation in the llama‑server experiment article that --no-mmap is essential on unified‑memory APUs.
Three inference modes and why dense‑FFN‑streaming is hard
The mode is selected automatically from model size and hardware capacity.
| Mode | When it applies | Tensors resident on GPU | NVMe transfers |
|---|---|---|---|
| full‑resident | Fits in GPU+RAM | Whole model | None |
| expert‑streaming | MoE model exceeds GPU | Non‑expert tensors | Load MoE experts on demand |
| dense‑FFN‑streaming | Dense model exceeds GPU | Attention + norm | Stream FFN layers |
expert‑streaming follows the same principle as Flash‑MoE: thanks to MoE sparsity (Mixture of Experts activates only part of the experts per input), the I/O volume shrinks dramatically. With Mixtral, only 2 of 8 experts are read—75% less. The neuron cache (LRU) achieves a 99.5% hit rate, and the aforementioned co‑activation prefetch reads likely experts from NVMe in parallel with GPU compute.
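The co-activation idea can be sketched as a simple counting structure (the data layout and names are illustrative, not Hypura's internals):

```python
# Toy sketch of co-activation prefetch: count which experts in layer L+1 tend
# to fire after each expert in layer L, then prefetch a wide candidate set
# rather than a single hard prediction.
from collections import Counter, defaultdict

class CoActivationTracker:
    def __init__(self, top_k=2):
        self.top_k = top_k
        # (layer, expert) -> counts of experts observed in the next layer
        self.counts = defaultdict(Counter)

    def observe(self, layer, experts, next_experts):
        for e in experts:
            self.counts[(layer, e)].update(next_experts)

    def prefetch_candidates(self, layer, experts):
        # Union of the most likely next-layer experts for every active expert
        candidates = Counter()
        for e in experts:
            candidates.update(dict(self.counts[(layer, e)].most_common(self.top_k)))
        return [e for e, _ in candidates.most_common(self.top_k * len(experts))]

tracker = CoActivationTracker(top_k=2)
# Replay a routing trace: expert 3 in layer 0 is usually followed by 5 and 7
for next_experts in ([5, 7], [5, 7], [5, 2]):
    tracker.observe(layer=0, experts=[3], next_experts=next_experts)
print(tracker.prefetch_candidates(layer=0, experts=[3]))
```

Prefetching several candidates trades a little extra NVMe traffic for a much better chance that the needed expert is already resident, which is the gap between Flash‑MoE's 31%-accurate single prediction and Hypura's approach.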
graph TD
A[GGUF model file] --> B[Hardware profiling<br/>GPU working set / RAM / NVMe bandwidth]
B --> C[LP+greedy tensor placement]
C --> D[GPU Metal<br/>Attention / norm / embedding]
C --> E[RAM<br/>overflow layers]
C --> F[NVMe<br/>FFN / MoE experts]
D --> G[Inference]
E --> G
F -->|pread + F_NOCACHE| G
G --> H{MoE?}
H -->|Yes| I[LRU cache<br/>99.5% hit rate]
H -->|No| J[280 MB scratch buffer<br/>streaming]
I --> K[co-activation prefetch<br/>preload next-layer candidates]
dense‑FFN‑streaming is the area Flash‑MoE doesn’t cover, and this is Hypura’s distinct contribution. For dense models like Llama 70B where MoE sparsity can’t be used, all FFN gate/up/down tensors (~31.8 GB) must be read from NVMe every token.
Why this is inherently hard shows up in the numbers. For Mixtral 8x7B's MoE streaming, each token needs I/O for only 2 of 8 experts. For Llama 70B dense‑FFN‑streaming, every token reads FFN tensors across all 80 layers. On an M1 Max with 5.1 GB/s of NVMe bandwidth, a back‑of‑the‑envelope estimate puts that at several seconds of I/O per token.
Hypura streams in 280 MB chunks into a scratch buffer so GPU compute and I/O can partially overlap. It also avoids committing 22 GB of anonymous mmap pages. Together these lifted throughput from 0.03 tok/s (34 layers on CPU) to 0.3 tok/s with all 80 layers on Metal—a 10× improvement.
That 0.3 tok/s is also today’s ceiling. The fact that M1 Max (5.1 GB/s NVMe) and M5 Pro (33.4 GB/s NVMe) perform almost the same suggests the bottleneck isn’t raw NVMe bandwidth but per‑layer I/O stalls—waiting for NVMe reads to complete, about 50 ms × 80 layers per token.
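The stall arithmetic is easy to check with the article's own figures (the ~50 ms per-layer stall and the 31.8 GB FFN read are from the text; the bandwidth/stall split is the estimate being illustrated):

```python
# Quick check of the stall-bound ceiling: 80 layers x ~50 ms of
# read-completion waits is bandwidth-independent, which is why a 6.5x faster
# NVMe barely moves dense-FFN-streaming throughput.
layers = 80
stall_per_layer = 0.050                      # seconds, from the article
stall_time = layers * stall_per_layer        # does not shrink with faster NVMe
print(f"stall time/token: {stall_time:.1f} s -> ceiling {1/stall_time:.2f} tok/s")

# Pure transfer time for the 31.8 GB of FFN tensors, for contrast:
for name, gbps in [("M1 Max", 5.1), ("M5 Pro", 33.4)]:
    print(f"{name}: transfer alone {31.8 / gbps:.1f} s/token")
```

On the M5 Pro the transfer alone would take about a second, so the ~4 s of per-layer stalls are what keeps throughput pinned near 0.3 tok/s on both machines.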
RESEARCH_INTEGRATION_PLAN.md outlines a plan to implement double‑buffering inspired by ntransformer’s SLEP pipeline: while GPU compute runs on buffer A, load the next layer into buffer B from NVMe, then swap roles when done. If implemented, this should fully overlap I/O stalls with GPU compute and substantially improve dense‑FFN‑streaming. It’s the same intuition as Flash‑MoE’s delayed GPU execution pattern, but the challenge is overcoming the shared memory‑controller constraint between SSD DMA and GPU compute on Apple Silicon’s unified bus (Flash‑MoE concluded a serial pipeline was optimal).
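In sketch form, the planned double-buffering amounts to keeping one I/O future in flight while computing, then swapping roles each layer (a shape-of-the-idea illustration in Python, not Hypura's Rust implementation):

```python
# Minimal double-buffering sketch: load layer N+1 while computing layer N.
# The two "buffers" are the current weights and the pending future.
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(layers, load, compute):
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load, layers[0])        # prefetch the first layer
        for i, layer in enumerate(layers):
            weights = pending.result()              # wait for this layer's I/O
            if i + 1 < len(layers):
                # Kick off the NVMe read for the next layer before computing
                pending = io.submit(load, layers[i + 1])
            results.append(compute(layer, weights)) # overlaps with the read above
    return results

# Toy usage with stand-in load/compute functions
out = run_pipeline(
    layers=list(range(4)),
    load=lambda n: f"weights{n}",
    compute=lambda n, w: (n, w),
)
print(out)
```

Whether this overlap actually materializes on Apple Silicon depends on the shared memory-controller question the article raises; the sketch only shows the scheduling shape.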
Benchmarks
Measured on M1 Max and M5 Pro.
| Model | Size | Mode | M1 Max 32GB | M5 Pro 24GB | llama.cpp |
|---|---|---|---|---|---|
| Qwen 2.5 14B Q4_K_M | 8.4GB | full‑resident | 12.3 tok/s | 27.2 tok/s | 8.9 tok/s |
| Qwen 2.5 32B Q5_K_M | 21.7GB | full‑resident | 6.6 tok/s | — | — |
| Mixtral 8x7B Q5_K_M | 30.9GB | expert‑streaming | 2.2 tok/s | 2.7 tok/s | OOM |
| Phi-3.5-MoE Q4_K_M | 23.6GB | expert‑streaming | — | 3.2 tok/s | — |
| Qwen3-Coder-Next Q4_K_M | 45.2GB | expert‑streaming | 1.3 tok/s | — | OOM |
| Llama 3.3 70B Q4_K_M | 39.6GB | dense‑FFN‑streaming | 0.3 tok/s | 0.3 tok/s | OOM |
Compared with Flash‑MoE on Qwen3.5‑397B (4.36 tok/s on M3 Max 48 GB), Hypura’s expert‑streaming for Mixtral 8x7B (2.2 tok/s on M1 Max 32 GB) looks weaker. But both the model size (209 GB vs. 30.9 GB) and hardware (M3 Max 48 GB vs. M1 Max 32 GB) differ, so it’s hard to compare directly. Hypura’s advantages are its integration with the GGUF ecosystem—use existing GGUF models as‑is—and its support for dense models.
0.3 tok/s isn’t suitable for real‑time dialog, but it is practical for background document summarization, batch jobs, or “checking 70B‑class output quality on your Mac.”
Installation requires Rust 1.75+ and CMake:
git clone --recurse-submodules https://github.com/t8/hypura.git
cd hypura
cargo build --release
Its Ollama‑compatible HTTP server (hypura serve) exposes /api/generate, /api/chat, and /api/tags, so you can plug it straight into frontends that speak the Ollama protocol, such as Open WebUI.
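As a quick smoke test, the server can be driven with a few lines of Python. The payload shape below follows the Ollama API convention; the model tag and port (Ollama's default 11434) are assumptions, not documented Hypura values:

```python
# Build an Ollama-convention request against Hypura's /api/generate endpoint.
# Model tag and port are illustrative assumptions.
import json
from urllib import request

payload = {
    "model": "mixtral:8x7b-q5_K_M",   # illustrative model tag
    "prompt": "Summarize this document.",
    "stream": False,                  # one JSON response instead of a stream
}
req = request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running server:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```

Because the endpoint names match Ollama's, the same request works unchanged against either backend, which is the point of the compatibility layer.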
TurboQuant: 3‑bit KV‑cache compression for up to 8× server throughput
When KV‑cache compression actually becomes a bottleneck
“KV‑cache quantization” may sound essential for running large models locally, but for local inference it rarely matters.
KV‑cache size is determined by:
2 (for K and V) × batch size × number of layers × number of KV heads × head dimension × context length × bytes per element
Llama 3.1 8B uses GQA (Grouped Query Attention, sharing KV heads across multiple query heads). With 32 query heads and only 8 KV heads, FP16, 8K context, and a single user:
2 × 1 × 32 layers × 8 heads × 128 dims × 8192 tokens × 2 bytes ≈ 1.1 GB
Add the model itself (~4.5 GB for Q4_K_M) for about 5.6 GB total. That fits comfortably even on a 16 GB Mac. Stretching to a 128K context is ~17 GB, which a 32 GB+ Mac can handle.
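The formula translates directly into code, checked here against the article's Llama 3.1 8B numbers (32 layers, 8 KV heads thanks to GQA, head dimension 128, FP16 = 2 bytes):

```python
# KV-cache size: 2x for Keys and Values, times the shape of the cache.
def kv_cache_bytes(batch, layers, kv_heads, head_dim, context, dtype_bytes=2):
    return 2 * batch * layers * kv_heads * head_dim * context * dtype_bytes

llama8b = dict(batch=1, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
print(f"8K context:   {kv_cache_bytes(context=8192, **llama8b) / 1e9:.1f} GB")    # ~1.1 GB
print(f"128K context: {kv_cache_bytes(context=131072, **llama8b) / 1e9:.0f} GB")  # ~17 GB
```

Multiplying the batch term by 64 reproduces the ~70 GB server-side figure below, which is where compression starts to pay.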
KV‑cache pressure becomes serious in cases like:
- 64 simultaneous users on a server → 1.1 GB × 64 = 70 GB (KV alone)
- Gemini‑class 1M‑token context windows → hundreds of GB
- Both at once in a data‑center setting
It’s no accident the TurboQuant paper explicitly mentions Gemini as an application. It’s Google Research work to raise throughput in environments that handle thousands of concurrent requests. For a single user running an 8B model on Ollama/LM Studio, it’s irrelevant.
Still, the algorithms are original and applicable beyond KV caches.
The overhead problem with conventional quantization constants
As covered in the KV‑compression article, Attention Matching compresses by choosing which tokens’ KVs to keep. TurboQuant compresses by lowering the bit‑width of each token’s KV vector.
Conventional quantization methods (KIVI, KVQuant, GEAR, etc.) must store quantization parameters per small data block.
quantized block = [N quantized values] + [scale (FP16)] + [zero point (FP16)]
For example, with group size 128 and 2‑bit quantization, 128 two‑bit values (32 bytes) carry an extra 4 bytes of quantization constants. The effective bit‑width isn’t 2 but about 2.25 bits. The fewer the bits, the larger this overhead ratio becomes, offsetting compression.
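The overhead arithmetic generalizes into a one-liner (a plain calculation, nothing library-specific):

```python
# Effective bit-width once per-block quantization constants are counted in:
# each group of `group` values carries an FP16 scale and zero point (4 bytes).
def effective_bits(bits, group, constant_bytes=4):
    return bits + constant_bytes * 8 / group

print(effective_bits(bits=2, group=128))  # 2.25 -- the example above
print(effective_bits(bits=2, group=32))   # 3.0  -- overhead grows as groups shrink
```

Shrinking the group improves quantization fidelity but inflates the constant overhead, which is exactly the trade-off TurboQuant sidesteps.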
Even KIVI, which exploits the asymmetric compressibility of Keys vs. Values discussed in the KV‑compression article (Keys can drop to INT2; Values directly affect accuracy), cannot avoid this overhead.
TurboQuant combines two algorithms—PolarQuant and QJL—to eliminate quantization‑constant overhead itself.
PolarQuant: make quantization constants unnecessary via polar coordinates
PolarQuant’s core idea is to transform vectors from Cartesian to polar coordinates so the distribution becomes quantization‑friendly.
In standard quantization, because each element’s value range varies, you must store scale and zero‑point per block. PolarQuant breaks that assumption.
Converting a d‑dimensional vector to polar coordinates yields one radius r and (d‑1) angles θ. The radius encodes magnitude (norm); the angles encode direction (semantic information).
Crucially, the angular components’ distribution in high dimensions concentrates near π/2. This concentration of measure on the high‑dimensional sphere grows stronger with dimension. KV vectors in LLMs aren’t purely random, but with 128–256 dimensions angular distributions concentrate enough.
With a concentrated and known‑in‑advance range, a uniform quantization grid can be applied to all elements without storing per‑block scale/zero‑point. That’s how PolarQuant achieves “zero overhead for quantization constants.”
The transform is applied recursively. Pair up coordinates and convert to polar to get d/2 radii and d/2 angles. Gather the radii and apply the same transform again. Eventually you distill one scalar radius and many angle values.
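The recursion can be sketched with the stdlib alone, under the simplifying assumption that d is a power of two (function names are illustrative):

```python
# Recursive polar transform: pair coordinates into (radius, angle), then pair
# the radii again until one scalar radius remains alongside d-1 angles.
import math

def polar_tree(vec):
    radii, angles = list(vec), []
    while len(radii) > 1:
        next_radii = []
        for x, y in zip(radii[::2], radii[1::2]):
            next_radii.append(math.hypot(x, y))   # magnitude of the pair
            angles.append(math.atan2(y, x))       # direction of the pair
        radii = next_radii
    return radii[0], angles

r, thetas = polar_tree([3.0, 4.0, 0.0, 0.0])
print(r, len(thetas))  # the final radius is the vector's L2 norm: 5.0, 3 angles
```

Note that after the first level every input to atan2 is a non-negative radius, so deeper angles fall in [0, π/2], a bounded, known range that a fixed uniform grid can quantize without per-block constants.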
graph TD
A[d-dimensional vector<br/>Cartesian coordinates] --> B[Polar transform<br/>per coordinate pair]
B --> C[d/2 radii<br/>d/2 angles]
C --> D[Recursively transform the radii]
D --> E[1 radius + many angles]
E --> F[Radius: kept as FP16<br/>2 bytes/vector]
E --> G[Angles: uniform quantization<br/>no constants needed]
F --> H[Multiply together at reconstruction]
G --> H
Per vector, the only overhead is one FP16 radius (2 bytes). For 128‑dimensional vectors quantized to 3 bits, conventional methods with group size 8 would need 32 pairs of constants (128 bytes), whereas PolarQuant needs just 2 bytes. The compression gap is stark.
PolarQuant is accepted to AISTATS 2026 (arxiv:2502.02617) and stands as independent work on its own.
QJL: correct residuals with 1 bit via a Johnson–Lindenstrauss transform
PolarQuant alone leaves quantization error: the residual between the original vector and the quantized‑then‑reconstructed vector. QJL (Quantized Johnson–Lindenstrauss) corrects this residual with essentially zero additional memory overhead.
The Johnson–Lindenstrauss (JL) lemma underpins dimensionality reduction: projecting high‑dimensional points with a random matrix into a lower dimension preserves pairwise distances within ε accuracy. The target dimension is O(log(n)/ε²) and does not depend on the original dimension. JL is widely used in embedding retrieval and approximate nearest‑neighbor search.
TurboQuant uses this random projection to map the PolarQuant residual vector to a lower dimension, then stores only the sign (+1/−1) of each projected component.
graph TD
A[PolarQuant residual<br/>e = x - x_quantized] --> B[Project with random matrix R<br/>z = R × e]
B --> C[Keep signs only<br/>s_i = sign of z_i]
C --> D[+1/-1 bitstring<br/>1 bit/dimension]
E[Query vector q<br/>kept as FP16] --> F[Project with the same R<br/>z_q = R × q]
D --> G[Corrected attention score<br/>score = PolarQuant term + QJL correction]
F --> G
At attention‑score time, the query vector q remains FP16 and is not quantized. You take the inner product between the PolarQuant‑quantized KV vectors and q, then add a QJL correction term. This term—computed from the high‑precision query and the 1‑bit sign codes—is designed so its expectation exactly compensates for the inner‑product error due to quantization. The paper proves this unbiased estimator formally.
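The unbiasedness is easy to check numerically. The sketch below is a Monte-Carlo illustration of the underlying identity for Gaussian projections g — E[sign(g·e)(g·q)] = sqrt(2/π)·⟨e/‖e‖, q⟩ — not the paper's exact correction-term formula; the vectors are made-up stand-ins:

```python
# Monte-Carlo check of the sign-based inner-product estimator behind QJL:
# only the sign of each projected residual component is kept, yet together
# with the full-precision query the true inner product <q, e> is recovered
# in expectation.
import math
import random

random.seed(42)
d, m = 8, 50_000
e = [0.5, -0.2, 0.1, 0.3, -0.4, 0.2, 0.0, 0.1]   # stand-in residual vector
q = [0.3, 0.1, -0.2, 0.4, 0.2, -0.1, 0.5, 0.0]   # full-precision query
norm_e = math.sqrt(sum(x * x for x in e))

acc = 0.0
for _ in range(m):
    g = [random.gauss(0, 1) for _ in range(d)]
    s = 1.0 if sum(gi * ei for gi, ei in zip(g, e)) >= 0 else -1.0  # stored sign bit
    acc += s * sum(gi * qi for gi, qi in zip(g, q))                 # query side, FP

estimate = norm_e * math.sqrt(math.pi / 2) * acc / m
true_dot = sum(qi * ei for qi, ei in zip(q, e))
print(f"true <q,e> = {true_dot:.3f}, estimate = {estimate:.3f}")
```

The estimate converges to the true inner product as the number of projections grows; QJL's contribution is making this work at 1 bit per dimension inside the attention-score computation.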
The random projection matrix R is generated once at the start of inference and shared across layers and heads. Beyond the quantized values themselves, the only extra data stored per KV vector is its sign bitstring: 1 bit per dimension, or 16 bytes for d=128. Adding PolarQuant's 2‑byte radius puts the total overhead at 18 bytes per vector. Including the 3‑bit angle codes (48 bytes at d=128), a full vector costs about 66 bytes, roughly a quarter of FP16's 256 bytes.
QJL is published at AAAI and also stands on its own as independent research.
TurboQuant’s two‑stage pipeline
Putting the two together looks like this:
graph TD
A[KV cache vector<br/>FP16] --> B[PolarQuant<br/>polar transform + uniform quantization]
B --> C[Quantized KV<br/>3 bits/element + FP16 radius]
A --> D[Residual<br/>e = original - reconstruction]
D --> E[QJL<br/>random projection + sign encoding]
E --> F[Sign bitstring<br/>1 bit/dimension]
C --> G[Attention computation]
F --> G
H[Query q<br/>FP16] --> G
G --> I[Corrected attention score]
You only quantize the new token’s KV on arrival; previously quantized KVs don’t need to be recomputed. This online property is a strength compared with batch compression methods like Attention Matching. No fine‑tuning is required; you can apply it to existing models as‑is.
Positioning among existing KV‑compression methods
There are three broad approaches to KV‑cache compression.
| Approach | Representative methods | Compression method | Compression ratio | Training | Online applicability |
|---|---|---|---|---|---|
| Token selection/merging | H2O, SnapKV, Attention Matching | Keep only important tokens | Up to 50× | Not required | Partial |
| Quantization | KIVI, KVQuant, TurboQuant | Reduce bit‑width | Around 5× | Not required | Supported |
| Latent‑space optimization | Cartridges | Learn compressed KVs with gradients | 50× | Required (hours) | Not supported |
Attention Matching (see the previous article) achieves 50× compression with Cartridge‑level accuracy in seconds via a closed‑form solution. However, it assumes batch processing—prefill the entire context and then compress—so it’s not designed for online application where tokens keep arriving mid‑conversation (the paper only mentions this briefly in the appendix).
TurboQuant reduces 16‑bit FP16 to 3 bits for about 5.3× compression. It can’t match Attention Matching’s 50×, but its strength is ease of online application: you quantize as tokens arrive, integrating it directly into the inference pipeline.
The difference from KIVI is stark: whether you pay the overhead of quantization constants determines the effective bit‑width. KIVI’s INT2 is effectively ~2.25 bits and quality degrades. TurboQuant’s 3 bits are truly 3 bits—with no accuracy loss in the experiments.
Benchmarks
Evaluated on Gemma, Mistral, and Llama‑3.1‑8B‑Instruct with LongBench, Needle‑in‑a‑Haystack, ZeroSCROLLS, RULER, and L‑Eval.
| Metric | Value |
|---|---|
| Quantization bit‑width | 3 bits (effective bit‑width also ≈3 bits) |
| Accuracy degradation | None (no fine‑tuning required) |
| KV‑cache memory reduction | Up to 6× (Needle‑in‑a‑Haystack) |
| Inference throughput | Up to 8× (H100, 4‑bit vs. 32‑bit) |
| LongBench aggregate | Beats KIVI baseline |
Needle‑in‑a‑Haystack places a specific fact inside a very long context, making it highly sensitive to KV‑cache compression. Preserving accuracy while cutting memory 6× here implies PolarQuant’s angular‑concentration‑aware quantization keeps the attention‑score ranking intact. This aligns with the property explained in the KV‑compression article: “as long as Key ranking is preserved, softmax absorbs the rest.”
On vector‑search evaluation (GloVe d=200, Top‑1 recall), TurboQuant consistently outperformed Product Quantization (PQ) and RaBitQ. The algorithms apply beyond KV caches—to semantic search engines and vector DB quantization. The design is “provably efficient,” operating near theoretical lower bounds, and it is data‑oblivious, so it works without pre‑tuning to a specific dataset.
The paper is accepted to ICLR 2026 (arxiv:2504.19874); PolarQuant to AISTATS 2026 (arxiv:2502.02617); and QJL appears at AAAI.
Different layers of problems they solve
Hypura and TurboQuant target entirely different users.
Hypura serves local‑LLM users who “want to run big models on their own Mac.” It rides the llama.cpp GGUF ecosystem and exposes an Ollama‑compatible API, so it plugs straight into existing frontends. Unlike Flash‑MoE, its dense‑FFN‑streaming supports dense models like Llama 70B—that’s its unique value. Following Flash‑MoE, this is a second example of exploiting Apple Silicon’s NVMe bandwidth as an inference resource.
TurboQuant aims to “cut inference cost for Gemini‑class models in data centers.” For local single‑user 8B workloads, the KV cache is around 1 GB and doesn’t need compression. But PolarQuant’s polar‑coordinate transform and QJL’s 1‑bit sign correction are general quantization ideas; as the benchmarks show, they even beat PQ and RaBitQ on vector retrieval.