SwiftLM is a Swift-based LLM inference server that implements TurboQuant KV-cache compression in Metal shaders and streams MoE experts from SSD
I previously introduced the TurboQuant KV‑cache quantization algorithm, but that was a theoretical walkthrough of the Google Research paper. SwiftLM implements TurboQuant natively in C++/Metal shaders and combines it with SSD expert streaming in a working Swift inference server. No Python runtime, no GIL — it runs directly on Apple Silicon via MLX. It even ships with an iOS app.
Overview of SwiftLM
SwiftLM is a 100% native Swift LLM inference server that provides an OpenAI‑compatible API (/v1/chat/completions). Under the hood it uses Apple’s MLX framework (mlx‑swift) and the Swift Hummingbird HTTP server.
There are two main features:
- On‑the‑fly KV‑cache compression via a TurboQuant V2+V3 hybrid (~3.5×)
- Zero‑copy streaming of MoE expert layers from NVMe SSD to the GPU command buffer
The target hardware is a MacBook Pro M5 Pro (64 GB unified memory), designed around running Qwen3.5‑122B‑A10B‑4bit. On iOS, the app runs Qwen3 1.7B on an iPhone 13 Pro (6 GB).
```mermaid
graph TD
A[HuggingFace model<br/>Safetensors] --> B[SwiftLM Server<br/>macOS]
B --> C[MLX Swift<br/>Metal GPU]
C --> D[TurboQuant V2+V3<br/>KV-cache compression]
C --> E[SSD Expert Streaming<br/>NVMe zero-copy]
B --> F[OpenAI-compatible API<br/>localhost:5413]
G[SwiftLM Chat<br/>iOS App] --> H[MLX Swift<br/>on-device inference]
H --> I[iPhone/iPad<br/>Metal GPU]
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#e3f2fd,stroke:#2196f3
```
TurboQuant V2+V3 Hybrid Implementation
In the previous article I explained TurboQuant’s theory (PolarQuant’s polar transform plus QJL’s 1‑bit sign correction). What’s interesting about SwiftLM is how it fuses “the speed of V2 and the quality of V3” to make the theory practical on a Metal GPU.
Open‑source TurboQuant implementations have largely split into two approaches:
| Variant | Quantization method | Speed | Quality |
|---|---|---|---|
| V2 (hardware‑accelerated) | Linear affine quantization | Fast (uses MLX quantized_matmul) | Noticeable degradation at 3‑bit |
| V3 (paper‑faithful) | Nonlinear Lloyd‑Max codebook | Slow (software dequantization) | High quality |
V2, as adopted by projects like turboquant-mlx, can leverage MLX’s hardware‑accelerated matrix ops so it’s fast, but linear (affine) quantization shows its limits at 3‑bit. V3 computes an optimal nonlinear codebook with the Lloyd‑Max algorithm, so quality is high, but dequantization has to run in software and is slow.
SwiftLM ports the V3 Lloyd‑Max codebook into a native C++ encoding path and performs dequantization with a fused Metal shader (bggml-metal). In other words, it keeps V3‑level quality while dequantizing at GPU speed.
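The actual dequantization lives inside that fused shader, but the core operation is just a table lookup. Here is a CPU-side C++ sketch of the per-element work; the codebook values are the classical 8-level Lloyd-Max levels for a standard normal, and the function name is illustrative (the real table is scaled for the post-transform coordinates):

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

// 8-entry (3-bit) codebook: the classical Lloyd-Max output levels for a
// standard normal; the real table is rescaled for the N(0, 1/d) coordinates.
constexpr std::array<float, 8> kCodebook = {
    -2.152f, -1.344f, -0.756f, -0.245f, 0.245f, 0.756f, 1.344f, 2.152f};

// Dequantize one vector: each stored element is a 3-bit index into the
// codebook, rescaled by the per-vector L2 norm extracted during encoding.
// No affine scale/zero-point arithmetic is involved: it is a pure table
// lookup, which is what the fused shader performs per element on the GPU.
std::vector<float> dequantize(const std::vector<uint8_t>& codes, float norm) {
    std::vector<float> out(codes.size());
    for (size_t i = 0; i < codes.size(); ++i)
        out[i] = kCodebook[codes[i] & 0x7] * norm;
    return out;
}
```

Because the codebook is nonlinear, this lookup cannot be expressed as the multiply-add that hardware-accelerated affine paths rely on, which is exactly why a custom shader is needed.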
What is a Lloyd‑Max codebook?
Lloyd‑Max is a scalar quantization optimization method proposed in the 1960s that iteratively finds quantization levels minimizing mean squared error (MSE) for a given input distribution. Unlike affine quantization, which rounds on an evenly spaced grid, Lloyd‑Max places more levels where data density is high and fewer where it’s low.
For TurboQuant, theory shows that the coordinates after PolarQuant’s angular transform follow a standard normal distribution N(0, 1/d). Because that distribution is known a priori, you can precompute an 8‑level (3‑bit) Lloyd‑Max codebook once before inference. A key benefit of TurboQuant is that no data‑dependent calibration is needed, but because the codebook itself is nonlinear, GPU dequantization requires a custom shader.
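That offline precomputation is only a few lines of fixed-point iteration. A sketch for N(0,1), alternating midpoint decision thresholds with conditional-mean centroids (rescaling the result by 1/√d adapts it to the N(0, 1/d) coordinates; the function name is illustrative):

```cpp
#include <array>
#include <cmath>

// Offline Lloyd-Max for a standard normal: alternate between midpoint
// decision thresholds and conditional-mean centroids until the 8 levels
// (3 bits) settle. Because TurboQuant's post-transform distribution is
// known a priori, this runs once before inference; no calibration data.
std::array<double, 8> lloydMaxNormal(int iters = 500) {
    const double kPi = 3.14159265358979323846;
    auto pdf = [&](double x) { return std::exp(-0.5 * x * x) / std::sqrt(2.0 * kPi); };
    auto cdf = [](double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); };
    std::array<double, 8> c;
    for (int i = 0; i < 8; ++i) c[i] = -2.0 + i * (4.0 / 7.0);  // symmetric initial grid
    std::array<double, 9> t;
    t[0] = -INFINITY;
    t[8] = INFINITY;
    for (int it = 0; it < iters; ++it) {
        for (int i = 1; i < 8; ++i) t[i] = 0.5 * (c[i - 1] + c[i]);  // thresholds
        for (int i = 0; i < 8; ++i)                 // centroid = conditional mean
            c[i] = (pdf(t[i]) - pdf(t[i + 1])) /    // of N(0,1) over each cell
                   (cdf(t[i + 1]) - cdf(t[i]));
    }
    return c;
}
```

The iteration converges to the well-known 8-level Gaussian quantizer with positive levels near 0.245, 0.756, 1.344, and 2.152.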
KV‑cache compression breakdown
SwiftLM uses different strategies for the K (Key) and V (Value) caches.
The K‑cache is quantized to 4.25 bits per dimension in five steps:
- Extract the L2 norm and normalize
- Apply a Fast Walsh‑Hadamard Transform (WHT) to spread outliers evenly
- Quantize with a 3‑bit nonlinear Lloyd‑Max centroid
- Compute the residual between the original vector and the quantized approximation
- Project the residual with a random Johnson–Lindenstrauss (QJL) matrix and keep a 1‑bit sign
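The five steps can be sketched on the CPU as follows. Every name here is illustrative, the codebook values are the standard-normal Lloyd-Max levels rather than SwiftLM's actual table, the Walsh-Hadamard rotation between steps 1 and 3 is elided, and the JL matrix is passed in rather than generated:

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct EncodedK {
    float norm;                  // step 1: extracted L2 norm
    std::vector<uint8_t> codes;  // step 3: 3-bit codebook indices
    std::vector<uint8_t> signs;  // step 5: 1-bit QJL residual signs
};

constexpr std::array<float, 8> kCodebook = {
    -2.152f, -1.344f, -0.756f, -0.245f, 0.245f, 0.756f, 1.344f, 2.152f};

EncodedK encodeK(const std::vector<float>& k,
                 const std::vector<std::vector<float>>& jl) {
    EncodedK e;
    // 1. extract the L2 norm and normalize
    double s = 0;
    for (float x : k) s += double(x) * x;
    e.norm = float(std::sqrt(s));
    std::vector<float> u(k.size()), residual(k.size());
    for (size_t i = 0; i < k.size(); ++i) u[i] = k[i] / e.norm;
    // 2. (the Walsh-Hadamard rotation would be applied to u here)
    // 3. quantize each coordinate to the nearest Lloyd-Max centroid
    for (size_t i = 0; i < u.size(); ++i) {
        uint8_t best = 0;
        for (uint8_t c = 1; c < 8; ++c)
            if (std::fabs(u[i] - kCodebook[c]) < std::fabs(u[i] - kCodebook[best]))
                best = c;
        e.codes.push_back(best);
        // 4. residual between the vector and its quantized approximation
        residual[i] = u[i] - kCodebook[best];
    }
    // 5. project the residual with a random JL matrix, keep only the signs
    for (const auto& row : jl) {
        double dot = 0;
        for (size_t i = 0; i < row.size(); ++i) dot += row[i] * residual[i];
        e.signs.push_back(dot >= 0 ? 1 : 0);
    }
    return e;
}
```

The output per dimension is one 3-bit code plus one 1-bit sign, with the norm amortized across the whole vector, which is how the 4.25 bits/dim figure is reached.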
The V‑cache is 3.125 bits per dimension. Because it isn’t used for dot‑product attention scoring, QJL’s error correction doesn’t help; by disabling QJL for the V‑cache, SwiftLM squeezes another 25% memory reduction without harming quality.
Taken together this is about 3.6 bits/dim — roughly 3.5× compression versus FP16. As covered in the earlier article, PolarQuant’s angular concentration makes the overhead for quantization constants (scale/zero‑point) essentially zero, so the effective bit‑width closely matches the nominal value. This contrasts with KIVI’s INT2, which inflates to about 2.25 effective bits.
Role of the Walsh‑Hadamard Transform (WHT)
WHT is applied before quantizing the K‑cache and acts as a rotation that evenly spreads out outliers across dimensions. LLM attention KV vectors tend to concentrate energy in a few channels; if you quantize as‑is, those channels suffer large errors.
WHT is an orthogonal transform built from additions and subtractions, with a divide‑and‑conquer structure similar to FFT. Because it requires no multiplications, it lives up to the “Fast” in its name. In SwiftLM it’s implemented as C kernels in turbo-wht.h, ported from TheTom/llama-cpp-turboquant.
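A minimal in-place implementation of that butterfly structure (a generic textbook FWHT, not the turbo-wht.h kernels themselves) looks like this:

```cpp
#include <cstddef>
#include <vector>

// In-place fast Walsh-Hadamard transform: additions and subtractions only,
// O(n log n) butterflies, n must be a power of two. The unnormalized
// transform applied twice returns n times the input, so dividing by n
// (or by sqrt(n) per pass) makes it an orthogonal rotation.
void fwht(std::vector<float>& a) {
    for (size_t h = 1; h < a.size(); h *= 2) {
        for (size_t i = 0; i < a.size(); i += 2 * h) {
            for (size_t j = i; j < i + h; ++j) {
                float x = a[j], y = a[j + h];
                a[j] = x + y;      // butterfly: sum
                a[j + h] = x - y;  // butterfly: difference
            }
        }
    }
}
```

Because each output coordinate mixes every input coordinate with ±1 weights, a single outlier channel gets smeared across all dimensions before quantization, which is exactly the property the K-cache path relies on.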
SSD Expert Streaming
The second core feature of SwiftLM is zero‑copy streaming of MoE expert layers from NVMe SSD to the GPU command buffer. Enable it with the --stream-experts flag.
Why it’s needed
Qwen3.5‑122B‑A10B is a 122‑billion‑parameter MoE model; even at 4‑bit, the full weights are roughly 60 GB. On a system with 64 GB unified memory, adding the KV cache and OS working set pushes memory to the brink. When macOS unified memory exceeds the physical limit, virtual memory swapping kicks in; in the worst case the Watchdog detects a GPU hang and triggers a kernel panic.
Historically, people left this to macOS’s virtual memory and let it swap. But the resulting SSD I/O pattern isn’t aligned with the inference pipeline, causing frequent random page faults — the same issue seen in the Flash‑MoE article where mmap made performance 5× worse.
How zero‑copy streaming works
SwiftLM exploits MoE sparsity. In Qwen3.5‑122B‑A10B each layer has multiple experts, and for any given token only the top K=4 are active. Non‑expert parts (attention, normalization, router, etc.) stay resident in GPU memory; only the expert weights are read from NVMe on demand.
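The routing decision that determines which expert files ever get touched is ordinary top-K selection over the router logits. A sketch (function name hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Given router logits for one token, return the indices of the K experts
// whose weights actually need to be fetched; every other expert for this
// layer stays on SSD and costs nothing.
std::vector<size_t> topKExperts(const std::vector<float>& logits, size_t k = 4) {
    std::vector<size_t> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](size_t a, size_t b) { return logits[a] > logits[b]; });
    idx.resize(k);
    return idx;
}
```

With K=4 active out of many experts per layer, only a small fraction of the 60 GB of weights is ever needed for any single token, which is what makes streaming viable at all.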
```mermaid
graph TD
A[Inference request] --> B[Non-expert layers<br/>GPU-resident]
B --> C[MoE router<br/>selects top K=4]
C --> D[NVMe SSD<br/>expert weight files]
D -->|zero-copy read| E[GPU command buffer<br/>direct transfer]
E --> F[Metal GPU<br/>expert forward pass]
F --> G[Output token]
style D fill:#fff3e0,stroke:#ff9800
style E fill:#e3f2fd,stroke:#2196f3
```
“Zero‑copy” means the data read from SSD is passed straight into the GPU command buffer without detouring through a user‑space buffer. Typical file I/O goes SSD → OS kernel buffer → user‑space buffer → GPU transfer; SwiftLM cuts out the intermediate copies. Apple Silicon’s Unified Memory Architecture (shared physical memory between CPU/GPU/NPU) makes this practical.
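The POSIX half of that idea can be sketched with mmap: a page-aligned, read-only mapping whose pages can back a GPU buffer directly on unified memory (Metal's newBufferWithBytesNoCopy requires page-aligned storage; SwiftLM's actual I/O path is not published, so this is purely illustrative):

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct MappedWeights {
    void* ptr = nullptr;  // page-aligned start of the mapping (or nullptr)
    size_t len = 0;       // file size in bytes
};

// Map an expert-weight file so its pages can be handed to the GPU without an
// intermediate user-space copy. mmap returns page-aligned memory, which on
// Apple Silicon's unified memory can back a Metal buffer directly.
MappedWeights mapWeights(const char* path) {
    MappedWeights m;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat st{};
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(nullptr, size_t(st.st_size), PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.ptr = p;
            m.len = size_t(st.st_size);
        }
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
    return m;
}
```

The payoff is that the kernel faults pages in from NVMe on first GPU access, so the data never passes through a user-space staging buffer at all.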
Design comparison with Flash‑MoE and Hypura
Following Flash‑MoE and Hypura, this is the third project that leverages Apple Silicon’s NVMe bandwidth for LLM inference.
| | Flash‑MoE | Hypura | SwiftLM |
|---|---|---|---|
| Language | C + hand‑written Metal shaders | Rust + llama.cpp fork | Swift + MLX |
| Model format | Safetensors | GGUF | HuggingFace Safetensors |
| Target models | MoE only | MoE + Dense | MoE only |
| KV‑cache compression | None | None | TurboQuant V2+V3 |
| SSD I/O method | Parallel pread() | pread() + F_NOCACHE | Zero‑copy GPU buffer transfer |
| Caching | OS page cache (71% hit rate) | Custom LRU (99.5% hit rate) | Not specified |
| API | None | Ollama‑compatible | OpenAI‑compatible |
| Python dependency | None | None | None |
| iOS support | No | No | Yes |
Flash‑MoE demonstrated 4.36 tokens/s on a 397B‑parameter model running on a 48 GB M3 Max. Hypura’s strengths were a 99.5% cache hit‑rate and support for dense models. SwiftLM’s differentiation is the integration with TurboQuant for KV‑cache compression. By streaming experts and compressing KV by ~3.5×, it avoids memory pressure even for long‑context inference.
That said, SwiftLM’s README doesn’t publish clear tokens/s benchmarks. There is a Hacker News report of 2,694 MB active GPU VRAM, but speed comparisons are thinner than for Flash‑MoE or Hypura.
Constraints of 4‑bit quantization
SwiftLM’s README states that 4‑bit is the production standard for MoE models. With 2‑bit quantization, JSON syntax becomes unstable and "name" turns into \name\, which breaks OpenAI‑compatible tool calls.
This matches the Flash‑MoE experiments exactly. The Flash‑MoE article reported the same "name" → \name\ symptom under 2‑bit and recommended a 4‑bit setup. Two independent projects seeing the same failure suggests a structural limit to 2‑bit quantization for MoE models.
iOS support: SwiftLM Chat
SwiftLM Chat is a native iPhone/iPad app that downloads MLX models directly from HuggingFace and runs them on‑device. On an iPhone 13 Pro (6 GB) it runs Qwen3 1.7B.
It uses the same mlx‑swift as the server, but SSD Expert Streaming and TurboQuant are not included in the iOS build (iOS storage I/O and memory differ from macOS). The model catalog includes Qwen3, Phi‑3.5, Mistral, and Llama, with an indicator showing whether the model fits in the device’s RAM.
With no Python and no server, finishing inference purely on the Metal GPU is a very clean architecture for iOS LLMs.
The metallib dependency trap
There’s a setup pitfall to watch for. Metal GPU kernels are packaged in a pre‑compiled binary named default.metallib, which is vendored at a fixed version inside the mlx‑swift submodule.
The README warns in bold: if you copy mlx.metallib from Python’s mlx-metal pip package, the server starts but will crash during inference with a GPU kernel ABI mismatch (freed pointer was not the last allocation). mlx-metal (Python) and mlx-swift are in the same MLX family, but because they’re built from different versions, the Metal kernel binary interfaces don’t match.
```bash
# Correct: use the metallib vendored inside the submodule
cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
  .build/release/

# Wrong: copying it from the pip mlx-metal package (crashes)
# uv run --with mlx-metal python -c "...shutil.copy(metallib, ...)"
```
The MLX ecosystem has multiple language bindings — Python (mlx), Swift (mlx‑swift), and C/C++ — each on its own release cycle. As noted in the Ollama MLX backend migration, versioning is a recurring challenge; in SwiftLM the impact is worse because incompatibility happens at the Metal‑kernel binary level.
How to run
```bash
# Build from source
git clone --recursive https://github.com/SharpAI/SwiftLM
cd SwiftLM
swift build -c release

# Copy the metallib into the same directory as the binary
cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
  .build/release/

# Run the 122B MoE model with SSD streaming
.build/release/SwiftLM \
  --model mlx-community/Qwen3.5-122B-A10B-4bit \
  --stream-experts \
  --port 5413
```
--stream-experts is crucial; without it, loading the 122B model exhausts unified memory and risks a kernel panic. The --gpu-layers option can additionally cap how many layers run on the GPU.
Because the API is OpenAI‑compatible, existing SDKs and tools can connect as‑is.
```bash
curl http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-122B-A10B-4bit",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
Requirements: macOS 14.0 or later, Apple Silicon (M1 or newer), and Xcode Command Line Tools. Install the Metal Toolchain with xcodebuild -downloadComponent MetalToolchain.
Maturity and outlook
SwiftLM was built in 1–2 weeks as the ML inference backend for SharpAI’s local video security product (SharpAI Aegis). On Hacker News, some call it “vibe coding,” and the project lacks comprehensive benchmarks. Compared to Flash‑MoE, which published 58 experiment logs, the validation depth is thinner.
Even so, making TurboQuant’s theory run in native Metal and integrating SSD streaming plus an OpenAI‑compatible API is a well‑executed prototype. The rest depends on benchmarks.