SwiftLM is a Swift-based LLM inference server that integrates TurboQuant and SSD streaming into Metal shaders
Contents
Update (2026-04-23): Fixed broken internal links.
Update (2026-04-24): After writing this I actually built SwiftLM on an M1 Max 64GB, ran 122B-A10B-4bit with --stream-experts, and benchmarked 35B-A3B against Ollama and MLX-lm. The follow-up is at Running SwiftLM on M1 Max 64GB and Comparing It to Ollama and MLX-lm. The most interesting finding there is that ssd_stream=true is declared but SSD I/O never actually fires for short prompts — the implementation details are easier to see once you pair the two articles.
I previously introduced the TurboQuant KV‑cache quantization algorithm, but that was a theoretical walkthrough of the Google Research paper. SwiftLM implements TurboQuant natively in C++/Metal shaders and combines it with SSD expert streaming in a working Swift inference server. No Python runtime, no GIL — it runs directly on Apple Silicon via MLX. It even ships with an iOS app.
Overview of SwiftLM
SwiftLM is a 100% native Swift LLM inference server that provides an OpenAI‑compatible API (/v1/chat/completions). Under the hood it uses Apple’s MLX framework (mlx‑swift) and the Swift Hummingbird HTTP server.
There are two main features:
- On‑the‑fly KV‑cache compression via a TurboQuant V2+V3 hybrid (~3.5×)
- Zero‑copy streaming of MoE expert layers from NVMe SSD to the GPU command buffer
The target hardware is a MacBook Pro M5 Pro (64 GB unified memory), designed around running Qwen3.5‑122B‑A10B‑4bit. On iOS, the app runs Qwen3 1.7B on an iPhone 13 Pro (6 GB).
graph TD
A[HuggingFaceモデル<br/>Safetensors] --> B[SwiftLM Server<br/>macOS]
B --> C[MLX Swift<br/>Metal GPU]
C --> D[TurboQuant V2+V3<br/>KVキャッシュ圧縮]
C --> E[SSD Expert Streaming<br/>NVMeゼロコピー]
B --> F[OpenAI互換API<br/>localhost:5413]
G[SwiftLM Chat<br/>iOS App] --> H[MLX Swift<br/>オンデバイス推論]
H --> I[iPhone/iPad<br/>Metal GPU]
style D fill:#e8f5e9,stroke:#4caf50
style E fill:#e3f2fd,stroke:#2196f3
TurboQuant V2+V3 Hybrid Implementation
In Hypura’s NVMe Streaming and TurboQuant’s KV Cache Quantization I explained TurboQuant’s theory (PolarQuant’s polar transform plus QJL’s 1‑bit sign correction). What’s interesting about SwiftLM is how it fuses “the speed of V2 and the quality of V3” to make the theory practical on a Metal GPU.
Open‑source TurboQuant implementations have largely split into two approaches:
| Variant | Quantization method | Speed | Quality |
|---|---|---|---|
| V2 (hardware‑accelerated) | Linear affine quantization | Fast (uses MLX quantized_matmul) | Noticeable degradation at 3‑bit |
| V3 (paper‑faithful) | Nonlinear Lloyd‑Max codebook | Slow (software dequantization) | High quality |
V2, as adopted by projects like turboquant-mlx, can leverage MLX’s hardware‑accelerated matrix ops so it’s fast, but linear (affine) quantization shows its limits at 3‑bit. V3 computes an optimal nonlinear codebook with the Lloyd‑Max algorithm, so quality is high, but dequantization has to run in software and is slow.
SwiftLM ports the V3 Lloyd‑Max codebook into a native C++ encoding path and performs dequantization with a fused Metal shader (bggml-metal). In other words, it keeps V3‑level quality while dequantizing at GPU speed.
What is a Lloyd‑Max codebook?
Lloyd‑Max is a scalar quantization optimization method proposed in the 1960s that iteratively finds quantization levels minimizing mean squared error (MSE) for a given input distribution. Unlike affine quantization, which rounds on an evenly spaced grid, Lloyd‑Max places more levels where data density is high and fewer where it’s low.
For TurboQuant, theory shows that the coordinates after PolarQuant’s angular transform follow a standard normal distribution N(0, 1/d). Because that distribution is known a priori, you can precompute an 8‑level (3‑bit) Lloyd‑Max codebook once before inference. A key benefit of TurboQuant is that no data‑dependent calibration is needed, but because the codebook itself is nonlinear, GPU dequantization requires a custom shader.
KV‑cache compression breakdown
SwiftLM uses different strategies for the K (Key) and V (Value) caches.
The K‑cache is quantized at 4.25 bits per dimension.
- Extract the L2 norm and normalize
- Apply a Fast Walsh‑Hadamard Transform (WHT) to spread outliers evenly
- Quantize with a 3‑bit nonlinear Lloyd‑Max centroid
- Compute the residual between the original vector and the quantized approximation
- Project the residual with a random Johnson–Lindenstrauss (QJL) matrix and keep a 1‑bit sign
The V‑cache is 3.125 bits per dimension. Because it isn’t used for dot‑product attention scoring, QJL’s error correction doesn’t help; by disabling QJL for the V‑cache, SwiftLM squeezes another 25% memory reduction without harming quality.
Taken together this is about 3.6 bits/dim — roughly 3.5× compression versus FP16. As covered in Hypura’s NVMe Streaming and TurboQuant’s KV Cache Quantization, PolarQuant’s angular concentration makes the overhead for quantization constants (scale/zero‑point) essentially zero, so the effective bit‑width closely matches the nominal value. This contrasts with KIVI’s INT2, which inflates to about 2.25 effective bits.
Role of the Walsh‑Hadamard Transform (WHT)
WHT is applied before quantizing the K‑cache and acts as a rotation that evenly spreads out outliers across dimensions. LLM attention KV vectors tend to concentrate energy in a few channels; if you quantize as‑is, those channels suffer large errors.
WHT is an orthogonal transform built from additions and subtractions, with a divide‑and‑conquer structure similar to FFT. Because it requires no multiplications, it lives up to the “Fast” in its name. In SwiftLM it’s implemented as C kernels in turbo-wht.h, ported from TheTom/llama-cpp-turboquant.
SSD Expert Streaming
The second core feature of SwiftLM is zero‑copy streaming of MoE expert layers from NVMe SSD to the GPU command buffer. Enable it with the --stream-experts flag.
Why it’s needed
Qwen3.5‑122B‑A10B is a 122‑billion‑parameter MoE model; even at 4‑bit, the full weights are roughly 60 GB. On a system with 64 GB unified memory, adding the KV cache and OS working set pushes memory to the brink. When macOS unified memory exceeds the physical limit, virtual memory swapping kicks in; in the worst case the Watchdog detects a GPU hang and triggers a kernel panic.
Historically, people left this to macOS’s virtual memory and let it swap. But the resulting SSD I/O pattern isn’t aligned with the inference pipeline, causing frequent random page faults — the same issue seen in the Flash‑MoE article where mmap made performance 5× worse.
How zero‑copy streaming works
SwiftLM exploits MoE sparsity. In Qwen3.5‑122B‑A10B each layer has 256 experts, and for any given token only the top K=8 are active (num_experts_per_tok: 8). Non‑expert parts (attention, normalization, router, etc.) stay resident in GPU memory; only the expert weights are read from NVMe on demand.
graph TD
A[推論リクエスト] --> B[非エキスパート層<br/>GPU常駐]
B --> C[MoEルーター<br/>上位K=8を選択]
C --> D[NVMe SSD<br/>エキスパート重みファイル]
D -->|ゼロコピー読み込み| E[GPUコマンドバッファ<br/>直接転送]
E --> F[Metal GPU<br/>エキスパート順伝播]
F --> G[出力トークン]
style D fill:#fff3e0,stroke:#ff9800
style E fill:#e3f2fd,stroke:#2196f3
“Zero‑copy” means the data read from SSD is passed straight into the GPU command buffer without detouring through a user‑space buffer. Typical file I/O goes SSD → OS kernel buffer → user‑space buffer → GPU transfer; SwiftLM cuts out the intermediate copies. Apple Silicon’s Unified Memory Architecture (shared physical memory between CPU/GPU/NPU) makes this practical.
Design comparison with Flash‑MoE and Hypura
Following Flash‑MoE and Hypura, this is the third project that leverages Apple Silicon’s NVMe bandwidth for LLM inference.
| Flash‑MoE | Hypura | SwiftLM | |
|---|---|---|---|
| Language | C + hand‑written Metal shaders | Rust + llama.cpp fork | Swift + MLX |
| Model format | Safetensors | GGUF | HuggingFace Safetensors |
| Target models | MoE only | MoE + Dense | MoE only |
| KV‑cache compression | None | None | TurboQuant V2+V3 |
| SSD I/O method | Parallel pread() | pread() + F_NOCACHE | Zero‑copy GPU buffer transfer |
| Caching | OS page cache (71%) | Custom LRU (99.5%) | Not specified |
| API | None | Ollama‑compatible | OpenAI‑compatible |
| Python dependency | None | None | None |
| iOS support | No | No | Yes |
Flash‑MoE demonstrated 4.36 tokens/s on a 397B‑parameter model running on a 48 GB M3 Max. Hypura’s strengths were a 99.5% cache hit‑rate and support for dense models. SwiftLM’s differentiation is the integration with TurboQuant for KV‑cache compression. By streaming experts and compressing KV by ~3.5×, it avoids memory pressure even for long‑context inference.
That said, SwiftLM’s README doesn’t publish clear tokens/s benchmarks. There is a Hacker News report of 2,694 MB active GPU VRAM, but speed comparisons are thinner than for Flash‑MoE or Hypura.
Constraints of 4‑bit quantization
SwiftLM’s README states that 4‑bit is the production standard for MoE models. With 2‑bit quantization, JSON syntax becomes unstable and "name" turns into \name\, which breaks OpenAI‑compatible tool calls.
This matches the Flash‑MoE experiments exactly. The Flash‑MoE article reported the same "name" → \name\ symptom under 2‑bit and recommended a 4‑bit setup. Two independent projects seeing the same failure suggests a structural limit to 2‑bit quantization for MoE models.
iOS support: SwiftLM Chat
SwiftLM Chat is a native iPhone/iPad app that downloads MLX models directly from HuggingFace and runs them on‑device. On an iPhone 13 Pro (6 GB) it runs Qwen3 1.7B.
It uses the same mlx‑swift as the server, but SSD Expert Streaming and TurboQuant are not included in the iOS build (iOS storage I/O and memory differ from macOS). The model catalog includes Qwen3, Phi‑3.5, Mistral, and Llama, with an indicator showing whether the model fits in the device’s RAM.
With no Python and no server, finishing inference purely on the Metal GPU is a very clean architecture for iOS LLMs.
The metallib dependency trap
There’s a setup pitfall to watch for. Metal GPU kernels are packaged in a pre‑compiled binary named default.metallib, which is vendored at a fixed version inside the mlx‑swift submodule.
The README warns in bold: if you copy mlx.metallib from Python’s mlx-metal pip package, the server starts but will crash during inference with a GPU kernel ABI mismatch (freed pointer was not the last allocation). mlx-metal (Python) and mlx-swift are in the same MLX family, but because they’re built from different versions, the Metal kernel binary interfaces don’t match.
# What actually works: grab mlx.metallib from the official Release tarball
gh release download b554 --repo SharpAI/SwiftLM \
--pattern "SwiftLM-b554-macos-arm64.tar.gz"
tar -xzf SwiftLM-b554-macos-arm64.tar.gz
cp mlx.metallib .build/release/mlx.metallib
# Wrong approach: copying from pip mlx-metal (causes crash)
# uv run --with mlx-metal python -c "...shutil.copy(metallib, ...)"
The expected filename is mlx.metallib, not default.metallib. If you leave it named default.metallib, the server dies with “library not found” on load. Details in Running SwiftLM on M1 Max 64GB and Comparing It to Ollama and MLX-lm.
The MLX ecosystem has multiple language bindings — Python (mlx), Swift (mlx‑swift), and C/C++ — each on its own release cycle. As noted in the Ollama MLX backend migration, versioning is a recurring challenge; in SwiftLM the impact is worse because incompatibility happens at the Metal‑kernel binary level.
How to run
# Build from source (use --product SwiftLM to skip the SwiftBuddy compile error)
git clone --recursive https://github.com/SharpAI/SwiftLM
cd SwiftLM
swift build -c release --product SwiftLM
# Grab mlx.metallib from the official Release
gh release download b554 --repo SharpAI/SwiftLM \
--pattern "SwiftLM-b554-macos-arm64.tar.gz"
tar -xzf SwiftLM-b554-macos-arm64.tar.gz
cp mlx.metallib .build/release/mlx.metallib
# Run the 122B MoE model with SSD streaming
.build/release/SwiftLM \
--model mlx-community/Qwen3.5-122B-A10B-4bit \
--stream-experts \
--port 5413
--stream-experts is crucial; without it, the 122B model will consume unified memory and risk a kernel panic. You can also use the --gpu-layers option to cap how many layers run on the GPU.
Because the API is OpenAI‑compatible, existing SDKs and tools can connect as‑is.
curl http://localhost:5413/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-122B-A10B-4bit",
"stream": true,
"messages": [
{"role": "user", "content": "Hello"}
]
}'
Requirements: macOS 14.0 or later, Apple Silicon (M1 or newer), and Xcode Command Line Tools. Install the Metal Toolchain with xcodebuild -downloadComponent MetalToolchain.
Maturity and outlook
SwiftLM was built in 1–2 weeks as the ML inference backend for SharpAI’s local video security product (SharpAI Aegis). On Hacker News, some call it “vibe coding,” and the project lacks comprehensive benchmarks. Compared to Flash‑MoE, which published 58 experiment logs, the validation depth is thinner.
Even so, making TurboQuant’s theory run in native Metal and integrating SSD streaming plus an OpenAI‑compatible API is a well‑executed prototype. The rest depends on benchmarks.
Update (2026-04-25): corrections from the actual runs
After the initial draft I built and ran SwiftLM end-to-end. Both Running SwiftLM on M1 Max 64GB and Comparing It to Ollama and MLX-lm and Running a Non-Qwen MoE on SwiftLM: Ling-flash-2.0 MXFP4 on M1 Max 64GB exposed several points that need correcting in this piece.
- “Target models: MoE only” is inaccurate. mlx-swift-lm’s
Models/directory has over 50 architectures ready to go (Llama, Gemma, Mistral, Cohere, DeepseekV3, BailingMoe, GLM4, Olmo, and more). Dense models run fine. Only the SSD Expert Streaming feature is MoE-specific. - “SSD I/O: Zero-copy GPU buffer transfer” is design intent, not observed behavior. Reading the source, the implementation is straightforward
dispatch_ioplus POSIXpread, structurally in the same family as Flash-MoE’s “parallel pread”. More importantly, measurements showswiftlm_ssd_bytes_read_totalremains 0 at the end of inference even whenssd_stream=trueis set. Apple Silicon’s unified memory plus mmap plus the OS page cache mean SSD I/O never fires in practice. - “Caching: Not specified” needs an update. The actual behavior is OS page cache via mmap. Classify it alongside Flash-MoE.
- TurboQuant default behavior. With
--stream-experts, the startup log printsturbo_kv=disabled. KV-cache compression is opt-in and isn’t on by default, even though the README presents it as a headline feature.
Put simply, this article walks through SwiftLM’s design intent and README claims as a theoretical piece, and diverges from observed implementation behavior in several places. The two hands-on articles carry the measured numbers; read all three together for the full picture.