SwiftLM is a Swift-based LLM inference server that integrates TurboQuant and SSD streaming into Metal shaders

Update (2026-04-23): Fixed broken internal links.

Update (2026-04-24): After writing this I actually built SwiftLM on an M1 Max 64GB, ran 122B-A10B-4bit with --stream-experts, and benchmarked 35B-A3B against Ollama and MLX-lm. The follow-up is at Running SwiftLM on M1 Max 64GB and Comparing It to Ollama and MLX-lm. The most interesting finding there is that ssd_stream=true is declared but SSD I/O never actually fires for short prompts — the implementation details are easier to see once you pair the two articles.

I previously introduced the TurboQuant KV‑cache quantization algorithm, but that was a theoretical walkthrough of the Google Research paper. SwiftLM implements TurboQuant natively in C++/Metal shaders and combines it with SSD expert streaming in a working Swift inference server. No Python runtime, no GIL — it runs directly on Apple Silicon via MLX. It even ships with an iOS app.

Overview of SwiftLM

SwiftLM is a 100% native Swift LLM inference server that provides an OpenAI‑compatible API (/v1/chat/completions). Under the hood it uses Apple’s MLX framework (mlx‑swift) and the Swift Hummingbird HTTP server.

There are two main features:

On‑the‑fly KV‑cache compression via a TurboQuant V2+V3 hybrid (~3.5×)
Zero‑copy streaming of MoE expert layers from NVMe SSD to the GPU command buffer

The target hardware is a MacBook Pro M5 Pro (64 GB unified memory), designed around running Qwen3.5‑122B‑A10B‑4bit. On iOS, the app runs Qwen3 1.7B on an iPhone 13 Pro (6 GB).

graph TD
    A[HuggingFaceモデル<br/>Safetensors] --> B[SwiftLM Server<br/>macOS]
    B --> C[MLX Swift<br/>Metal GPU]
    C --> D[TurboQuant V2+V3<br/>KVキャッシュ圧縮]
    C --> E[SSD Expert Streaming<br/>NVMeゼロコピー]
    B --> F[OpenAI互換API<br/>localhost:5413]
    
    G[SwiftLM Chat<br/>iOS App] --> H[MLX Swift<br/>オンデバイス推論]
    H --> I[iPhone/iPad<br/>Metal GPU]
    
    style D fill:#e8f5e9,stroke:#4caf50
    style E fill:#e3f2fd,stroke:#2196f3

TurboQuant V2+V3 Hybrid Implementation

In Hypura’s NVMe Streaming and TurboQuant’s KV Cache Quantization I explained TurboQuant’s theory (PolarQuant’s polar transform plus QJL’s 1‑bit sign correction). What’s interesting about SwiftLM is how it fuses “the speed of V2 and the quality of V3” to make the theory practical on a Metal GPU.

Open‑source TurboQuant implementations have largely split into two approaches:

Variant	Quantization method	Speed	Quality
V2 (hardware‑accelerated)	Linear affine quantization	Fast (uses MLX `quantized_matmul`)	Noticeable degradation at 3‑bit
V3 (paper‑faithful)	Nonlinear Lloyd‑Max codebook	Slow (software dequantization)	High quality

V2, as adopted by projects like turboquant-mlx, can leverage MLX’s hardware‑accelerated matrix ops so it’s fast, but linear (affine) quantization shows its limits at 3‑bit. V3 computes an optimal nonlinear codebook with the Lloyd‑Max algorithm, so quality is high, but dequantization has to run in software and is slow.

SwiftLM ports the V3 Lloyd‑Max codebook into a native C++ encoding path and performs dequantization with a fused Metal shader (bggml-metal). In other words, it keeps V3‑level quality while dequantizing at GPU speed.

What is a Lloyd‑Max codebook?

Lloyd‑Max is a scalar quantization optimization method proposed in the 1960s that iteratively finds quantization levels minimizing mean squared error (MSE) for a given input distribution. Unlike affine quantization, which rounds on an evenly spaced grid, Lloyd‑Max places more levels where data density is high and fewer where it’s low.

For TurboQuant, theory shows that the coordinates after PolarQuant’s angular transform follow a standard normal distribution N(0, 1/d). Because that distribution is known a priori, you can precompute an 8‑level (3‑bit) Lloyd‑Max codebook once before inference. A key benefit of TurboQuant is that no data‑dependent calibration is needed, but because the codebook itself is nonlinear, GPU dequantization requires a custom shader.

KV‑cache compression breakdown

SwiftLM uses different strategies for the K (Key) and V (Value) caches.

The K‑cache is quantized at 4.25 bits per dimension.

Extract the L2 norm and normalize
Apply a Fast Walsh‑Hadamard Transform (WHT) to spread outliers evenly
Quantize with a 3‑bit nonlinear Lloyd‑Max centroid
Compute the residual between the original vector and the quantized approximation
Project the residual with a random Johnson–Lindenstrauss (QJL) matrix and keep a 1‑bit sign

The V‑cache is 3.125 bits per dimension. Because it isn’t used for dot‑product attention scoring, QJL’s error correction doesn’t help; by disabling QJL for the V‑cache, SwiftLM squeezes another 25% memory reduction without harming quality.

Taken together this is about 3.6 bits/dim — roughly 3.5× compression versus FP16. As covered in Hypura’s NVMe Streaming and TurboQuant’s KV Cache Quantization, PolarQuant’s angular concentration makes the overhead for quantization constants (scale/zero‑point) essentially zero, so the effective bit‑width closely matches the nominal value. This contrasts with KIVI’s INT2, which inflates to about 2.25 effective bits.

Role of the Walsh‑Hadamard Transform (WHT)

WHT is applied before quantizing the K‑cache and acts as a rotation that evenly spreads out outliers across dimensions. LLM attention KV vectors tend to concentrate energy in a few channels; if you quantize as‑is, those channels suffer large errors.

WHT is an orthogonal transform built from additions and subtractions, with a divide‑and‑conquer structure similar to FFT. Because it requires no multiplications, it lives up to the “Fast” in its name. In SwiftLM it’s implemented as C kernels in turbo-wht.h, ported from TheTom/llama-cpp-turboquant.

SSD Expert Streaming

The second core feature of SwiftLM is zero‑copy streaming of MoE expert layers from NVMe SSD to the GPU command buffer. Enable it with the --stream-experts flag.

Why it’s needed

Qwen3.5‑122B‑A10B is a 122‑billion‑parameter MoE model; even at 4‑bit, the full weights are roughly 60 GB. On a system with 64 GB unified memory, adding the KV cache and OS working set pushes memory to the brink. When macOS unified memory exceeds the physical limit, virtual memory swapping kicks in; in the worst case the Watchdog detects a GPU hang and triggers a kernel panic.

Historically, people left this to macOS’s virtual memory and let it swap. But the resulting SSD I/O pattern isn’t aligned with the inference pipeline, causing frequent random page faults — the same issue seen in the Flash‑MoE article where mmap made performance 5× worse.

How zero‑copy streaming works

SwiftLM exploits MoE sparsity. In Qwen3.5‑122B‑A10B each layer has 256 experts, and for any given token only the top K=8 are active (num_experts_per_tok: 8). Non‑expert parts (attention, normalization, router, etc.) stay resident in GPU memory; only the expert weights are read from NVMe on demand.

graph TD
    A[推論リクエスト] --> B[非エキスパート層<br/>GPU常駐]
    B --> C[MoEルーター<br/>上位K=8を選択]
    C --> D[NVMe SSD<br/>エキスパート重みファイル]
    D -->|ゼロコピー読み込み| E[GPUコマンドバッファ<br/>直接転送]
    E --> F[Metal GPU<br/>エキスパート順伝播]
    F --> G[出力トークン]
    
    style D fill:#fff3e0,stroke:#ff9800
    style E fill:#e3f2fd,stroke:#2196f3

“Zero‑copy” means the data read from SSD is passed straight into the GPU command buffer without detouring through a user‑space buffer. Typical file I/O goes SSD → OS kernel buffer → user‑space buffer → GPU transfer; SwiftLM cuts out the intermediate copies. Apple Silicon’s Unified Memory Architecture (shared physical memory between CPU/GPU/NPU) makes this practical.

Design comparison with Flash‑MoE and Hypura

Following Flash‑MoE and Hypura, this is the third project that leverages Apple Silicon’s NVMe bandwidth for LLM inference.

	Flash‑MoE	Hypura	SwiftLM
Language	C + hand‑written Metal shaders	Rust + llama.cpp fork	Swift + MLX
Model format	Safetensors	GGUF	HuggingFace Safetensors
Target models	MoE only	MoE + Dense	MoE only
KV‑cache compression	None	None	TurboQuant V2+V3
SSD I/O method	Parallel `pread()`	`pread()` + `F_NOCACHE`	Zero‑copy GPU buffer transfer
Caching	OS page cache (71%)	Custom LRU (99.5%)	Not specified
API	None	Ollama‑compatible	OpenAI‑compatible
Python dependency	None	None	None
iOS support	No	No	Yes

Flash‑MoE demonstrated 4.36 tokens/s on a 397B‑parameter model running on a 48 GB M3 Max. Hypura’s strengths were a 99.5% cache hit‑rate and support for dense models. SwiftLM’s differentiation is the integration with TurboQuant for KV‑cache compression. By streaming experts and compressing KV by ~3.5×, it avoids memory pressure even for long‑context inference.

That said, SwiftLM’s README doesn’t publish clear tokens/s benchmarks. There is a Hacker News report of 2,694 MB active GPU VRAM, but speed comparisons are thinner than for Flash‑MoE or Hypura.

Constraints of 4‑bit quantization

SwiftLM’s README states that 4‑bit is the production standard for MoE models. With 2‑bit quantization, JSON syntax becomes unstable and "name" turns into \name\, which breaks OpenAI‑compatible tool calls.

This matches the Flash‑MoE experiments exactly. The Flash‑MoE article reported the same "name" → \name\ symptom under 2‑bit and recommended a 4‑bit setup. Two independent projects seeing the same failure suggests a structural limit to 2‑bit quantization for MoE models.

iOS support: SwiftLM Chat

SwiftLM Chat is a native iPhone/iPad app that downloads MLX models directly from HuggingFace and runs them on‑device. On an iPhone 13 Pro (6 GB) it runs Qwen3 1.7B.

It uses the same mlx‑swift as the server, but SSD Expert Streaming and TurboQuant are not included in the iOS build (iOS storage I/O and memory differ from macOS). The model catalog includes Qwen3, Phi‑3.5, Mistral, and Llama, with an indicator showing whether the model fits in the device’s RAM.

With no Python and no server, finishing inference purely on the Metal GPU is a very clean architecture for iOS LLMs.

The metallib dependency trap

There’s a setup pitfall to watch for. Metal GPU kernels are packaged in a pre‑compiled binary named default.metallib, which is vendored at a fixed version inside the mlx‑swift submodule.

The README warns in bold: if you copy mlx.metallib from Python’s mlx-metal pip package, the server starts but will crash during inference with a GPU kernel ABI mismatch (freed pointer was not the last allocation). mlx-metal (Python) and mlx-swift are in the same MLX family, but because they’re built from different versions, the Metal kernel binary interfaces don’t match.

# What actually works: grab mlx.metallib from the official Release tarball
gh release download b554 --repo SharpAI/SwiftLM \
  --pattern "SwiftLM-b554-macos-arm64.tar.gz"
tar -xzf SwiftLM-b554-macos-arm64.tar.gz
cp mlx.metallib .build/release/mlx.metallib

# Wrong approach: copying from pip mlx-metal (causes crash)
# uv run --with mlx-metal python -c "...shutil.copy(metallib, ...)"

The expected filename is mlx.metallib, not default.metallib. If you leave it named default.metallib, the server dies with “library not found” on load. Details in Running SwiftLM on M1 Max 64GB and Comparing It to Ollama and MLX-lm.

The MLX ecosystem has multiple language bindings — Python (mlx), Swift (mlx‑swift), and C/C++ — each on its own release cycle. As noted in the Ollama MLX backend migration, versioning is a recurring challenge; in SwiftLM the impact is worse because incompatibility happens at the Metal‑kernel binary level.

How to run

# Build from source (use --product SwiftLM to skip the SwiftBuddy compile error)
git clone --recursive https://github.com/SharpAI/SwiftLM
cd SwiftLM
swift build -c release --product SwiftLM

# Grab mlx.metallib from the official Release
gh release download b554 --repo SharpAI/SwiftLM \
  --pattern "SwiftLM-b554-macos-arm64.tar.gz"
tar -xzf SwiftLM-b554-macos-arm64.tar.gz
cp mlx.metallib .build/release/mlx.metallib

# Run the 122B MoE model with SSD streaming
.build/release/SwiftLM \
  --model mlx-community/Qwen3.5-122B-A10B-4bit \
  --stream-experts \
  --port 5413

--stream-experts is crucial; without it, the 122B model will consume unified memory and risk a kernel panic. You can also use the --gpu-layers option to cap how many layers run on the GPU.

Because the API is OpenAI‑compatible, existing SDKs and tools can connect as‑is.

curl http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-122B-A10B-4bit",
    "stream": true,
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

Requirements: macOS 14.0 or later, Apple Silicon (M1 or newer), and Xcode Command Line Tools. Install the Metal Toolchain with xcodebuild -downloadComponent MetalToolchain.

Maturity and outlook

SwiftLM was built in 1–2 weeks as the ML inference backend for SharpAI’s local video security product (SharpAI Aegis). On Hacker News, some call it “vibe coding,” and the project lacks comprehensive benchmarks. Compared to Flash‑MoE, which published 58 experiment logs, the validation depth is thinner.

Even so, making TurboQuant’s theory run in native Metal and integrating SSD streaming plus an OpenAI‑compatible API is a well‑executed prototype. The rest depends on benchmarks.

Update (2026-04-25): corrections from the actual runs

After the initial draft I built and ran SwiftLM end-to-end. Both Running SwiftLM on M1 Max 64GB and Comparing It to Ollama and MLX-lm and Running a Non-Qwen MoE on SwiftLM: Ling-flash-2.0 MXFP4 on M1 Max 64GB exposed several points that need correcting in this piece.

“Target models: MoE only” is inaccurate. mlx-swift-lm’s Models/ directory has over 50 architectures ready to go (Llama, Gemma, Mistral, Cohere, DeepseekV3, BailingMoe, GLM4, Olmo, and more). Dense models run fine. Only the SSD Expert Streaming feature is MoE-specific.
“SSD I/O: Zero-copy GPU buffer transfer” is design intent, not observed behavior. Reading the source, the implementation is straightforward dispatch_io plus POSIX pread, structurally in the same family as Flash-MoE’s “parallel pread”. More importantly, measurements show swiftlm_ssd_bytes_read_total remains 0 at the end of inference even when ssd_stream=true is set. Apple Silicon’s unified memory plus mmap plus the OS page cache mean SSD I/O never fires in practice.
“Caching: Not specified” needs an update. The actual behavior is OS page cache via mmap. Classify it alongside Flash-MoE.
TurboQuant default behavior. With --stream-experts, the startup log prints turbo_kv=disabled. KV-cache compression is opt-in and isn’t on by default, even though the README presents it as a headline feature.

Put simply, this article walks through SwiftLM’s design intent and README claims as a theoretical piece, and diverges from observed implementation behavior in several places. The two hands-on articles carry the measured numbers; read all three together for the full picture.