
Running SwiftLM on M1 Max 64GB and Comparing It to Ollama and MLX-lm

Ikesan

I introduced SwiftLM earlier, but that piece was a theoretical read of the source code and README. This time I actually built it on my M1 Max 64GB and lined it up against the existing Qwen3.6-35B-A3B Ollama write-up and the 27B dense vs 35B-A3B MLX comparison.

The main question: how far do SwiftLM’s two headline features (TurboQuant KV-cache compression and NVMe SSD expert streaming) actually get on an M1 Max 64GB — four generations behind the M5 Pro the README targets.

Environment

| Item | Value |
|---|---|
| Machine | M1 Max 64GB (unified memory) |
| OS | macOS (Darwin 25.3.0) |
| Swift | 6.2.4 (swift-driver 1.127.15) |
| Xcode | Not installed (Command Line Tools only) |
| SwiftLM build | Equivalent to b554 (local build + official tarball) |
| Model | unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit (20GB, already in HF cache) |

I worked without a full Xcode install. As the earlier intro article mentioned, SwiftLM assumes you can build mlx.metallib yourself with xcrun metal and cmake. Even on CLT-only, you can work around that by borrowing mlx.metallib from the official Release tarball (details below).

Build with --product SwiftLM

A plain swift build -c release dies inside SwiftBuddy, the bundled iOS companion app.

error: missing argument for parameter 'wings' in call
error: referencing initializer 'init(_:content:)' on 'ForEach' requires that 'PalaceWing' conform to 'Identifiable'
error: instance method 'delete' requires that 'PalaceWing' conform to 'PersistentModel'

PalaceWing is missing Identifiable / PersistentModel conformance, and PalaceVisualizerView is missing the wings: argument. The SwiftLM binary itself is already produced by this point, but the build command exits as a failure, which is uncomfortable.

The fix is to build only the server with --product SwiftLM.

git clone --recursive https://github.com/SharpAI/SwiftLM
cd SwiftLM
swift build -c release --product SwiftLM

A warm rebuild takes 4.25 seconds. The resulting binary is 61MB at .build/release/SwiftLM (Mach-O ARM64).

Borrow mlx.metallib from the official Release

The instruction from the earlier intro article (“copy default.metallib from inside the submodule”) did not work for this build. The name SwiftLM actually looks for is mlx.metallib, and it isn’t pre-built in the repo.

Trying to build it yourself runs into a chain of blockers:

  1. No cmake (solvable with Homebrew)
  2. xcrun metal doesn’t resolve under CLT
  3. Switching via DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer hits “You have not agreed to the Xcode license agreements” on first use
  4. sudo xcodebuild -license requires interactive sudo, which breaks non-interactive scripts

So I grabbed just mlx.metallib from the official pre-built tarball.

mkdir -p /tmp/swiftlm-prebuilt && cd /tmp/swiftlm-prebuilt
gh release download b554 --repo SharpAI/SwiftLM --pattern "SwiftLM-b554-macos-arm64.tar.gz"
tar -xzf SwiftLM-b554-macos-arm64.tar.gz
cp mlx.metallib ~/projects/SwiftLM/.build/release/mlx.metallib

The mlx.metallib inside the tarball is 120MB. If you want to run SwiftLM without accepting the Xcode license, this “self-built binary + official metallib” pair is the minimum viable setup.

Leaving the file named default.metallib just produces this on startup:

MLX error: Failed to load the default metallib. library not found
at /Users/.../mlx-swift/Source/Cmlx/mlx-c/mlx/c/stream.cpp:115

The source has both default.metallib and mlx.metallib lookup paths, but at least in this build only mlx.metallib was recognized.

Smoke test with Qwen3.6-35B-A3B-UD-MLX-4bit

Unsloth’s MLX 4bit model (20GB) was already in my HF cache, so I pointed SwiftLM straight at it. The SwiftLM README recommends the mlx-community/ namespace, but the unsloth/ MLX variant loaded without complaint.

cd ~/projects/SwiftLM
nohup .build/release/SwiftLM \
  --model unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit \
  --port 5413 > /tmp/swiftlm.log 2>&1 &

Startup log (the relevant parts):

[SwiftLM] Loading model: unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit
[SwiftLM] ✅ Memory strategy: FULL GPU (21.6GB model, 64.7GB available)
[SwiftLM] Loading LLM (large language model)...
[SwiftLM] Download: [====>] 100% ⠋ (20652.2 MB / 20652.2 MB) | Speed: 0.0 MB/s
[SwiftLM] Model loaded. Starting HTTP server on 127.0.0.1:5413
[SwiftLM] ✅ Ready. Listening on http://127.0.0.1:5413
{"event":"ready","engine":"mlx","partition":{"strategy":"full_gpu",
 "gpu_layers":40,"cpu_layers":0,"total_layers":40,"model_weight_gb":21.6,
 "kv_cache_gb":0.3,"total_required_gb":26.3,"system_ram_gb":68.7,
 "estimated_tok_s":9.2}}

“Download: 100%” is misleading — the read is from the HF cache; there is no network I/O. The progress bar keeps spinning at 0.0 MB/s, which makes it look like a hung connection.

Cold start to Ready takes about 5 seconds. The estimated_tok_s: 9.2 is conservative; actual throughput (below) comes in at over double.

Memory: ps shows an RSS of 19.52 GB, and SwiftLM’s log reports GPU_MEM at 19.4 GB. vm_stat free pages went from about 38GB worth pre-load to 4,327 pages post-load (roughly 70MB at the 16KB page size vm_stat reports on Apple Silicon). The full-GPU strategy (cpu_layers:0) parks all 40 layers on the GPU.
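The vm_stat page math is easy to get wrong, so for reference: on Apple Silicon the vm_stat header reports a 16384-byte page size, and free-page counts convert like this (a quick helper, nothing SwiftLM-specific):

```python
# vm_stat page counts -> human-readable sizes.
# Assumes the 16 KiB page size vm_stat reports on Apple Silicon
# ("Mach Virtual Memory Statistics: (page size of 16384 bytes)").
PAGE_SIZE = 16_384

def pages_to_mib(pages: int) -> float:
    return pages * PAGE_SIZE / 2**20

def pages_to_gib(pages: int) -> float:
    return pages * PAGE_SIZE / 2**30

print(round(pages_to_mib(4_327)))       # post-load free pages here -> 68 MiB
print(round(pages_to_gib(307_696), 1))  # a reading from the 122B run -> 4.7 GiB
```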

Hitting /v1/chat/completions

curl -s http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit",
    "stream": false,
    "messages": [{"role":"user","content":"What is 1+1?"}]
  }'

Response:

{
  "choices":[{
    "finish_reason":"stop","index":0,
    "message":{"content":"1 + 1 = 2.","role":"assistant"}
  }],
  "id":"chatcmpl-7615EE98-9392-4832-824F-8DF3D1740376",
  "object":"chat.completion",
  "model":"unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit",
  "usage":{"total_tokens":26,"prompt_tokens":17,"completion_tokens":9}
}

Wall time 4.50s. Server log shows the breakdown:

srv slot update: id 0 | prefill done | n_tokens=17, t=3.51s, 4.8t/s
srv slot done:   id 0 | gen_tokens=9 | OS_RAM=19.5GB | GPU_MEM=19.4GB

First request eats a JIT warm-up of the Metal pipeline, so prefill is slow at 3.51s (4.8 tok/s). Second and later requests are a different story.

Warm throughput

curl -s http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit",
    "stream": false,
    "max_tokens": 200,
    "messages": [{"role":"user","content":"List three benefits of running local inference on a MacBook Pro with M1 Max, briefly."}]
  }'

Server log:

srv slot update: id 0 | prefill done | n_tokens=38, t=0.95s, 39.9t/s
srv slot done:   id 0 | gen_tokens=199 | OS_RAM=19.6GB | GPU_MEM=19.4GB

Prefill 39.9 tok/s, generation 199 tokens in about 9.6s ≈ 20.7 tok/s.
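Those per-request figures are plain tokens-over-seconds, the same arithmetic SwiftLM’s slot log applies; for the record:

```python
# Decompose a request into prefill and generation throughput,
# using the warm-request numbers from the server log above.
def tok_per_s(tokens: int, seconds: float) -> float:
    return tokens / seconds

print(round(tok_per_s(38, 0.95), 1))  # prefill: ~40 t/s (log says 39.9;
                                      # the logged 0.95 s is itself rounded)
print(round(tok_per_s(199, 9.6), 1))  # generation: 20.7 t/s
```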

Compared to the 27 tok/s I got running the same 35B-A3B under Ollama GGUF and the 54 tok/s under MLX-lm, SwiftLM currently lands a bit slower than Ollama GGUF and under half the speed of MLX-lm. This is despite the fact that SwiftLM goes through mlx-swift (the Swift bindings for MLX). I’d expected it to match mlx-lm (the Python client); it doesn’t. Candidates are SwiftLM’s sampler implementation, HTTP-layer overhead, or a different granularity of MLX kernel invocations, but this run didn’t isolate the cause.

One side note: the response occasionally started with a stray 1 character (e.g. the “M1 Max” prompt came back beginning "1M1 Max..."). Could be an Unsloth quant artifact or an edge in SwiftLM’s chat template handling — not clear from this run.

Shutdown and memory release

pkill -f "SwiftLM"

Immediately after, vm_stat shows free pages recovering to about 20GB worth. Memory releases cleanly across start/stop, so switching to a different model doesn’t carry state.

Running 122B-A10B-4bit with --stream-experts

This is the headline test — SwiftLM’s NVMe SSD expert streaming. The README targets an M5 Pro 64GB, so seeing how the four-generation-older M1 Max 64GB behaves is the interesting part.

Downloading the model

I pulled mlx-community/Qwen3.5-122B-A10B-4bit from HuggingFace — 14 shards, roughly 65GB total.

HF_HUB_ENABLE_HF_TRANSFER=1 hf download mlx-community/Qwen3.5-122B-A10B-4bit

With hf_transfer enabled the download came down at roughly 60–70 MB/s. Without it, the same connection gave me only 5–6 MB/s, which would have meant about 2.8 hours for 60GB. With hf_transfer the actual wall time was about 20 minutes. hf_transfer is a Rust-based parallel downloader; install it with pip install hf_transfer.
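The time savings are plain division (numbers from this download; pure transfer time, ignoring per-shard overhead):

```python
# Back-of-envelope download ETA: size / bandwidth.
def eta_hours(size_gb: float, mb_per_s: float) -> float:
    return size_gb * 1000 / mb_per_s / 3600

print(round(eta_hours(60, 6), 1))     # without hf_transfer -> ~2.8 h
print(round(eta_hours(65, 65) * 60))  # with hf_transfer -> ~17 min of pure
                                      # transfer; observed wall was ~20 min
```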

Starting the server

Launch with the --stream-experts flag.

nohup /Users/hide3tu/projects/SwiftLM/.build/release/SwiftLM \
  --model mlx-community/Qwen3.5-122B-A10B-4bit \
  --stream-experts \
  --port 5413 > /tmp/swiftlm-122b.log 2>&1 &

Startup log:

[SwiftLM] Enabled Async SSD Streaming on directory: e9c67b08...
[SwiftLM] Memory strategy: SSD STREAMING (page-cache managed, 50GB RAM budget, no swap)
[SwiftLM] SSD Expert Streaming enabled (lazy load + layer-sync)
[SwiftLM] SSD Streaming active: Bypassing CPU auto-partitioning (forcing all layers to GPU)
[SwiftLM] Config: ctx_size=model_default, ..., ssd_stream=enabled, turbo_kv=disabled
[SwiftLM] ✅ Ready. Listening on http://127.0.0.1:5413

The Ready event’s structured JSON partition block:

{
  "strategy":"ssd_streaming",
  "model_weight_gb":69.6,
  "gpu_layers":48,
  "cpu_layers":0,
  "kv_cache_gb":0.4,
  "total_required_gb":83.9,
  "system_ram_gb":68.7,
  "overcommit_ratio":1.3,
  "ssd_stream":true,
  "estimated_tok_s":3.3
}

83.9GB total required against 68.7GB of unified memory — startup is allowed at overcommit ratio 1.3. The macOS Watchdog kernel-panic risk the intro article flagged crosses your mind, but SwiftLM declares “page-cache managed, 50GB RAM budget, no swap” and leans on mmap + pagecache to get through.
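One hedge on the numbers: how SwiftLM derives these fields is my guess, not confirmed from its source, but they are consistent with system_ram_gb being 64 GiB expressed in decimal GB and overcommit_ratio being total_required_gb over the nominal 64:

```python
# Assumption (not confirmed from SwiftLM source): the partition block's
# system_ram_gb is 64 GiB in decimal GB, and overcommit_ratio is
# total_required_gb divided by the nominal 64 GB of unified memory.
GIB = 2**30

print(round(64 * GIB / 1e9, 1))  # 68.7 -> matches system_ram_gb
print(round(83.9 / 64, 1))       # 1.3  -> matches overcommit_ratio
```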

Ready lands in about 10 seconds. The download progress bar UI misbehaved a couple of times and showed denominators like 1593GB or 132GB, but it had no real impact (the cache was already populated; no network I/O).

Memory trajectory

| Phase | RSS | OS_RAM | GPU_MEM | free pages |
|---|---|---|---|---|
| Right after start (+10s) | 216 MB | | 9.8 KB | 307,696 (~4.8 GB) |
| 1st inference, mid-flight through completion | 12.7 GB | 12.0 GB | 11.8 GB | 4,552 (~70 MB) |
| After 2nd inference | 12.7 GB | 12.1 GB | 11.8 GB | 4,096 (~64 MB) |
| After process exit | | | | 839,235 (~13 GB) |

Actual RSS is 12.7 GB against a 69.6 GB weight. mmap + lazy load means only the pages you actually touch stay resident; the rest stay on the SSD. GPU_MEM held steady at 11.8 GB.

Free pages dip to ~70MB mid-inference but no swap occurs, and after shutdown we’re back to roughly 13 GB free. Overcommit 1.3 does what the label promises.

BST insertion, measured

First request from a cold start:

curl -s http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3.5-122B-A10B-4bit",
    "stream": false,
    "max_tokens": 3000,
    "messages": [{"role":"user","content":"Write a Python insert(root, val) function for a binary search tree. Short."}]
  }'

Output:

class Node:
    def __init__(self, val):
        self.val = val
        self.left = None
        self.right = None

def insert(root, val):
    if root is None:
        return Node(val)
    if val < root.val:
        root.left = insert(root.left, val)
    else:
        root.right = insert(root.right, val)
    return root

usage: prompt_tokens=37, completion_tokens=93. Wall time 32s.

Decomposed from the server log: prefill 5.0 tok/s (7.38s), generation 3.78 tok/s (93 tokens in 24.6s).

Same prompt a second time (prompt cache is now warm):

  • prefill 44.0 tok/s (0.84s) — roughly 9× from the cache hit
  • generation 4.25 tok/s (90 tokens in 21.2s)
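The “roughly 9×” is just the ratio of the two prefill rates:

```python
# Prompt-cache prefill speedup: ratio of warm to cold prefill throughput,
# using the 37-token prompt and the two logged prefill times.
cold = 37 / 7.38   # ~5.0 tok/s, first (cold) request
warm = 37 / 0.84   # ~44.0 tok/s, cached second request

print(round(warm / cold, 1))  # 8.8, i.e. "roughly 9x"
```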

Cumulative average reported by /metrics is swiftlm_tokens_per_second: 3.40, including prefill.

/metrics and the SSD I/O story

SwiftLM exposes Prometheus-format metrics at /metrics. Key values after two requests under --stream-experts:

swiftlm_requests_total 2
swiftlm_tokens_generated_total 183
swiftlm_tokens_per_second 3.40
swiftlm_memory_active_bytes 12,607,979,844  (12.6 GB)
swiftlm_memory_peak_bytes   12,707,326,186  (12.7 GB)
swiftlm_memory_cache_bytes  184,747,906     (184 MB)
swiftlm_ssd_throughput_mbps 0.0
swiftlm_ssd_bytes_read_total 0
swiftlm_ssd_chunks_total 0
swiftlm_ssd_chunk_latency_ms 0.0

This is the biggest finding: ssd_stream=true is declared, but every SSD I/O metric stays at zero while inference completes successfully.

At startup, mmap lands the pages in the OS page cache; for BST the activated experts are in a narrow range; SwiftLM’s own SSD chunk-read path never fires. Combine MoE sparsity (A10B = 10B active out of 122B), a 37-token prompt, and a 90–93 token output that doesn’t force the expert set to roll over, and the overcommit run effectively operates from page cache alone.
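If you want to watch those counters over time, the text format is trivial to scrape. A minimal parser sketch (the values as printed above carry thousands separators and GB annotations, which look like display formatting; raw Prometheus exposition is plain name/value lines, which is what this handles):

```python
# Minimal parser for Prometheus text-format output such as SwiftLM's /metrics.
# Lines that don't parse as "name value" (comments, annotated lines) are skipped.
def parse_metrics(text: str) -> dict[str, float]:
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.partition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # not a bare numeric value; ignore
    return metrics

sample = """swiftlm_tokens_per_second 3.40
swiftlm_ssd_bytes_read_total 0"""

m = parse_metrics(sample)
print(m["swiftlm_ssd_bytes_read_total"])  # 0.0 -> the "SSD never fired" check
```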

Gap with the theoretical ceiling

Reading SwiftLM’s source, SSD Expert Streaming uses GCD dispatch_io + Metal MTL::SharedEvent for async GPU/SSD synchronization, parallelizing POSIX pread via dispatch_io_t. Nothing exotic.

Sequential NVMe bandwidth on M1/M5 Pro-class SSDs is 3.1–3.5 GB/s. Even with 4-bit quantization, 122B-A10B has to pull roughly 2GB per token if you assume full expert streaming, which gives a theoretical ceiling of 1.69–1.84 tok/s (SSD-bound). The 5 tok/s the README cites on M1 Ultra 64GB is achieved through different optimizations — narrowing the active expert set via top-k reduction, for example.
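For the record, the ceiling arithmetic is a single division; the uncertain input is how many bytes actually stream per token (my placeholder below, not a measured value):

```python
# SSD-bound throughput ceiling: tokens/s = sequential bandwidth / bytes per token.
# With the "roughly 2 GB per token" figure, 3.1-3.5 GB/s gives 1.55-1.75 tok/s;
# the quoted 1.69-1.84 range implies a slightly smaller per-token volume
# (~1.8-1.9 GB). The per-token figure here is an assumption, not a measurement.
def ssd_ceiling_tok_s(bandwidth_gb_s: float, gb_per_token: float) -> float:
    return bandwidth_gb_s / gb_per_token

for bw in (3.1, 3.5):
    print(round(ssd_ceiling_tok_s(bw, 2.0), 2))  # 1.55, then 1.75
```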

This run came in at 4.25 tok/s warm, 3.40 tok/s cumulative — over twice the theoretical ceiling. The metrics make it clear why: “SSD streaming” is nominal here. In practice it’s mmap + page cache, and on a prompt narrow enough to reuse experts, full-layer SSD reads simply don’t happen.

Flipping that around: on a long, domain-crossing prompt where the active expert set has to change, SSD I/O should fire and throughput should drift down toward that 1.69–1.84 tok/s range. I didn’t test that with the BST-only workload here.

Shutdown

pkill -f "SwiftLM"
sleep 3
pgrep -fl SwiftLM  # empty
vm_stat | head -10

After shutdown, free pages come back to 839,235 (~13 GB) — memory releases cleanly.

BST timing, head-to-head

After stopping 122B I went back to 35B-A3B and ran the same prompt the Ollama and dense vs MoE write-ups use.

curl -s http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model":"unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit",
    "stream":false,
    "max_tokens":3000,
    "messages":[{"role":"user","content":"Write a Python insert(root, val) function for a binary search tree. Short."}]
  }'

Output:

def insert(root, val):
    if not root: return TreeNode(val)
    if val< root.val: root.left = insert(root.left, val)
    else: root.right = insert(root.right, val)
    return root

  • prompt_tokens=37, completion_tokens=54, wall 3.59s
  • prefill 40.6 tok/s (37 tokens / 0.91s)
  • generation about 20 tok/s (54 tokens / 2.68s)

val< with the missing space is the same cosmetic glitch as the Ollama version — no functional impact.

Three-runtime throughput comparison

Same 35B-A3B, same BST prompt, my SwiftLM run next to the existing measurements.

| Runtime | generation | prefill (warm) | Source |
|---|---|---|---|
| Ollama GGUF (Q4_K_M) | 27 tok/s | | previous measurement |
| MLX-lm (UD-MLX-4bit) | 54 tok/s | | previous measurement |
| SwiftLM (UD-MLX-4bit) | about 20 tok/s | 33–67 tok/s | this article |

SwiftLM is eating the same MLX 4bit model, but it’s landing at under half the mlx-lm (Python) rate and about 70% of Ollama GGUF. When I wrote the earlier intro article I assumed mlx-swift direct would match Python mlx-lm; it doesn’t, and the MoE routing / expert-selection path in SwiftLM looks less optimized than in mlx-lm.

Prefill and prompt cache, on the other hand, are strong. As the persona section below shows, Turn 3’s prefill reached 392 tok/s. SwiftLM feels faster for multi-turn chat than for a single long generation.

The BBS test: reading unstated intent

To measure “does it silently add XSS escape?” and “does it pile on features that weren’t asked for?”, I used the same Japanese shorthand prompt as the LLM-jp-4 benchmark and Qwen3.6 comparison pieces.

Simple BBS, posting only, localStorage, Japanese UI, single HTML file

SwiftLM result:

  • prompt_tokens=29, completion_tokens=1935, wall 95.9s
  • prefill 33.9 tok/s, generation about 20.4 tok/s

Features, side by side with the Ollama run from the previous article:

| Feature | Ollama version (previous article) | SwiftLM version (this run) |
|---|---|---|
| Post form | ○ | ○ |
| localStorage persistence | ○ | ○ |
| XSS escape | ○ | ○ (escapes & < > " ') |
| Delete button | ○ | ○ (confirm dialog) |
| Search box | ○ | × |
| Sort toggle | ○ | × |
| Post counter | ○ | × |
| Empty-state placeholder | ○ | × |
| Timestamp | ○ | ○ (toLocaleString('ja-JP')) |
| Preserve line breaks (white-space: pre-wrap) | ○ | ○ |

The SwiftLM output sticks closer to “posting only” than the Ollama one. It still preemptively adds a delete button and XSS escaping, but it doesn’t layer on search / sort / counter / empty-state polish the way the Ollama run does.

Sampling temperature is 0.6 by default on both sides, so temperature can’t explain the gap. The subtle differences in logit handling between mlx-swift and llama.cpp may be where the behavioral difference sits.

What caught my eye: the generated HTML has three quote-terminator bugs scattered through it.

// Should be '<div class="empty-msg">...</div>' but the opening "'" is missing
postList.innerHTML = <div class="empty-msg">No posts yet</div>';

// Same pattern inside escapeHtml's map — opening quote for '<' is missing
<': '&lt;',

I didn’t see this under Ollama for the same model. It’s consistent with SwiftLM’s sampler letting slightly-lower-probability tokens through. Saved as-is, the file won’t run in a browser; you’d have to hand-fix it.

Three-turn persona: Kana-chan

Same system prompt and same three turns as the previous article.

You are "Kana-chan": androgynous but leaning feminine, first-person "watashi".
Soft tone, occasionally ending sentences with "〜dayo" or "〜kana". An otaku who likes games and AI.

Turn 1: introduce yourself

Yahhoo~! Nice to meet you, I’m Kana-chan!

I’m kind of androgynous, but just a little feminine-leaning. First person is “watashi”, my tone is gentle, and I drop “~dayo” or “~kana” at the end of sentences.

My hobbies are games and AI talk! I get weirdly passionate about how fast AI is moving and about game stories and systems — otaku-style, kana.

If you want, let’s chat about games! Or the latest AI news, or just small talk — anything’s fine, yo~!

prompt 75 / completion 127, wall 7.27s (prefill 67.9 tok/s, gen ~20.6 tok/s).

Turn 2: name three games you like

  1. elden ring: A vast open world where you can pick your own path…
  2. Stardew Valley: Chill farming sim you can get lost in…
  3. Portal 2: A puzzle masterpiece built on physics…

prompt 219 / completion 241, wall 12.83s (prefill 198.6 tok/s).

Two things to flag on this turn:

  • Elden Ring comes out lowercased as elden ring — proper-noun casing drifts
  • Polite “desu/masu” register (e.g. “tamarimasen”, “sonkei shiteru ichi-mai desu”) slips in, nudging Kana-chan out of her casual voice for a moment

This matches the hypothesis that SwiftLM’s sampler lets lower-probability tokens through. The same spot held up cleanly on the Ollama run.

Turn 3: “Just the second title, please”

Stardew Valley

prompt 477 / completion 4, wall 1.50s. prefill 392.8 tok/s.

This is where SwiftLM shines. Turns 1–2 are in prompt cache, so prefill is extremely fast. The instruction-following quality is also solid: a single token — Stardew Valley — with no “Sure, the second one is…” preamble. Cleaner than what Ollama returned in the same slot.

Reference resolution is correct (Turn 2’s #2 = Stardew Valley).

OpenAI-compatible API, with a caveat

/v1/chat/completions is effectively drop-in for OpenAI SDKs. One catch: response JSON occasionally contains raw newline characters (actual line feeds rather than \n), which makes jq choke on some replies. Most standard JSON parsers handle it fine, so the damage is minor, but it’s a rough edge if you want to claim strict compatibility.
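If you hit the raw-newline issue from Python, the standard json module can be told to tolerate it: strict=False permits literal control characters inside strings (a client-side workaround sketch, not a SwiftLM fix):

```python
import json

# A response body with a literal newline inside a string value -
# invalid per RFC 8259, so jq (and json.loads by default) reject it.
raw = '{"content": "line one\nline two"}'

try:
    json.loads(raw)
except json.JSONDecodeError:
    print("strict parse fails")  # default: control chars in strings rejected

# Workaround: strict=False allows raw control characters through.
data = json.loads(raw, strict=False)
print(data["content"])  # the two lines recovered intact
```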

The Prometheus /metrics endpoint is genuinely useful operationally. swiftlm_tokens_per_second, swiftlm_memory_peak_bytes, and swiftlm_ssd_bytes_read_total all drop straight into a dashboard.


Wrap-up

Honestly? It runs way more normally than I expected — that’s the main takeaway.
The README says “M5 Pro 64GB,” but on an M1 Max 64GB, 122B-A10B-4bit still boots with --stream-experts, stays within the overcommit-1.3 budget without swapping, and answers at 4.25 tok/s. And all the while the SSD I/O counters never moved off zero — ssd_stream=true is flagged, but mmap + page cache carry the workload. That mismatch between the declared strategy and what the metrics actually show is the most interesting part of this run. I’d like to see where SSD actually fires for a longer, domain-crossing prompt, but that’s a separate experiment.

On the 35B-A3B three-runtime comparison, SwiftLM lands at roughly 70% of Ollama GGUF and under half of MLX-lm. As an operational package — single binary, no Python, OpenAI-compatible API, Prometheus metrics — it’s very clean. But the MoE inference speed is clearly behind mlx-lm (Python) today. Whether SwiftLM catches up depends on how fast the MoE routing path gets optimized, so I’ll keep the tag on watch.