oMLX 0.3.9.dev2 tested on M1 Max 64GB: SSD cache wins, VLM MTP slower
This is the follow-up to my earlier read of the oMLX 0.3.9.dev2 release notes. I tested everything on M1 Max 64GB and wrote up the numbers as I went.
The focus: SSD KV cache, VLM MTP (both 26B A4B and 31B Dense), DFlash engine (single-shot and 10-turn conversation), TurboQuant KV quantization, whether continuous batching does anything for 5 parallel requests, Thinking budget, oQ quantization, specprefill, Qwen2.5-VL 7B vs Gemma 4 VLM on quality and speed, cold-start tax, and whether omlx launch can actually wire up GitHub Copilot, OpenAI Codex, and Claude Code to a local LLM. 11 separate tests.
For VLM inputs I generated three “kana-chan” images with WAI-Anima (one short-question, one long-description, one OCR-leaning).
No deeper reason — I just needed consistent test images.
Test environment
| Item | Value |
|---|---|
| Machine | MacBook Pro M1 Max 64GB unified memory |
| OS | macOS 26.3 Tahoe (Build 25D125) |
| Python | 3.13.11 (in a venv) |
| oMLX | 0.3.9.dev2 (pre-release, installed from source) |
| Main model | mlx-community/gemma-4-26B-A4B-it-4bit (15.26GB) |
| MTP draft | mlx-community/gemma-4-26B-A4B-it-assistant-bf16 (0.82GB) |
| Extra models | gemma-4-31B-it-4bit + gemma-4-31B-it-assistant-bf16 |
| Metrics | TTFT, prefill tokens/sec, generation tokens/sec, generation_duration, prompt_tokens, cached_tokens |
| Method | Streaming + last chunk with stream_options.include_usage=true |
By default, oMLX reserves a paged SSD cache at ~/.omlx/cache (max 92.6GB) at startup.
I capped it with --paged-ssd-cache-max-size 20GB to avoid eating the disk.
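Concretely, the cap is just a startup flag. A minimal invocation, assuming the flag is passed straight to omlx serve:

omlx serve --paged-ssd-cache-max-size 20GB
# without the flag, ~/.omlx/cache gets reserved up to ~92.6GB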
Where oMLX install bites you
Naively, you’d try pip install omlx==0.3.9.dev2. It’s not on PyPI.
The Homebrew formula isn’t on the default tap either, so brew info omlx returns No available formula.
$ pip install --pre omlx==0.3.9.dev2
ERROR: Could not find a version that satisfies the requirement omlx==0.3.9.dev2 (from versions: none)
ERROR: No matching distribution found for omlx==0.3.9.dev2
$ brew info omlx
Error: No available formula with the name "omlx". Did you mean mlx?
The Releases page has two .dmg files (macos15-sequoia / macos26-tahoe), but as the README spells out, the macOS app does not install the omlx CLI.
If you want omlx launch copilot or omlx serve directly, clone the source and pip install -e ..
mkdir -p ~/omlx-test && cd ~/omlx-test
python3 -m venv .venv && source .venv/bin/activate
git clone --depth 1 --branch v0.3.9.dev2 https://github.com/jundot/omlx.git
cd omlx
pip install -e .
After that, omlx serve / omlx launch / omlx diagnose work.
Python 3.13.11 inside a venv builds cleanly on macOS 26.3 Tahoe / M1 Max. Dependencies include mlx 0.31.2, mlx-vlm 0.5.0, dflash-mlx 0.1.5.1.
Side note on disk: between Python wheels and MLX model caches, you can easily need close to 100GB of free space; disk is the first thing you’ll actually run out of. I had to clear ~130GB of older image-generation models from my Hugging Face cache to make room.
Test 1: How much does the SSD KV cache shorten the second prefill?
oMLX enables a paged KV cache by default, with a hot RAM layer and a cold SSD layer. I cleared the cache via /admin/api/ssd-cache/clear before each test, then sent the same prompt 3 times. Run 1 = cold, runs 2 and 3 = warm.
A small gotcha worth flagging: the streaming response’s final chunk contains time_to_first_token, prompt_eval_duration, generation_duration, prompt_tokens_per_second, generation_tokens_per_second. The non-streaming response returns null for those fields. Streaming-on + stream_options: { "include_usage": true } is what makes them fire.
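Here’s a minimal request shape that does surface them, as a sketch: it assumes the OpenAI-compatible /v1/chat/completions endpoint on port 8765 and the model from the environment table; the SSD-cache clear mirrors the cookie-auth admin calls used elsewhere in this post (POST is an assumption), and the prompt bodies are placeholders.

# clear the SSD cache between cold runs (admin API; cookie from /admin/api/login, POST assumed)
curl -s -b cookie -X POST 'http://127.0.0.1:8765/admin/api/ssd-cache/clear'

# streaming + include_usage is what surfaces the timing metrics
curl -sN 'http://127.0.0.1:8765/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gemma-4-26B-A4B-it-4bit",
    "stream": true,
    "stream_options": {"include_usage": true},
    "temperature": 0,
    "max_tokens": 100,
    "messages": [
      {"role": "system", "content": "<long system prompt: rules, incidents, tool definitions>"},
      {"role": "user", "content": "/v1/events returns 500 only for tenants whose tenant_id starts with a. Why?"}
    ]
  }' | grep time_to_first_token
# the final data: chunk carries time_to_first_token, prompt_eval_duration, generation_duration,
# prompt_tokens_per_second, generation_tokens_per_second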
Prompts used
“Long system prompt + short user question” shape. The system prompt simulates a coding-agent context: FastAPI + Postgres + Celery SaaS, operational rules, past incidents, code style, tool definitions.
- 828-token version: 11 operational rules + 4 incidents + 6 tool definitions + a short code style section
- 1555-token version: A larger one. 9 stack layers + 11 rules + 10 incidents + 4 languages of code style + 9 tool definitions + an architecture detail subset
The user message in both cases was a troubleshooting request: /v1/events returns 500 only for tenants whose tenant_id starts with “a”. max_tokens=80~100, temperature=0 fixed.
Result: Gemma 4 26B A4B, 828 prompt tokens
| Metric | cold (1st) | warm (2nd) | warm (3rd) | delta |
|---|---|---|---|---|
| TTFT | 2.35s | 1.77s | 1.77s | -24.7% |
| prefill tokens/sec | 352 | 469 | 467 | +33.0% |
| generation tokens/sec | 53.7 | 53.7 | 53.7 | ±0% |
| cached_tokens field | 0 | 0 | 0 | — |
Result: Gemma 4 26B A4B, 1555 prompt tokens
| Metric | cold | warm1 | warm2 | delta |
|---|---|---|---|---|
| TTFT | 3.95s | 1.28s | 1.27s | -67.6% |
| prefill tokens/sec | 394 | 1213 | 1228 | +208% |
| generation tokens/sec | 54.4 | 54.5 | 54.6 | ±0% |
| cached_tokens field | 0 | 0 | 0 | — |
Reading the numbers
Generation tokens/sec stays at around 54 in every case. That’s expected — the KV cache doesn’t help the decode phase.
Prefill side (TTFT) effect depends strongly on prompt length.
At 828 tokens you get a 24% reduction. At 1555 tokens it’s 67%. For Claude Code / Codex-style workloads that carry tool definitions + repo context + conversation history, the longer the context grows, the more the second-and-later TTFT pulls away from cold.
A quiet observation: oMLX never populates prompt_tokens_details.cached_tokens — it’s always 0. If you check “did the cache hit?” through an OpenAI-compatible client, you’d think nothing was cached. You actually have to read TTFT and prompt_tokens_per_second improvements to confirm.
This missing field isn’t called out in the release notes and trips up anyone building a benchmark harness.
Test 2: Gemma 4 VLM MTP on/off
oMLX exposes vlm_mtp_enabled and vlm_mtp_draft_model. You set them, then reload the model.
I paired 26B A4B (4bit) as the main model with gemma-4-26B-A4B-it-assistant-bf16 as the draft.
Test images
Three kana-chan images generated with WAI-Anima.
Character consistency holds across all three (side ponytail with a blue scrunchie, ahoge, school uniform). Backgrounds and pose details differ to vary prompt difficulty.
01_portrait — simple composition. Upper body, cafe interior, coffee cup.
02_fullbody — busy background. Tokyo street at dusk, Japanese neon signs, passersby behind, dynamic turn-around pose.
03_ocr — text-heavy. At a desk with a laptop, bookshelf, sticky notes, a “WAI” mug. Prepared for OCR-leaning prompts.
Prompts used
| Prompt | Content |
|---|---|
| short | “What is shown in this image? Answer in one short sentence.” |
| long | “Describe this image in detail. Cover the character, clothing, setting, background, lighting, and overall atmosphere. Use multiple sentences.” |
| ocr | “List any visible text, labels, or written characters in the image. Output as a bullet list.” |
temperature=0, max_tokens=120. For each image × prompt combination I took the 2nd run after a warmup.
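For reference, each image × prompt pair was just a chat completion with the image inlined. A sketch, assuming oMLX accepts the standard OpenAI-style image_url content part with a base64 data URI (the release notes don’t spell out the request shape, so treat that as an assumption):

IMG_B64=$(base64 < 01_portrait.png)   # BSD base64 emits one unwrapped line
curl -sN 'http://127.0.0.1:8765/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -d "{
    \"model\": \"gemma-4-26B-A4B-it-4bit\",
    \"stream\": true, \"stream_options\": {\"include_usage\": true},
    \"temperature\": 0, \"max_tokens\": 120,
    \"messages\": [{\"role\": \"user\", \"content\": [
      {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image/png;base64,$IMG_B64\"}},
      {\"type\": \"text\", \"text\": \"What is shown in this image? Answer in one short sentence.\"}
    ]}]
  }"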
Results (gen_d = generation_duration, gen_tps = generation tokens/sec)
Each cell is the 2nd-and-after value (post-warmup).
01_portrait (simple composition)
| Prompt | MTP | TTFT | gen_d | gen_tps | output tok | delta (gen_tps) |
|---|---|---|---|---|---|---|
| short | OFF | 0.85s | 0.38s | 55.91 | 21 | — |
| short | ON | 0.85s | 0.44s | 48.20 | 21 | -13.8% |
| long | OFF | 0.86s | 2.12s | 56.59 | 120 | — |
| long | ON | 0.84s | 2.72s | 44.05 | 120 | -22.2% |
| ocr | OFF | 0.89s | 0.30s | 55.95 | 17 | — |
| ocr | ON | 0.81s | 0.30s | 56.68 | 17 | +1.3% |
02_fullbody (busy background)
| Prompt | MTP | TTFT | gen_d | gen_tps | output tok | delta |
|---|---|---|---|---|---|---|
| short | OFF | 0.87s | 0.33s | 54.45 | 18 | — |
| short | ON | 0.79s | 0.34s | 52.86 | 18 | -2.9% |
| long | OFF | 0.89s | 2.13s | 56.26 | 120 | — |
| long | ON | 0.86s | 2.44s | 49.25 | 120 | -12.5% |
| ocr | OFF | 0.87s | 2.09s | 57.41 | 120 | — |
| ocr | ON | 0.84s | 2.39s | 50.30 | 120 | -12.4% |
03_ocr (lots of text in the image)
| Prompt | MTP | TTFT | gen_d | gen_tps | output tok | delta |
|---|---|---|---|---|---|---|
| short | OFF | 0.84s | 0.34s | 56.32 | 19 | — |
| short | ON | 0.84s | 0.33s | 54.23 | 18 | -3.7% |
| long | OFF | 0.83s | 2.09s | 57.29 | 120 | — |
| long | ON | 0.80s | 2.69s | 44.57 | 120 | -22.2% |
| ocr | OFF | 1.38s | 0.31s | 55.64 | 17 | — |
| ocr | ON | 0.85s | 0.33s | 51.20 | 17 | -8.0% |
Reading the numbers
On M1 Max 64GB + Gemma 4 26B A4B 4bit + assistant-bf16 draft, VLM MTP makes everything slower.
The 120-token long description loses 22%. The long OCR output loses 12%. For short outputs (15-20 tokens) the differences are noise.
This matches my prior Gemma 4 text-side MTP measurement where 31B Dense and E4B got slower.
The official benchmark says “Gemma 4 image + text requests now decode noticeably faster”, but on Apple Silicon single-shot (batch=1) the draft model’s forward + verification overhead seems to outweigh the savings.
TTFT (prefill side) barely moves with MTP on or off. MTP targets decode speed-up, not prefill, so it’s a different axis from SSD cache.
Does 31B Dense reproduce this?
I wanted to rule out 26B-A4B-specific behavior (it’s a MoE), so I ran the same 3 images × 3 prompts on 31B Dense.
| Image | Prompt | MTP_OFF gen_tps | MTP_ON gen_tps | delta |
|---|---|---|---|---|
| 01_portrait | short | 14.86 | 12.07 | -18.8% |
| 01_portrait | long | 15.14 | 12.50 | -17.4% |
| 01_portrait | ocr | 14.76 | 14.06 | -4.7% |
| 02_fullbody | short | 14.86 | 14.86 | ±0% |
| 02_fullbody | long | 15.12 | 12.06 | -20.2% |
| 02_fullbody | ocr | 14.98 | 10.53 | -29.7% |
| 03_ocr | short | 14.84 | 14.47 | -2.5% |
| 03_ocr | long | 15.09 | 11.65 | -22.8% |
Same direction as 26B A4B — 31B Dense slows down on longer outputs with MTP_ON. 02_fullbody/ocr hits -29.7%. It’s not a MoE-vs-Dense asymmetry.
TTFT stays flat around 4.4s. Only decode loses.
31B Dense’s baseline gen_tps is 14-15 tps (about a quarter of 26B A4B’s 56 tps). Running 31B Dense on Apple Silicon is just heavier on its own.
Two small but real gotchas:
- model.mtp_compatible only reflects text-side MTP, not VLM MTP. VLM MTP is a separate flag. You can’t tell from the README.
- After setting vlm_mtp_enabled=true, you have to unload + load the model. The setting doesn’t take effect until you reload.
Does the output change? (temperature=0, same image, same prompt)
Speculative-decoding-style speedups should be lossless. At temperature=0, the main-model-only output and the draft+verify output should be byte-identical.
I ran 01_portrait’s long description twice at max_tokens=200 in baseline / vlm_mtp_on / dflash_on configurations, and hashed the outputs.
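The comparison itself is trivial: same request shape as the sketch above but with stream off (so the whole body comes back as one JSON object), response saved per config, then hash the text. The file names below are placeholders.

# resp_baseline.json / resp_vlm_mtp.json / resp_dflash.json: non-streaming responses per config
for f in resp_baseline.json resp_vlm_mtp.json resp_dflash.json; do
  printf '%s  ' "$f"
  jq -r '.choices[0].message.content' "$f" | shasum -a 256
done
# identical hashes == byte-identical output at temperature=0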
| Input | baseline (1) | baseline (2) | vlm_mtp_on (1) | vlm_mtp_on (2) | dflash_on |
|---|---|---|---|---|---|
| Text-only (Python fn) | ca96bf14... | (same) | ca96bf14... | (same) | ca96bf14... |
| 01_portrait (long desc) | 835c37a9... | 835c37a9... | 770090a9... | 770090a9... | 835c37a9... |
| 03_ocr (long desc) | 6f2620a5... | (same) | 89835b69... | (same) | 6f2620a5... |
- Text-only: all three configs match. MTP and DFlash are both fully lossless.
- Image + text (VLM): dflash_on matches baseline; vlm_mtp_on differs from baseline. Deterministic within the same config (run1 == run2), but cross-config the output diverges.
Concrete diff on 01_portrait’s long description:
-An anime-style digital illustration depicts a young woman with light brown hair...
-Her hair is styled in a playful manner, with long bangs framing her face and a small ponytail held by a light blue scrunchie.
+An anime-style digital illustration depicts a young girl with light brown hair...
+Her hair is styled in a playful way, with a prominent cowlick at the top and a side ponytail held by a light blue scrunchie.
-A small potted plant sits on the windowsill, and a hanging lamp provides warm illumination from above.
+A small potted plant sits on the windowsill, and a hanging lamp provides warm indoor lighting.
The concrete differences:
- “young woman” → “young girl”
- “long bangs framing her face and a small ponytail” → “a prominent cowlick at the top and a side ponytail”
- “warm illumination from above” → “warm indoor lighting”
Funnily enough, the VLM MTP version sometimes picks up image features more accurately (“side ponytail” matches the actual side ponytail, “prominent cowlick” matches the ahoge).
But for benchmarking purposes a “speedup that gives different outputs for the same input” is hard to evaluate. Either the VLM MTP accept criterion is loose, or there’s a small numerical divergence between the draft and main paths.
dflash_on and text-only MTP return byte-identical outputs (hash match). The lossless guarantee holds there. Only the VLM MTP in 0.3.9.dev2 pre-release behaves differently.
Honestly, after working through this section I’m left with a “is this thing really working?” feeling.
The pitch is “faster”, but it’s slower. The minimum baseline for speculative decoding is “lossless at temperature=0”, but outputs diverge. As the headline feature in the release notes, the behavior is rough.
Pre-release dev2 quality isn’t expected to be settled, and there are scenarios (batch>1 server, parallel requests, much longer prompts) where it might still pay off. But within “M1 Max single-shot, temperature=0, 120-token generation”, it doesn’t look ready for actual use. Leave vlm_mtp_enabled off for now.
Test 3: DFlash engine vs plain MLX engine
Toggle dflash_enabled true/false, then reload the model.
Prompt is text-only (a ~400 token incident post-mortem request, 200 token output). Each config gets a 2-run average after warmup.
Prompt used
“Given a past incident (event ingestion pipeline returned 500, a migration locked a partition for 18 minutes, the Kafka consumer OOM’d, recovery took 47 minutes), write a 200-word post-mortem with sections: Summary / Impact / Timeline / Root Cause / Detection / Resolution / Action Items.” Text-only request.
Result (Gemma 4 26B A4B, warm runs averaged)
| Config | TTFT | prefill tokens/sec | gen_duration | gen tokens/sec |
|---|---|---|---|---|
| baseline (plain MLX) | 0.69s | 279.2 | 3.54s | 56.37 |
| DFlash ON | 0.68s | 285.5 | 3.50s | 57.11 |
| DFlash + draft FP16 boost | 0.69s | 285.7 | 3.54s | 56.55 |
Reading the numbers
For a single 200-token generation, DFlash gives roughly +1.3% gen_tps. Within noise.
Per the release notes, DFlash’s value is supposed to land on “long sessions where prefix cache churns less”, “DFlash-specific quantization choices”, and “FP16 draft model boost”. A single one-shot workload probably can’t surface that.
Also during unload operations, /admin/api/models/{id}/unload sometimes returns 400 (Model not loaded). When the setting change internally triggers an unload first, a follow-up unload double-fires the 400. Harmless, but if you write unload→load in every bench script, you need to handle that exception.
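If your bench script does an unconditional unload between configs, a tolerant version looks roughly like this; cookie auth as in the other admin calls, POST and the “400 means already unloaded” reading taken from the behavior above, and MODEL_ID is a placeholder:

STATUS=$(curl -s -o /dev/null -w '%{http_code}' -b cookie -X POST \
  "http://127.0.0.1:8765/admin/api/models/$MODEL_ID/unload")
case "$STATUS" in
  2??|400) ;;   # 400 "Model not loaded" just means a settings change already unloaded it
  *) echo "unexpected unload status: $STATUS" >&2; exit 1 ;;
esac
# ...then load the model again through whatever path the rest of your script uses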
Re-measured with a 10-turn conversation (looking for DFlash’s real role)
Since one-shot was flat, I ran a 10-turn conversation. Prompt grows from 107 tokens at turn 1 to 1229 tokens at turn 10. Scenario: keep asking the model to troubleshoot a FastAPI CPU-pinned problem.
| Turn | prompt_tok | baseline TTFT | DFlash TTFT | baseline gen_tps | DFlash gen_tps | gen_tps delta |
|---|---|---|---|---|---|---|
| 1 | 107 | 5.79s | 7.31s | 56.32 | 57.72 | +2.5% |
| 2 | 219 | 0.67s | 0.66s | 55.72 | 57.58 | +3.3% |
| 3 | 353 | 0.89s | 0.90s | 54.83 | 56.26 | +2.6% |
| 4 | 445 | 1.11s | 1.08s | 55.01 | 56.17 | +2.1% |
| 5 | 579 | 1.37s | 1.33s | 52.19 | 54.22 | +3.9% |
| 6 | 701 | 1.58s | 1.54s | 52.40 | 53.68 | +2.5% |
| 7 | 818 | 1.75s | 1.76s | 52.72 | 53.25 | +1.0% |
| 8 | 963 | 2.04s | 1.98s | 50.88 | 52.32 | +2.8% |
| 9 | 1087 | 2.41s | 2.40s | 54.38 | 54.36 | ±0% |
| 10 | 1229 | 0.63s | 0.64s | 54.35 | 54.38 | ±0% |
Average +2.3% gen_tps. Consistently positive but not dramatic, far from the “no more bad churn” pitch in the release notes.
TTFT is essentially the same. The SSD cache layer handles both configs equally, and from turn 2 on, almost the entire previous-turn prefix is cache-hit. Turn 10’s sudden jump to >1900 prefill_tps was a lucky high cache hit ratio.
DFlash’s actual value probably shows up under (a) concurrent parallel requests, (b) prefix cache churn scenarios (multiple conversations sharing the same server in production), (c) combined with FP16 draft model boost or ParoQuant. A single-user 10-turn workload lands on a flat improvement.
Test 4: TurboQuant KV quantization
Quantizes the KV cache on the fly to reduce memory. Toggle turboquant_kv_enabled and pick turboquant_kv_bits (2/4/8).
I ran 4 configurations with a 442-token text-only prompt and 200-token output.
| Config | cold TTFT | warm TTFT | cold gen_tps | warm gen_tps |
|---|---|---|---|---|
| baseline (no KV quant) | 1.42s | 1.04s | 56.34 | 56.44 |
| TurboQuant KV 8-bit | 1.37s | 1.06s | 56.36 | 56.29 |
| TurboQuant KV 4-bit | 1.39s | 1.05s | 56.08 | 55.26 |
| TurboQuant KV 2-bit | 1.41s | 1.06s | 56.09 | 56.20 |
All within noise. At 442 tokens the KV cache itself is only a few MB, so quantization-on/off doesn’t move memory or speed measurably.
TurboQuant KV should help with 10K+ token long-context workloads or concurrent batched serving where KV memory dominates. This single-shot short-prompt test couldn’t surface it.
For what it’s worth, /admin/api/stats returns current_model_memory: n/a. I wanted to confirm memory deltas after toggling settings and got nothing. Numbers visible in the admin UI aren’t fully exposed via the API.
Test 5: Parallel-request load test (continuous batching)
oMLX serve advertises continuous batching. I sent the same text-generation task in 5 parallel requests and compared against sequential. Each request: ~30 token prompt, 150 token output × 5.
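Before the numbers, a sketch of how the two legs can be driven; endpoint and model as in the earlier sketches, and the prompt below is a stand-in for the actual ~30-token task:

req() {
  curl -s 'http://127.0.0.1:8765/v1/chat/completions' \
    -H 'Content-Type: application/json' \
    -d '{"model":"gemma-4-26B-A4B-it-4bit","temperature":0,"max_tokens":150,
         "messages":[{"role":"user","content":"Explain in ~100 words what a Celery worker does."}]}' \
    > /dev/null
}
time ( for i in 1 2 3 4 5; do req; done )          # sequential: 5 requests back to back
time ( for i in 1 2 3 4 5; do req & done; wait )   # parallel: 5 in flight, continuous batching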
| Mode | Total wall | Throughput (total gen_tps equiv) |
|---|---|---|
| Sequential 5 reqs | 13.81s | ~58 tps × 5 runs |
| Parallel (concurrency=5) | 8.42s | ~20 tps × 5 in parallel |
5 simultaneous requests give a 1.64× throughput improvement. Per-request gen_tps drops from 58 to 20, but 5 of them run together so total wall time goes 13.81s → 8.42s.
DFlash ON shows the same 1.64×, no additional gain. In this scenario, continuous batching itself is what’s doing the work.
DFlash’s production sweet spot is more likely “higher parallelism (10/20/50)”, “longer prompts (10K+ tokens)”, “heavy cache churn” combined together. 5-parallel short-prompts couldn’t surface its difference.
Test 6: Thinking budget (doesn’t work on Gemma 4)
There are settings enable_thinking and thinking_budget_tokens. They should be relevant for thinking models like Qwen3.x, so I tried enabling them on Gemma 4 26B A4B.
| Config | Output |
|---|---|
| baseline | Normal text output (400 tokens, gen_tps 57.9) |
| enable_thinking=true | Empty output (400 tokens generated, all empty) |
| thinking_budget=200 | Empty output |
| thinking_budget=1000 | Empty output |
Gemma 4 doesn’t natively support thinking output, so when thinking mode is on, all generated tokens get treated as internal “thought” and nothing is returned to the user.
oMLX doesn’t warn that “this model doesn’t support thinking” — the setting is exposed in the admin UI, so a user who flips it on will silently get empty responses indefinitely.
Test 7: oQ quantization (assistant 4-bit attempt)
oQ sensitivity measurement is one of dev2’s headline features. I tried it against the assistant-bf16 (800MB). Going straight to the 26B/31B main model would have needed bf16 versions and pushed disk hard, so I started with the smaller model.
$ curl -s -b cookie -X POST 'http://127.0.0.1:8765/admin/api/oq/start' \
-H 'Content-Type: application/json' \
-d '{"model_path":"~/.omlx/models/mlx-community/gemma-4-26B-A4B-it-assistant-bf16","oq_level":4,"text_only":true}'
{"success":true,"task":{"task_id":"c5225f40-...","output_name":"gemma-4-26B-A4B-it-assistant-oQ4",
"output_path":"~/.omlx/models/gemma-4-26B-A4B-it-assistant-oQ4","status":"pending"}}
The estimate endpoint up front: 800MB bf16 input → 235MB output (oq_level=4), 6.2GB peak memory during quantization, effective bpw 4.7.
Starting the task got it through status: pending → in progress, then it immediately stopped at status: failed.
{
"status": "failed",
"progress": 15.0,
"phase": "Loading model...",
"error": "oQ4: sensitivity measurement produced no scores. Check the preceding log lines for the root cause (model load, calibration data, or layer discovery), and either fix it or pass an explicit sensitivity_model_path."
}
The error suggests oQ’s sensitivity measurement ended with “no scores”.
The assistant model is the MTP draft and has num_layers: 4 (the main 26B/31B have dozens of layers). oQ can’t extract a sensitivity peak-and-valley signal from 4 layers. So oQ doesn’t work on draft models.
To exercise oQ properly you’d need a bf16 main model (gemma-4-26B-A4B-it-bf16 or gemma-4-31B-it-bf16). Given time and disk constraints, I stopped at confirming the estimate endpoint, start endpoint, and how it fails.
The auto_proxy_sensitivity field — advertised in the release notes as “oQ proxy auto-build when model exceeds RAM” — defaults to true on start. But the failure here isn’t “low memory”, it’s “not enough layers”, which is a different rescue path. Both look rough on this pre-release.
Test 8: specprefill (bonus — didn’t actually run)
Aside from DFlash in Test 3, oMLX has a specprefill_enabled setting. Same speculative idea but on the prefill side. It looked like a TTFT-side option alongside the SSD cache, so I tried it.
26B A4B as main, gemma-4-26B-A4B-it-assistant-bf16 as draft, two patterns: specprefill_keep_pct = 0.5 and = 0.3. The same 866-token text prompt I used for Gemma 4 text MTP.
| Config | cold TTFT | warm TTFT | cold prefill_tps |
|---|---|---|---|
| baseline | 7.99s | 1.88s | 108.4 |
| specprefill ON (keep_pct=0.5) | 7.48s | 1.86s | 115.8 |
| specprefill ON (keep_pct=0.3) | 7.33s | 1.91s | 118.2 |
Looks like noise. warm is dominated by the SSD cache, so it’s not a useful axis. cold looks like a 2-8% improvement, but the server log tells a different story:
2026-05-13 18:27:06,561 - omlx.engine.vlm - ERROR - SpecPrefill: draft model load failed: 404 Client Error.
2026-05-13 18:27:38,074 - omlx.engine.vlm - ERROR - SpecPrefill: draft model load failed: 404 Client Error.
The specprefill_draft_model I pointed at gemma-4-26B-A4B-it-assistant-bf16 tries to download from Hugging Face and fails, even though the model is already local.
The ERROR is logged but the request still completes — there’s a fallback path that runs the request with the main model alone. So specprefill effectively isn’t doing anything, and running just the main model gives the same numbers as baseline.
The 2-8% cold-side drift is just noise from cache state between runs.
specprefill isn’t called out in the release notes, but the settings fields specprefill_enabled / specprefill_draft_model / specprefill_keep_pct / specprefill_threshold do exist. Reasonable to assume it’s an unfinished feature in pre-release. Don’t touch it.
Test 9: Qwen2.5-VL vs Gemma 4 VLM (Ollama vs oMLX)
For local VLM options, I compared Ollama’s qwen2.5vl:7b (6GB) against oMLX’s Gemma 4 26B A4B 4bit (15GB) on the same 3 images. Prompt is the long-description one from the VLM MTP test, max_tokens=200, temperature=0.
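For reference, the Ollama leg can be driven through its native /api/chat endpoint (default port 11434); a sketch with the long-description prompt, measuring wall time and output length:

IMG_B64=$(base64 < 02_fullbody.png)
time curl -s 'http://127.0.0.1:11434/api/chat' -d "{
  \"model\": \"qwen2.5vl:7b\", \"stream\": false,
  \"options\": {\"temperature\": 0, \"num_predict\": 200},
  \"messages\": [{\"role\": \"user\",
    \"content\": \"Describe this image in detail. Cover the character, clothing, setting, background, lighting, and overall atmosphere. Use multiple sentences.\",
    \"images\": [\"$IMG_B64\"]}]
}" | jq -r '.message.content' | wc -c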
| Image | Backend | Wall time | Output chars | Notes |
|---|---|---|---|---|
| 01_portrait | oMLX Gemma 4 26B A4B | 4.32s | 975 | image encoder warmed up |
| 01_portrait | Ollama Qwen2.5-VL 7B | 24.73s | 790 | |
| 02_fullbody | oMLX Gemma 4 26B A4B | 90.12s | 1000 | image encoder cold, TTFT 86.6s |
| 02_fullbody | Ollama Qwen2.5-VL 7B | 30.75s | 1024 | |
| 03_ocr | oMLX Gemma 4 26B A4B | 88.53s | 942 | same as above, TTFT 85.0s |
| 03_ocr | Ollama Qwen2.5-VL 7B | 31.15s | 942 | |
How the speed structure works
Gemma 4 26B A4B takes about 85 seconds to start up the image encoder for an unseen image (on M1 Max 64GB). Once an image is warmed up, a follow-up question on the same image is around 4 seconds.
Qwen2.5-VL 7B in Ollama is stable at 24-30 seconds per image. Either MLX’s JIT/Metal kernel compile caching is wired differently, or it’s just the model size, probably both.
For “ask many questions about the same image”, Gemma 4 is dramatically faster (warm 4s vs cold 30s). For “process many different images one after another”, Qwen2.5-VL is more predictable at 30s/image.
Output quality (01_portrait)
Both descriptions:
[Gemma 4 26B A4B]
An anime-style digital illustration depicts a young woman with light brown hair and
large, expressive amber eyes sitting at a wooden table in a cozy cafe. Her hair is
styled in a playful manner, with long bangs framing her face and a small ponytail
held by a light blue scrunchie. She has a gentle, friendly expression with a slight
blush on her cheeks. She is wearing a white, long-sleeved button-down blouse
featuring a small red bow at the collar. Her hands are delicately holding a white
ceramic cup filled with dark coffee, resting on a matching saucer on the table in
front of her.
[Qwen2.5-VL 7B]
The image depicts an anime-style character with light brown hair styled in two
pigtails, each adorned with a blue hair tie. The character is wearing a white blouse
with a red bow tie, giving a school uniform-like appearance. She is holding a white
coffee cup with both hands, suggesting she is enjoying a warm beverage. The setting
appears to be a cozy café with wooden furniture, including tables and chairs, and a
window that lets in natural light. The background includes a chalkboard menu on the
wall and some potted plants, adding to the warm and inviting atmosphere.
Gemma 4 picks out “side ponytail with a light blue scrunchie”, “small red bow at the collar”, “ceramic cup filled with dark coffee”.
Qwen2.5-VL 7B misidentifies the hairstyle as “two pigtails” (the actual character has a single side ponytail, not twin tails). It says “two pigtails” on all 3 images — the hairstyle recognition is consistently wrong.
On 03_ocr though, Qwen2.5-VL 7B picks up the WAI mug and the Apple logo — so for OCR-leaning prompts it can come out ahead.
26B vs 7B, quantization precision, training data — there are too many variables for either model to win uniformly. But for “decent description accuracy + OCR” on a Mac, Qwen2.5-VL 7B is probably the sensible default, switching to Gemma 4 only when needed.
Test 10: Cold-start tax (cost of the first call after model load)
In real ops where you unload/load models, how slow is the first response after loading? I measured four scenarios.
| Scenario | 1st call wall | TTFT |
|---|---|---|
| A: fully warm (model loaded, SSD cache warm) | 0.23s | 0.18s |
| B: model loaded + SSD cache cleared | 0.25s | 0.20s |
| C: unload → load → SSD cache clear → first call | 9.79s | 9.74s |
| C’: repeat of C | 8.99s | 8.94s |
Load time itself: unload 0.9s, load 9.7-11.9s.
On top of that, the first inference after load carries a +9-10 second warmup tax. Looks like Metal kernel JIT compilation or first-time MLX graph optimization. If you spin up omlx serve and start hammering it immediately (CI integration, automated benchmarks), expect roughly 30 seconds without a response unless you budget for this.
Total cold-start budget: about 20-22 seconds.
Also: clearing the SSD cache against a warm model with a small prompt (A → B) doesn’t really show a difference. The big effect from Test 1 only surfaces once the prompt gets long enough.
Test 11: Connecting agent CLIs to local LLMs via omlx launch
omlx launch list shows what’s detected.
$ omlx launch list
Available integrations:
claude Claude Code (installed)
codex Codex (installed)
opencode OpenCode (not installed)
openclaw OpenClaw (not installed)
pi Pi (not installed)
copilot Copilot CLI (installed)
Looking at big-3 LLM CLI coverage: Anthropic’s Claude Code and OpenAI’s Codex have both been supported since dev1; Microsoft’s GitHub Copilot CLI was added in dev2.
Google Gemini CLI is not supported — there’s no gemini.py under omlx-test/omlx/omlx/integrations/. I have /opt/homebrew/bin/gemini installed, so this isn’t “install detection failed”, it’s “no implementation”. Whether Gemini CLI itself is worth using is a separate debate, but there’s a real hole in big-3 CLI coverage.
Copilot CLI × Gemma 4 26B A4B
Copilot CLI is dev2’s new addition. After npm install -g @github/copilot, start it via omlx launch copilot --model gemma-4-26B-A4B-it-4bit --port 8765 --api-key omlxtest.
Under the hood, oMLX sets Copilot’s environment variables (COPILOT_PROVIDER_BASE_URL, COPILOT_PROVIDER_WIRE_API=responses, COPILOT_MODEL, etc.) and execs into the copilot binary.
Looking at the source (integrations/copilot.py), Copilot uses the responses endpoint (/v1/responses). oMLX supports that endpoint. There’s even a comment: “Copilot CLI appears to have issues with the completions endpoint, responses appears to work as expected”.
Running copilot -p "..." three times non-interactively:
| Run | Query | Input tokens | Cache hit | Wall time (26B A4B) |
|---|---|---|---|---|
| 1 | “What is 2+2?” | 26,627 | 0 | 1m 28s |
| 2 | “What is 5*7?” | 26,628 | 14,330 (54%) | 33s |
| 3 | “What is 100/4?” | 26,629 | 14,335 (54%) | 34s |
Copilot CLI sends all 26.6k tokens of system + tool definitions on every request. If you’re serious about running a local LLM as an agent backend, prefill time is the UX bottleneck.
oMLX 0.3.9.dev2’s paged SSD cache earns its keep here. With 54% of the prompt hitting the cache, wall time drops by well over half: 88s → 33s, -63%.
Pointing the same Copilot CLI at 31B Dense changes the picture:
| Run | Query | Wall time (31B Dense) | vs 26B A4B |
|---|---|---|---|
| 1 | “What is 2+2?” | 4m 40s | 3.2× |
| 2 | “What is 5*7?” | 3m 16s | 5.9× |
| 3 | “What is 100/4?” | 3m 21s | 5.9× |
31B Dense takes 3-5 minutes per single response. SSD cache still helps (run 1 → run 2 is -30%), but absolute time is far from a daily-use feel.
For backing Copilot CLI with a local model on M1 Max 64GB, 26B A4B is the practical line, 31B Dense is for evaluation only.
Small gotchas:
- The chat/completions endpoint from Test 1 always returned cached_tokens: 0, but the responses endpoint (Copilot’s) does return a proper cached token count. Different endpoints, different metric fill-in. Worth knowing.
- omlx launch copilot itself opens a curses TUI and dies on non-TTY stdin (Error: stdin is not a terminal). For CI or scripted parallel runs you’d need to skip the TUI. Here I assembled the env vars by hand and called copilot directly (see the sketch below).
- 88 seconds is genuinely long. If you point your usual Anthropic or hosted-Copilot timeouts at this, the timeout will trip. Expect to need an --api-timeout kind of override.
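The by-hand variant looked roughly like this. The authoritative variable list lives in integrations/copilot.py; the base URL shape below is an assumption, and copilot.py sets a few more variables than shown here:

export COPILOT_PROVIDER_BASE_URL='http://127.0.0.1:8765/v1'   # shape assumed; check copilot.py
export COPILOT_PROVIDER_WIRE_API='responses'
export COPILOT_MODEL='gemma-4-26B-A4B-it-4bit'
copilot -p "What is 2+2?"   # non-interactive, no TUI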
Codex × Gemma 4 26B A4B
Codex (OpenAI) has been supported since dev1. omlx launch codex writes a [model_providers.omlx] section into ~/.codex/config.toml.
I bypassed the TUI by writing config.toml manually and using codex exec:
model = "gemma-4-26B-A4B-it-4bit"
model_provider = "omlx"
[model_providers.omlx]
name = "oMLX"
base_url = "http://127.0.0.1:8765/v1"
env_key = "OMLX_API_KEY"
3 consecutive queries (after SSD cache clear):
| Run | Query | Input tokens | Wall time |
|---|---|---|---|
| 1 | “What is 6*8?” | 82 | 3.1s |
| 2 | “What is 7*7?” | 83 | 2.4s |
| 3 | Code generation (Fibonacci with memoization) | 163 | 4.4s |
The contrast with Copilot is dramatic. Codex sends maybe 82 tokens per request, compared to Copilot’s 26,627. About 300× fewer.
Where Copilot ships all tool definitions on every request, Codex starts with a minimal system prompt and only expands when a tool call needs to happen.
When wiring an agent onto a local LLM, Codex puts dramatically less prefill pressure on the UX.
Gotchas:
- On startup Codex prints failed to refresh available models: stream disconnected before completion: failed to decode models response: missing field 'models'. Codex expects {models: [...]}, oMLX returns the standard {object: "list", data: [...]}. Doesn’t break execution, but it shows up red in the log every time.
- codex exec --skip-git-repo-check is needed outside a git repo.
Claude Code × Gemma 4 26B A4B
Claude Code (Anthropic) has been supported since dev1. omlx launch claude sets ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_DEFAULT_OPUS_MODEL/SONNET_MODEL/HAIKU_MODEL, then execs into claude.
I ran it in --bare mode (which skips hooks, LSP, plugin sync, auto-memory, etc. — the minimum mode).
| Run | Query | Wall time |
|---|---|---|
| 1 | “What is 2+2?” | 8.22s |
| 2 | “What is 5*7?” | 2.07s |
| 3 | “What is 100/4?” | 2.03s |
From server logs, claude --bare’s system prompt is around 1,795 tokens. Heavier than Codex’s 82, dramatically lighter than Copilot’s 26,627.
| CLI | Prompt size (rough) | Run 1 | Run 2 | Run 2 reduction |
|---|---|---|---|---|
| Codex | 82 tokens | 3.1s | 2.4s | -23% |
| Claude --bare | 1,795 tokens | 8.2s | 2.1s | -75% |
| Copilot | 26,627 tokens | 88s | 33s | -63% |
Without --bare, Claude Code sends its full prompt (CLAUDE.md, MCP, hooks, etc.), so a realistic run lands somewhere closer to Copilot. I picked the minimum mode to see the server-side behavior cleanly.
Gotchas:
- You have to set ANTHROPIC_API_KEY="" explicitly to an empty string, otherwise Claude Code tries Anthropic’s official auth and fails. The omlx launch claude implementation explicitly sets an empty string.
- Even with --bare, all three of ANTHROPIC_DEFAULT_OPUS/SONNET/HAIKU_MODEL need to be set; otherwise internal routing may not pick the model you specified.
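Putting those two gotchas together, a by-hand equivalent of what omlx launch claude sets up looks roughly like this; the base URL shape and token value are assumptions (the launcher fills them in for you):

export ANTHROPIC_BASE_URL='http://127.0.0.1:8765'    # shape assumed; check the launcher
export ANTHROPIC_AUTH_TOKEN='omlxtest'
export ANTHROPIC_API_KEY=''                          # must be explicitly empty (see above)
export ANTHROPIC_DEFAULT_OPUS_MODEL='gemma-4-26B-A4B-it-4bit'
export ANTHROPIC_DEFAULT_SONNET_MODEL='gemma-4-26B-A4B-it-4bit'
export ANTHROPIC_DEFAULT_HAIKU_MODEL='gemma-4-26B-A4B-it-4bit'
claude --bare                                        # then ask the three questions from the table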
What this all shows
Practical takeaways:
- SSD cache works. 828-token prompts: TTFT -25%; 1.5k tokens: TTFT -68% (prefill tps +208%); Claude --bare’s 1.8k tokens: -75% on the second run; Copilot CLI’s 26k tokens: 88s → 33s (-63%) on the second run. For coding-agent workloads, this is the easiest-to-feel Mac local inference speedup.
- VLM MTP should be left off. 26B A4B and 31B Dense both get slower across the board (long output -17 to -30%). Output diverges from baseline even at temperature=0. The “is this really working?” feeling sticks. Reasonable as a pre-release issue.
- Text-side MTP and DFlash are lossless. DFlash is roughly flat (+1.3%) on single-shot, around +2.3% over 10 turns. Its job is suppressing prefix-cache churn over long sessions. With 5 parallel requests, no extra DFlash effect appears — continuous batching alone is what produces the 1.64× throughput improvement.
- TurboQuant KV quantization didn’t move at 442 tokens. Should help on long context or heavy parallelism — out of scope here.
- Thinking budget produces empty output on Gemma 4. oMLX doesn’t detect “this model doesn’t support thinking” and lets you flip the switch anyway. Worth re-testing on actual thinking models (Qwen3.x etc.).
- oQ quantization fails on a 4-layer assistant draft model because sensitivity measurement can’t extract scores. You need a real bf16 main model to exercise it. The estimate endpoint does work correctly.
- specprefill is exposed in settings, but pointing the draft at gemma-4-26B-A4B-it-assistant-bf16 triggers a 404 from a Hugging Face download attempt, and the request silently falls back to the main model. Effectively a no-op. Looks like a pre-release bug.
- Qwen2.5-VL 7B (Ollama) vs Gemma 4 26B A4B (oMLX): for “ask many questions about the same image” Gemma 4 is dramatically faster (warm 4s, cold 85s). For “process many different images” Qwen2.5-VL is steadier (24-30s per image). Gemma 4’s descriptions are more detailed (side ponytail, red bow at the collar, etc.), but Qwen2.5-VL picks up the WAI mug and Apple logo on OCR-heavy images, so it’s not one-sided.
- Cold-start tax: ~9-10s warmup on the first inference after load. Model load itself 10-12s. Total ~20-22s. Metal kernel JIT compilation is visible here.
- Copilot CLI × 26B A4B is workable: 1m 28s → 33s (-63%). Switching to 31B Dense takes 3-5 minutes per response, past the practical line. On M1 Max 64GB, 26B A4B is the realistic choice.
- omlx launch works for Copilot, Codex, and Claude Code. Prompt size differs by orders of magnitude per CLI: Codex (~82) / Claude --bare (~1.8K) / Copilot (~26.6K). Even the same local LLM feels very different depending on agent CLI choice — measure token cost first.
- Google Gemini CLI is the only big-3 LLM not supported. There’s no gemini.py under omlx-test/omlx/omlx/integrations/. Wait for an OSS PR or roll your own. Whether Gemini CLI itself is great is a separate question, but the integration hole is real.
- If you plan to run 26B/31B on Mac, 64GB unified memory is required. My setup auto-allocated max_model_memory: 50.4GB, max_process_memory: 56GB. At 32GB you’d be picking a different model.
If you’re about to touch oMLX 0.3.9.dev2 yourself, here’s the “things that will trip you up first” list:
- Install: neither pip install omlx nor brew install omlx works. Clone the source and pip install -e ..
- Disk: the SSD cache auto-reserves 92.6GB. Cap it with --paged-ssd-cache-max-size.
- Metrics (TTFT, prefill_tps, generation_tps) only show up when streaming is on AND stream_options.include_usage=true.
- Admin API auth: /admin/api/login + Cookie, or Bearer. Not interchangeable — Bearer alone gets rejected by admin endpoints.
- The mtp_compatible field is text-side only. VLM MTP is a separate flag (vlm_mtp_enabled).
- After changing settings, unload then load the model. Settings changes internally trigger an unload first, so a follow-up unload double-fires and returns 400 (Model not loaded).