oMLX 0.3.9.dev2 for Mac coding agents: Gemma 4 VLM MTP, DFlash, launch copilot
If you read oMLX 0.3.9.dev2 through the lens of “Mac local LLM as the backend for Claude Code, Codex, or Copilot,” the center of gravity of the additions becomes visible: Gemma 4 VLM-side MTP, Gemma 4 now running through the DFlash engine, the new omlx launch copilot, SSD KV cache tuning, and a Restart Server button in the admin UI.
The v0.3.9.dev2 GitHub release is a pre-release, the second dev build heading toward 0.3.9.
oMLX itself is an MLX-based inference server for Apple Silicon.
It exposes both the OpenAI-compatible API and the Anthropic Messages API, and ships a menu-bar app, an admin UI, model management, continuous batching, and an SSD-backed KV cache as one package.
The README requires macOS 15.0+, Python 3.10+, and Apple Silicon.
Intel Macs and older macOS are not in scope.
A design that avoids recomputing the shared prefix every turn
Once you put a local LLM behind a coding agent, generation speed is not the first thing that hurts.
The system prompt, tool definitions, the repository description, and the recent dialogue all hang off every request, so prefill before generation gets heavy.
What oMLX leans on is reusing that prefill as a persisted KV cache.
The README describes hot blocks in RAM, cold blocks on SSD, and the cache itself persisted in safetensors format.
When the same prefix comes around again, the server restores from SSD instead of recomputing every token.
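To make the hot/cold split concrete, here is a minimal sketch of block-tiered prefix caching with safetensors persistence. This is not oMLX's code: the class name, keys, and LRU eviction policy are invented for illustration, and it assumes numpy and the safetensors package are available.

```python
import os
from collections import OrderedDict

import numpy as np
from safetensors.numpy import save_file, load_file

class TieredKVCache:
    """Toy prefix cache: hot blocks in RAM, cold blocks as safetensors files on SSD."""

    def __init__(self, ssd_dir: str, max_hot_blocks: int = 8):
        self.hot = OrderedDict()          # prefix_hash -> {"k": ..., "v": ...}
        self.ssd_dir = ssd_dir
        self.max_hot_blocks = max_hot_blocks
        os.makedirs(ssd_dir, exist_ok=True)

    def _path(self, prefix_hash: str) -> str:
        return os.path.join(self.ssd_dir, f"{prefix_hash}.safetensors")

    def put(self, prefix_hash: str, k: np.ndarray, v: np.ndarray) -> None:
        self.hot[prefix_hash] = {"k": k, "v": v}
        self.hot.move_to_end(prefix_hash)
        # Evict least-recently-used blocks to SSD once the RAM budget is exceeded.
        while len(self.hot) > self.max_hot_blocks:
            old_hash, block = self.hot.popitem(last=False)
            save_file(block, self._path(old_hash))

    def get(self, prefix_hash: str):
        if prefix_hash in self.hot:            # RAM hit: no prefill needed
            self.hot.move_to_end(prefix_hash)
            return self.hot[prefix_hash]
        path = self._path(prefix_hash)
        if os.path.exists(path):               # SSD hit: restore instead of recompute
            block = load_file(path)
            self.put(prefix_hash, block["k"], block["v"])
            return block
        return None                            # miss: caller recomputes the prefix
```

The point the sketch makes is the asymmetry: a restore from SSD is an I/O-bound copy, while a miss means re-running prefill over the whole shared prefix.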
This direction is close to Hypura’s NVMe streaming and TurboQuant’s KV cache quantization that I wrote about earlier.
The difference is which part you stream: Hypura is more about model weights and FFN layers, while oMLX is about the runtime state of the agent conversation — the KV cache.
“Does the model fit in memory” and “does the long context get recomputed every turn” are two different bottlenecks.
The oMLX site notes that for coding agents, the KV cache gets invalidated many times within a single session.
If you just want to run a small model fast, Ollama or LM Studio is enough, but once you have long tool-laden conversations bouncing back and forth, prefill latency dominates perception before generation speed does. Whether you can hold on to that prefill decides the operational feel of pointing Claude Code or Codex at a local backend.
The headline of 0.3.9.dev2 is MTP on the Gemma 4 VLM side
The first item in the 0.3.9.dev2 release notes is MTP support for Gemma 4 image + text requests.
MTP stands for Multi-Token Prediction: predict several next tokens, then let the main model verify and accept them.
Conceptually it sits in the speculative-decoding family.
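As a rough picture of the verify-and-accept step (a toy sketch of the general idea, not oMLX's or Gemma 4's actual MTP head): a cheap draft proposes a few tokens, the main model scores the whole continuation in one pass, and only the prefix the main model agrees with is kept.

```python
from typing import Callable, List

def mtp_step(
    context: List[int],
    draft_next_tokens: Callable[[List[int], int], List[int]],
    main_argmax: Callable[[List[int]], List[int]],
    k: int = 4,
) -> List[int]:
    """One speculative step: draft k tokens, verify them with the main model.

    main_argmax(seq) is assumed to return, for each position i, the main
    model's argmax next token given seq[:i+1].
    """
    proposed = draft_next_tokens(context, k)          # cheap multi-token guess
    # One main-model pass over context + proposal yields the main model's own
    # prediction at every position of the proposal, plus one bonus position.
    verified = main_argmax(context + proposed)[-(k + 1):]
    accepted: List[int] = []
    for i, tok in enumerate(proposed):
        if verified[i] != tok:         # first disagreement stops acceptance
            accepted.append(verified[i])   # keep the main model's token instead
            break
        accepted.append(tok)
    else:
        accepted.append(verified[k])   # all k accepted: one bonus token for free
    return context + accepted
```

In MTP-style setups the draft typically comes from extra prediction heads on the same model rather than a separate draft model; the acceptance step looks the same either way, and the speedup depends entirely on how often the proposals survive verification.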
I already covered Gemma 4 MTP here twice, first with the official material and then with a hands-on run on an M1 Max 64GB.
In that hands-on, only 26B A4B got faster; 31B Dense and E4B got slower.
MTP is not “always faster when on” — it shifts with model architecture, batch, prompt, quantization, and runtime.
The dev2 change is that MTP now applies to the vision path too, not just text decode.
The release notes say toggling MTP in the admin UI causes the vision path to predict multiple tokens the same way as text.
It is off by default so existing behavior is preserved.
VLMs juggle image encoding, vision tokens, and language decoding, so the speed bottleneck is harder to spot than in text-only LLMs.
If you have Gemma 4 reading images for a real workflow, the right move is to flip MTP in the admin UI and measure short questions, long descriptions, and OCR-style reads separately.
As with text-side Gemma 4 MTP, a single benchmark won’t tell you the answer.
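If you want to run that comparison yourself, one minimal approach is to time the same image request with MTP toggled off and then on in the admin UI between runs. The snippet below assumes the OpenAI-compatible endpoint the README describes, the openai Python package, and placeholder base URL and model id; it only records wall-clock time per request type.

```python
import base64
import time

from openai import OpenAI  # assumes the openai Python package is installed

# Placeholder base URL / model id: adjust to your local oMLX setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

def timed_image_request(prompt: str) -> float:
    start = time.perf_counter()
    client.chat.completions.create(
        model="gemma-4",  # placeholder model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=256,
    )
    return time.perf_counter() - start

# Run the loop once with MTP off, toggle it in the admin UI, then run it again.
for prompt in ["What is in this image?",               # short question
               "Describe this image in detail.",       # long description
               "Transcribe all text in the image."]:   # OCR-style read
    print(prompt, f"{timed_image_request(prompt):.2f}s")
```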
DFlash and ParoQuant widen the model menu
In dev2, you can run Gemma 4 through the DFlash engine.
DFlash is one of the fast inference paths around MLX, and oMLX treats it as an option in its engine pool.
The release notes say Gemma 4 no longer falls back when DFlash is selected — it sits in the same pool as other models.
Alongside that, dev2 adds DFlash-specific quantization choices, an FP16 draft-model boost, and a configurable prefix cache size.
This is the kind of tuning that keeps long sessions from churning through cache entries: rather than “fast on the first run,” it leans toward “still doesn’t fall apart after thirty minutes of use.”
On the quantization side, ParoQuant is supported and there is a custom quantization loader hook.
omlx serve, omlx chat, and the admin UI’s Load model now go through the same dispatcher, so new quantization methods can be added through loader branches.
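The dispatcher idea is easy to picture even without reading oMLX's source: one registry keyed by quantization method, and serve, chat, and the admin UI all resolve through it. The sketch below is a generic pattern with invented names, not the project's actual loader hook.

```python
from typing import Callable, Dict

# Invented names for illustration: a registry mapping a quantization method
# string to a loader callable that returns a ready-to-serve model object.
_LOADERS: Dict[str, Callable[[str], object]] = {}

def register_loader(method: str):
    def decorator(fn: Callable[[str], object]):
        _LOADERS[method] = fn
        return fn
    return decorator

@register_loader("paroquant")
def load_paroquant(model_path: str) -> object:
    ...  # decode ParoQuant weights, build the MLX model

@register_loader("fp16")
def load_fp16(model_path: str) -> object:
    ...  # plain half-precision load

def load_model(model_path: str, method: str) -> object:
    """Single entry point used by serve, chat, and the admin UI alike."""
    try:
        return _LOADERS[method](model_path)
    except KeyError:
        raise ValueError(f"no loader registered for quantization method {method!r}")
```

The practical upside of funneling everything through one entry point is that a new quantization method lands everywhere at once instead of being wired into each frontend separately.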
Also new: oQ sensitivity measurement can auto-build a proxy model when the original doesn’t fit in RAM.
Apple Silicon is forgiving thanks to unified memory, but you still hit a wall when a large model can’t be loaded whole before quantizing.
With the proxy-model auto-build for oQ sensitivity measurement, even 64GB-class or 48GB-class Macs get a wider window into quantization work that used to be “won’t load, give up.”
omlx launch now covers Copilot too
dev1 added omlx launch claude, codex, opencode, openclaw, and pi. dev2 adds copilot.
Running omlx launch copilot opens a curses TUI; you pick a model, oMLX assembles the environment variables, and it execs the CLI.
It’s a small thing, but it removes a lot of the painful glue when wiring local LLMs into agents.
Writing model name, base URL, API key, “is this Anthropic-compatible or OpenAI-compatible?”, and context-length handling into shell config every time is fragile.
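Under the hood the pattern is the familiar "assemble env vars, then exec." A rough sketch of what a launcher like this does, with hypothetical variable names (the actual names each CLI reads differ, and that difference is exactly the fragile glue the command hides):

```python
import os

def launch_cli(cli: str, model: str, base_url: str, api_key: str = "local") -> None:
    """Point a coding-agent CLI at a local OpenAI/Anthropic-compatible server, then exec it."""
    env = os.environ.copy()
    # Hypothetical variable names for illustration; each CLI expects its own set.
    env.update({
        "OPENAI_BASE_URL": base_url,
        "OPENAI_API_KEY": api_key,
        "LOCAL_MODEL": model,
    })
    # Replace the current process with the CLI, same as `exec` in a shell.
    os.execvpe(cli, [cli], env)

# e.g. launch_cli("copilot", "gemma-4", "http://localhost:8000/v1")  # placeholders
```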
The admin UI’s Applications dashboard also gets a Copilot row.
As I wrote in Bridging Ollama and local LLMs to MCP servers, what trips you up when wiring a local LLM into tool calls or MCP isn’t the model’s reasoning quality, it’s API compatibility, tool definitions, and connecting the running process.
oMLX folds that into one step — “pick a model, launch the CLI” — and the list of launch-supported CLIs is now six.
A Restart Server button closes the config-then-restart loop in one screen
dev2 also adds a Restart Server button in the admin UI.
You can restart the server from Settings, and the restart endpoint is protected by admin auth.
You don’t need to drop back to the menu bar or the shell to apply settings changes.
When you run an agent for a long stretch, you end up touching model directory, cache settings, quantization, and engine choice through the admin UI.
If only restart sits in the shell, it gets hard to track whether the running state matches what the UI is showing. With Restart Server in the same screen, change-then-confirm fits on one page.
The release also brings STT max_tokens support, non-English transcription language hints, a French admin UI, downloads stored in a Hugging Face-namespace layout, chat history renaming, and streaming completion logs.
But for the “Mac local LLM behind Codex or Copilot” angle, Gemma 4 VLM MTP, DFlash, oQ/ParoQuant, omlx launch copilot, and Restart Server are where you actually feel the difference when you use it.
How to measure dev2 without mixing the signals
v0.3.9.dev2 is a pre-release, so the assumption is that you try it against a side model directory or a short verification session before swapping your main setup.
The release notes also say to open an issue if you run into bugs.
If you want to measure speed in oMLX, you have to separate the dimensions: TTFT, prefill tokens/sec, generation tokens/sec, cache-hit behavior on the second-and-later runs of the same prompt, and memory use.
The SSD KV cache eliminating prefill, MTP speeding up decode, and DFlash giving a different engine baseline are three different effects that get mixed together if you measure them at once.
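A minimal way to keep those dimensions apart is to stream the response, record time to first token separately from total time, and run the same prompt twice so the second run exercises the cache-hit path. The snippet assumes the OpenAI-compatible streaming endpoint, the openai Python package, and placeholder URL and model id; it counts streamed chunks as an approximation of generated tokens.

```python
import time

from openai import OpenAI  # assumes the openai Python package

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder URL

def measure(prompt: str, model: str = "gemma-4") -> None:  # placeholder model id
    start = time.perf_counter()
    first_token = None
    n_chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token is None:
                first_token = time.perf_counter()  # TTFT ~ prefill cost
            n_chunks += 1
    total = time.perf_counter() - start
    ttft = (first_token or start) - start
    gen_rate = n_chunks / (total - ttft) if total > ttft else float("nan")
    print(f"TTFT {ttft:.2f}s  gen {gen_rate:.1f} chunks/s  total {total:.2f}s")

# A long shared prefix makes the prefill/decode split visible.
long_prompt = "You are a coding agent.\n" + "Context line.\n" * 2000 + "Summarize the context."
measure(long_prompt)  # cold run: full prefill
measure(long_prompt)  # warm run: a prefix-cache hit should cut the TTFT
```

The cold-versus-warm TTFT gap isolates the SSD KV cache effect; the chunks-per-second figure is where MTP and the DFlash engine choice show up, so keep the two numbers apart when comparing settings.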
Even with Gemma 4 text-side MTP, single-shot runs on Apple Silicon sometimes inverted the official benchmark direction. MTP, DFlash, and the quantization loader all behave differently across models, batch sizes, and prompt lengths.
References