Ollama Moves to MLX Backend, Dramatically Speeds Up Local Inference on Apple Silicon
Ollama has switched its Apple Silicon backend from llama.cpp to MLX. The change shipped as a preview in Ollama 0.19, released on March 30, 2026.
Until now, Ollama used llama.cpp (GGML) for CPU/GPU inference. To fully leverage Apple Silicon’s unified memory architecture, it makes sense to use Apple’s native MLX framework. A request for an MLX backend had been open on GitHub for a long time, and it has finally landed.
Benchmarks
Tests were run with Qwen3.5-35B-A3B (an MoE model with about 3B active parameters).
| Metric | Ollama 0.19 (MLX, NVFP4) | Ollama 0.18 (llama.cpp, Q4_K_M) | Improvement |
|---|---|---|---|
| Prefill | 1,810 tokens/s | 1,154 tokens/s | +57% |
| Decode | 112 tokens/s | 58 tokens/s | +93% |
With int4 quantization, the numbers are even higher: prefill 1,851 tokens/s and decode 134 tokens/s.
The near-doubling of decode speed stands out. Because decoding is memory‑bandwidth‑bound, this suggests MLX handles data placement on unified memory more efficiently than GGML. Prefill is compute‑bound, yet it still improves by 57%.
What Is MLX
MLX is an array‑computing framework that Apple open‑sourced at the end of 2023. It offers a NumPy‑like API and is optimized for Apple Silicon’s unified memory architecture. Its key feature is eliminating CPU↔GPU data copies, which sets it apart fundamentally from PyTorch’s MPS backend.
PyTorch’s MPS backend essentially maps a CUDA‑centric operator set onto Metal, which often runs into Apple‑Silicon‑specific issues such as ComfyUI’s MPS non‑contiguous tensor bug and BF16 performance drops. MLX is designed for Apple Silicon from the ground up, so there is no translation tax to pay in the first place.
The MLX ecosystem includes Python (mlx), Swift (mlx-swift), and C/C++ APIs, plus the mlx-lm package for LLM inference. Ollama 0.19 is built on top of MLX.
Support for the M5 Chip’s Neural Accelerator
On M5, M5 Pro, and M5 Max chips, you can leverage the GPU‑integrated Neural Accelerator via MLX. According to Apple’s benchmark, the M5 Neural Accelerator handles matrix multiplication through Tensor Operations (TensorOps) introduced in Metal 4.
That benchmark compares a MacBook Pro M5 (24GB) against an M4.
| Model | M4 TTFT | M5 TTFT | Speedup |
|---|---|---|---|
| Qwen 1.7B (BF16) | 0.33 s | 0.15 s | 2.2x |
| Qwen 8B (BF16) | 3.02 s | 0.96 s | 3.1x |
| Qwen 14B (4bit) | 12.63 s | 4.24 s | 3.0x |
| Qwen 30B MoE (4bit) | 5.24 s | 2.70 s | 1.9x |
TTFT (Time to First Token) is a compute‑bound phase, so the Neural Accelerator’s effect shows up directly. Token generation speed is memory‑bandwidth‑bound, so the improvement is about 19–27%. Even so, it’s a clear step up over M4.
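As a quick sanity check, the speedup column of the TTFT table can be recomputed directly from the two timing columns:

```python
# Recompute the speedup column of the TTFT table above.
ttft = {  # model: (M4 seconds, M5 seconds)
    "Qwen 1.7B (BF16)":    (0.33, 0.15),
    "Qwen 8B (BF16)":      (3.02, 0.96),
    "Qwen 14B (4bit)":     (12.63, 4.24),
    "Qwen 30B MoE (4bit)": (5.24, 2.70),
}
speedups = {model: round(m4 / m5, 1) for model, (m4, m5) in ttft.items()}
for model, s in speedups.items():
    print(f"{model}: {s}x")
```

The ratios reproduce the published 1.9x–3.1x figures, with the dense mid‑size models benefiting the most.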
Given that Ollama’s official benchmark numbers were measured on an M5 system, earlier chips won’t reach the same figures, but the MLX backend’s benefits should apply across all Apple Silicon generations.
NVFP4 Quantization Support
Ollama 0.19 supports NVIDIA’s NVFP4 (4‑bit floating point) format. NVFP4 is a low‑precision inference data format developed by NVIDIA and is said to lose less accuracy than plain INT4 or MXFP4.
```shell
ollama run qwen3.5:35b-a3b-coding-nvfp4
```
Although it bears NVIDIA’s name, NVFP4 itself is a data‑format specification rather than hardware‑dependent. As long as you have kernels that implement the math, it runs on any hardware. Recent MLX releases natively support NVFP4 and MXFP8 quantized ops on Metal.
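The structure of NVFP4 can be sketched in a few lines: weights are stored as FP4 (E2M1) values, with one scale per 16‑element block. This is a simplified illustration of the format, not production code; the real format stores each block scale in FP8 (E4M3) plus a per‑tensor scale, while here the scale is kept in full precision for clarity.

```python
# Simplified sketch of NVFP4-style block quantization:
# FP4 (E2M1) values with one scale per 16-element block.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 magnitudes

def quantize_block(block):
    # Scale so the block's largest magnitude maps to the FP4 maximum (6.0).
    scale = max(abs(x) for x in block) / 6.0 or 1.0
    codes = []
    for x in block:
        # Round each scaled value to the nearest representable FP4 magnitude.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(mag if x >= 0 else -mag)
    return scale, codes

def dequantize_block(scale, codes):
    return [c * scale for c in codes]

block = [0.01 * i for i in range(16)]  # one 16-element block of weights
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
err = max(abs(a - b) for a, b in zip(block, restored))
```

Because the grid is a floating‑point one (denser near zero, sparser near the maximum), small weights keep more relative precision than under uniform INT4, which is where the accuracy advantage comes from.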
The practical upside of NVFP4 is being able to run local inference with the same quantization format used on NVIDIA GPUs in the cloud. You can bring over models optimized with NVIDIA’s Model Optimizer as‑is, so server and local outputs match. This quietly matters when using coding agents like Claude Code or OpenClaw.
Cache Improvements
KV‑cache management also saw major improvements in Ollama 0.19.
- **Cache reuse across conversations.** Previously the cache was discarded per conversation; now it is reused when requests share a system prompt. Tools like Claude Code send the same system prompt every time, so prefill from the second request onward gets much shorter.
- **Intelligent checkpoints.** Snapshots of the cache are saved at appropriate positions in the prompt, letting branching, agent‑like workflows resume computation midway.
- **Smarter eviction.** Shared prefixes now survive longer even when old branches are evicted.
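The core idea behind cross‑conversation reuse is simple: only the tokens after the longest prefix shared with the cached prompt need a fresh prefill. The toy sketch below mirrors that idea; it is not Ollama’s actual implementation.

```python
# Toy illustration of KV-cache prefix reuse: only tokens after the
# longest shared prefix with the cached prompt need a fresh prefill.
def reusable_prefix(cached_tokens, new_tokens):
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

system_prompt = list(range(1000))            # stand-in for a tokenized system prompt
turn1 = system_prompt + [2001, 2002]         # first request
turn2 = system_prompt + [3001, 3002, 3003]   # second request, same system prompt

reused = reusable_prefix(turn1, turn2)       # KV entries kept from the cache
to_prefill = len(turn2) - reused             # only the new tokens are recomputed
```

With a 1,000‑token system prompt, the second request only prefills its 3 new tokens, which is why agents that resend a fixed system prompt see the biggest wins.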
We dove deep on KV‑cache optimization in Hypura’s NVMe streaming and TurboQuant article, which targets cases where the model doesn’t fully fit in memory and must be streamed. Ollama 0.19’s cache changes focus on prompt‑processing efficiency assuming the model does fit, so they operate at a different layer. With both in place, the local LLM experience on Apple Silicon should feel much smoother.
What Changes When Moving From llama.cpp
```mermaid
graph TD
    A[Ollama 0.18 and earlier] --> B[llama.cpp / GGML]
    B --> C[GPU execution via Metal]
    C --> D[GGUF quantization such as Q4_K_M]
    E[Ollama 0.19] --> F[MLX]
    F --> G[Direct unified-memory access]
    G --> H[NVFP4 / int4 quantization]
    F --> I[M5 Neural Accelerator support]
    style A fill:#f5f5f5,stroke:#999
    style E fill:#f5f5f5,stroke:#999
    style F fill:#e8f5e9,stroke:#4caf50
    style I fill:#e3f2fd,stroke:#2196f3
```
Currently supported models are limited. In this preview stage, the target is Qwen3.5-35B-A3B, and they plan to broaden architecture coverage over time. An import feature for custom models is also on the roadmap.
This stands in contrast to the approach covered in the Flash‑MoE article, which used hand‑written Metal shaders to cram a 397B model into a 48GB machine. Flash‑MoE focuses on “running models that don’t fit” via SSD streaming, while Ollama’s move to MLX optimizes for “running models that fit, faster.”
Setup
A Mac with 32GB or more of unified memory is required.
```shell
# Use with Claude Code
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4

# Use with OpenClaw
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4

# Use for chat
ollama run qwen3.5:35b-a3b-coding-nvfp4
```
`ollama launch` connects the model to the specified application: under the hood it starts the Ollama server, loads the model, and hands the endpoint to the target app. In existing setups, such as building a RAG helpdesk with Ollama, you should be able to benefit from MLX simply by updating Ollama itself (as model coverage expands).
On Apple Silicon, local LLM inference now runs through three main routes: llama.cpp’s Metal backend, PyTorch’s MPS backend, and native MLX.