
Ollama Moves to MLX Backend, Dramatically Speeds Up Local Inference on Apple Silicon

Ollama has switched its Apple Silicon backend from llama.cpp to MLX. The new backend shipped as a preview in Ollama 0.19, released on March 30, 2026.

Until now, Ollama used llama.cpp (GGML) for CPU/GPU inference. To fully leverage Apple Silicon’s unified memory architecture, it makes sense to use Apple’s native MLX framework. A request for an MLX backend had been open on GitHub for a long time, and it has finally landed.

Benchmarks

Tests were run with Qwen3.5-35B-A3B (an MoE model with about 3B active parameters).

| Metric | Ollama 0.19 (MLX, NVFP4) | Ollama 0.18 (llama.cpp, Q4_K_M) | Improvement |
| --- | --- | --- | --- |
| Prefill | 1,810 tokens/s | 1,154 tokens/s | +57% |
| Decode | 112 tokens/s | 58 tokens/s | +93% |

With int4 quantization, the numbers are even higher: prefill 1,851 tokens/s and decode 134 tokens/s.

The near-doubling of decode speed stands out. Because decoding is memory‑bandwidth‑bound, this suggests MLX handles data placement on unified memory more efficiently than GGML. Prefill is compute‑bound, yet it still improves by 57%.
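
The bandwidth-bound nature of decoding can be sanity-checked with a back-of-the-envelope roofline estimate: each generated token must read roughly every active parameter's weights from memory, so throughput is capped near memory bandwidth divided by bytes read per token. The numbers below are illustrative assumptions, not measurements (in particular, the 200 GB/s bandwidth figure is a placeholder, not a published M5 spec):

```python
# Back-of-the-envelope decode ceiling for a bandwidth-bound MoE model.
# Assumed, illustrative numbers: ~3e9 active parameters (matching
# Qwen3.5-35B-A3B's ~3B active params), 4-bit weights, and a placeholder
# unified-memory bandwidth of 200 GB/s.

def decode_ceiling_tok_s(active_params: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode tokens/s: every active weight read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# ~3B active params at 4 bits -> ~1.5 GB read per generated token.
ceiling = decode_ceiling_tok_s(3e9, 4, 200.0)
print(f"~{ceiling:.0f} tokens/s ceiling")
```

Under these assumptions the ceiling lands in the low hundreds of tokens per second, which is the right order of magnitude for the measured decode figures and shows why reducing wasted memory traffic pays off directly.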

What Is MLX

MLX is an array‑computing framework that Apple open‑sourced at the end of 2023. It offers a NumPy‑like API and is optimized for Apple Silicon’s unified memory architecture. Its key feature is eliminating CPU↔GPU data copies, which makes it fundamentally different from PyTorch’s MPS backend.

PyTorch’s MPS backend takes a “translate CUDA code to Metal” approach, which often runs into Apple‑Silicon‑specific issues such as the ComfyUI MPS non‑contiguous tensor issue and BF16 performance drops. MLX is designed specifically for Apple Silicon, so there’s no translation tax in the first place.

The MLX ecosystem includes Python (mlx), Swift (mlx-swift), and C/C++ APIs, plus the mlx-lm package for LLM inference. Ollama 0.19 is built on top of MLX.

Support for the M5 Chip’s Neural Accelerator

On M5, M5 Pro, and M5 Max chips, MLX can tap the GPU‑integrated Neural Accelerators, which handle matrix multiplication through the Tensor Operations (TensorOps) API introduced in Metal 4.

Apple’s published benchmark compares a MacBook Pro with M5 (24GB) against an equivalent M4 machine.

| Model | M4 TTFT | M5 TTFT | Speedup |
| --- | --- | --- | --- |
| Qwen 1.7B (BF16) | 0.33 s | 0.15 s | 2.2x |
| Qwen 8B (BF16) | 3.02 s | 0.96 s | 3.1x |
| Qwen 14B (4bit) | 12.63 s | 4.24 s | 3.0x |
| Qwen 30B MoE (4bit) | 5.24 s | 2.70 s | 1.9x |

TTFT (Time to First Token) is a compute‑bound phase, so the Neural Accelerator’s effect shows up directly. Token generation speed is memory‑bandwidth‑bound, so the improvement is about 19–27%. Even so, it’s a clear step up over M4.
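
The speedup column is just the ratio of the two TTFT figures, which makes the table easy to verify:

```python
# Recompute the M4-vs-M5 TTFT speedups from the table above (seconds).
ttft = {
    "Qwen 1.7B (BF16)":    (0.33, 0.15),
    "Qwen 8B (BF16)":      (3.02, 0.96),
    "Qwen 14B (4bit)":     (12.63, 4.24),
    "Qwen 30B MoE (4bit)": (5.24, 2.70),
}
for model, (m4, m5) in ttft.items():
    print(f"{model}: {m4 / m5:.1f}x")  # matches the Speedup column
```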

Given that Ollama’s official benchmark numbers were measured on an M5 system, earlier chips won’t reach the same figures, but the MLX backend’s benefits should apply across all Apple Silicon generations.

NVFP4 Quantization Support

Ollama 0.19 supports NVIDIA’s NVFP4 (4‑bit floating point) format. NVFP4 is a low‑precision inference data format developed by NVIDIA that is said to degrade accuracy less than plain INT4 or unscaled FP4 quantization.

ollama run qwen3.5:35b-a3b-coding-nvfp4

Although it bears NVIDIA’s name, NVFP4 itself is a data‑format specification rather than hardware‑dependent. As long as you have kernels that implement the math, it runs on any hardware. Recent MLX releases natively support NVFP4 and MXFP8 quantized ops on Metal.
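
The general idea behind such block-scaled FP4 formats can be sketched in a few lines. This is a deliberately simplified illustration, not the NVFP4 spec: real NVFP4 packs FP4 (E2M1) values in blocks of 16 with an FP8 (E4M3) per-block scale plus a per-tensor FP32 scale, whereas here the block scale is kept as a plain float:

```python
# Simplified sketch of NVFP4-style block quantization in pure Python.
# Not the real spec: the per-block scale is a plain float here, while
# actual NVFP4 stores it in FP8 (E4M3) alongside a per-tensor FP32 scale.

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 (E2M1) magnitudes

def quantize_block(block):
    """Map a block of floats to (scale, list of (sign, FP4 code))."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0  # 6.0 is the largest representable E2M1 magnitude
    codes = []
    for x in block:
        mag = abs(x) / scale
        code = min(range(len(E2M1)), key=lambda i: abs(E2M1[i] - mag))
        codes.append((-1 if x < 0 else 1, code))
    return scale, codes

def dequantize_block(scale, codes):
    return [sign * E2M1[code] * scale for sign, code in codes]

block = [0.9, -0.1, 0.02, -0.45, 0.3, 0.0, 0.7, -0.8]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
```

Because the scale adapts per block, outliers in one block don't crush the precision of values elsewhere, which is the intuition behind the "degrades less than plain INT4" claim.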

The practical upside of NVFP4 is being able to run local inference with the same quantization format used on NVIDIA GPUs in the cloud. You can bring over models optimized with NVIDIA’s Model Optimizer as‑is, so server and local outputs match. This quietly matters when using coding agents like Claude Code or OpenClaw.

Cache Improvements

KV‑cache management also saw major improvements in Ollama 0.19.

Cache reuse across conversations. Previously the cache was discarded per conversation, but when using a shared system prompt it is now reused. Tools like Claude Code send the same system prompt every time, so the second and later prefills get much shorter.

Intelligent checkpoints. Snapshots of the cache are saved at appropriate positions in the prompt. This allows branching, agent‑like workflows to resume computation mid‑way.

Smarter eviction. Shared prefixes now survive longer even when old branches are evicted.
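
The effect of these three changes can be illustrated with a toy prefix cache. This is a hypothetical sketch, not Ollama's actual implementation: characters stand in for tokens, "checkpoints" are saved at sentence boundaries, and a new prompt only pays prefill for the suffix past the longest cached prefix:

```python
# Toy illustration of shared-prefix KV-cache reuse (not Ollama's real code).
# Characters stand in for tokens; the cache maps a prompt prefix to a
# notional KV snapshot.

cache = {}

def longest_cached_prefix(cache: dict, prompt: str) -> str:
    best = ""
    for prefix in cache:
        if prompt.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return best

def prefill(prompt: str) -> int:
    """Return how many tokens must actually be prefilled for this prompt."""
    reused = longest_cached_prefix(cache, prompt)
    # "Intelligent checkpoints": snapshot the cache at sentence boundaries.
    for i, ch in enumerate(prompt):
        if ch == ".":
            cache[prompt[:i + 1]] = "kv-snapshot"
    return len(prompt) - len(reused)

system = "You are a helpful coding agent."
first = prefill(system + " Fix the bug.")   # cold start: full prefill
second = prefill(system + " Add tests.")    # reuses the system-prompt prefix
```

The second request only prefills its own suffix, which is exactly why agents that resend the same long system prompt benefit the most.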

We dove deep on KV‑cache optimization in Hypura’s NVMe streaming and TurboQuant article, which targets cases where the model doesn’t fully fit in memory and must be streamed. Ollama 0.19’s cache changes focus on prompt‑processing efficiency assuming the model does fit, so they operate at a different layer. With both in place, the local LLM experience on Apple Silicon should feel much smoother.

What Changes When Moving From llama.cpp

graph TD
    A[Ollama 0.18 and earlier] --> B[llama.cpp / GGML]
    B --> C[GPU execution via Metal]
    C --> D[GGUF quantization such as Q4_K_M]

    E[Ollama 0.19] --> F[MLX]
    F --> G[Direct unified-memory access]
    G --> H[NVFP4 / int4 quantization]
    F --> I[M5 Neural Accelerator support]

    style A fill:#f5f5f5,stroke:#999
    style E fill:#f5f5f5,stroke:#999
    style F fill:#e8f5e9,stroke:#4caf50
    style I fill:#e3f2fd,stroke:#2196f3

Model support is currently limited: in this preview, the only target is Qwen3.5-35B-A3B, with broader architecture coverage planned over time. An import feature for custom models is also on the roadmap.

This stands in contrast to the approach covered in the Flash‑MoE article, which used hand‑written Metal shaders to cram a 397B model into a 48GB machine. Flash‑MoE focuses on “running models that don’t fit” via SSD streaming, while Ollama’s move to MLX optimizes for “running models that fit, faster.”

Setup

A Mac with 32GB or more of unified memory is required.
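
The 32GB floor is easy to sanity-check: at 4 bits per weight, a 35B-parameter model occupies roughly 17.5GB before the KV cache, activations, and macOS itself are accounted for (a rough estimate that ignores embedding precision and quantization-scale overhead):

```python
# Rough weight-memory estimate for a 4-bit 35B-parameter model.
# Overheads (KV cache, activations, scales, the OS) are ignored.
params = 35e9
bits_per_weight = 4
weight_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weight_gb:.1f} GB of weights")  # fits in 32 GB with headroom
```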

# To use with Claude Code
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4

# To use with OpenClaw
ollama launch openclaw --model qwen3.5:35b-a3b-coding-nvfp4

# To chat directly
ollama run qwen3.5:35b-a3b-coding-nvfp4

ollama launch is a command that connects the model to the specified application: under the hood it starts the Ollama server, loads the model, and passes an endpoint to the target app. In existing setups like building a RAG helpdesk with Ollama, you should be able to benefit from MLX simply by updating Ollama itself (as model coverage expands).
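
The endpoint handed to the app is Ollama's standard HTTP API (by default on localhost:11434). A minimal stdlib-only sketch of the kind of request an agent ends up sending, using the model tag from the commands above (whether the preview exposes any MLX-specific options is not assumed here):

```python
# Minimal sketch of a request to Ollama's standard HTTP generation API,
# built with only the Python stdlib. The urlopen call is commented out
# because it requires a running Ollama server.
import json
import urllib.request

payload = {
    "model": "qwen3.5:35b-a3b-coding-nvfp4",
    "prompt": "Write a binary search in Swift.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```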

On Apple Silicon, the options for local LLM inference now span llama.cpp’s Metal backend, PyTorch’s MPS backend, and MLX‑native stacks, and with this release Ollama has moved firmly into the MLX camp.