
Gemma 4 MTP drafter on M1 Max 64GB: 26B A4B +13%, 31B Dense and E4B got slower

Ikesan

I tested Gemma 4 MTP drafter on M1 Max 64GB with mlx-vlm 0.5.0.
The result didn’t match the official predictions: only the 26B A4B (MoE) got +13% faster, while the 31B Dense and E4B got slower.
This is the follow-up to my previous post, which walked through the official blog and the vLLM recipe; this one focuses on actual numbers and prompt dependence.

Test environment

  • Apple M1 Max 64GB unified memory
  • macOS Sonoma, Python 3.13.11, mlx-vlm 0.5.0, mlx-lm 0.31.3
  • Inference backend: mlx-vlm (its CLI supports --draft-kind mtp --draft-block-size N)
  • Models from the mlx-community org as-is
    • Base: gemma-4-E4B-it-bf16 / gemma-4-26B-A4B-it-4bit / gemma-4-31B-it-4bit
    • Drafter: gemma-4-{E4B,26B-A4B,31B}-it-assistant-bf16 (all drafters distributed in bf16 only)

The vLLM recipe recommends tensor parallel 2 for 26B A4B and 31B, but this setup is a single Apple Silicon machine serving one request at batch size 1, so both run single-request without TP.
This intentionally hits the conditions that the official docs warn about (“MoE routing is difficult on Apple Silicon at batch 1”).

Prompts and conditions

The main test uses a long, structured code generation prompt:

Write a Python function called fibonacci_memoized that computes
the n-th Fibonacci number using memoization. Include type hints,
a docstring, and an example usage block at the bottom.

All runs use --max-tokens 256, and every drafter run uses --draft-block-size 4 (the vLLM recipe baseline).

A separate smoke test used a short, creative haiku prompt:

Write a haiku about the ocean.

This one runs with --max-tokens 50 and finishes around 20 tokens.
Every model got slower with MTP on this prompt.
Output length and content flip the conclusion, so this article shows both sets of numbers, with code gen as the main case and haiku as comparison.

E4B bf16: drafter overhead wins

E4B is a 4B base model, and the drafter is only 78M, so MTP should help if the drafter can land its proposals. The actual result is the opposite.

| Condition | tok/s | Acceptance |
| --- | --- | --- |
| baseline | 29.6 | – |
| MTP block_size=4 | 18.1 | 0.74 / 147 rounds |
| MTP block_size=6 (haiku) | 12.9 | 0.62 / 13 rounds |

Acceptance of 0.74 per round means out of 4 proposed tokens, only about 0.7 are accepted.
A lighter base model has lower per-token generation cost, so the drafter call cost and verify computation become relatively heavy.
The premise that drafters are “much smaller than the base” — laid out in the original speculative decoding papers and Google’s blog — works less well here because the base is already light enough.
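
To put that premise in numbers, here is a minimal sketch of the standard speculative-decoding cost model. The cost ratios plugged in at the end are guesses for illustration, not measurements from this run:

```python
def spec_decode_speedup(accepted_per_round: float, block_size: int, cost_ratio: float) -> float:
    """Idealized speculative-decoding speedup over plain autoregressive decoding.

    Per round: block_size drafter steps plus one target verify pass,
    yielding accepted_per_round draft tokens plus one token from the verify pass.
    cost_ratio is (drafter step cost) / (target step cost).
    Ignores everything else (scheduling, sampling, per-call overhead).
    """
    tokens_per_round = accepted_per_round + 1
    cost_per_round = block_size * cost_ratio + 1   # in units of one target pass
    return tokens_per_round / cost_per_round

# E4B's measured numbers: 0.74 accepted per round at block_size=4.
# Break-even requires block_size * cost_ratio < accepted_per_round:
print(0.74 / 4)                            # drafter must cost < ~18% of a target pass
print(spec_decode_speedup(0.74, 4, 0.10))  # ~1.24x if the drafter were very cheap (assumed ratio)
print(spec_decode_speedup(0.74, 4, 0.30))  # ~0.79x once it isn't (assumed ratio)
```

The measured drop to 18.1 tok/s is bigger than this toy model predicts, so fixed per-call and verify overheads it ignores are clearly part of the story too.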

The haiku acceptance rate is even lower.
Creative prompts have higher next-token unpredictability, so the drafter struggles to land proposals at all.

26B A4B 4bit was the only one that sped up — even though it’s MoE

The official blog and AI for Developers docs warn that MoE models stall at batch 1: each token may hit a different expert, so verifying multiple draft candidates loads extra expert weights per candidate, canceling the drafter’s gain.
The actual measurement contradicts this — 26B A4B was the only model out of three that sped up.

| Condition | tok/s | Acceptance |
| --- | --- | --- |
| baseline | 57.4 | – |
| MTP block_size=4 | 65.1 | 2.41 / 75 rounds |

+13% throughput, with about 60% acceptance (2.41 out of 4 per round).
Code generation has predictable structure: def, return, if x is None:, and similar patterns are easy for the drafter to land.
The MoE expert-routing problem appears to get absorbed by the task characteristics of code generation.

26B A4B’s baseline of 57.4 tok/s is also strong on its own — Apple Silicon × MoE × 4bit quantization is a good combination here.
The MoE design that activates only 4B parameters per token plays well with unified memory bandwidth, which is most likely what’s happening underneath.

31B Dense 4bit got slower despite being “the model that benefits most”

31B Dense should be the model that benefits most from a drafter: it reads 30.7B parameters for every token, so verifying 4–8 draft tokens in a single pass should amortize those weight reads easily. That's the official explanation.
Actual measurement:

| Condition | tok/s | Acceptance |
| --- | --- | --- |
| baseline | 15.2 | – |
| MTP block_size=4 | 14.3 | 2.37 / 76 rounds |

−6% throughput, with about the same 60% acceptance as 26B A4B.
Slower despite a high acceptance rate — that’s the interesting part.

Breaking down the numbers: baseline produces 256 tokens in 16.8 seconds, MTP in 17.9 seconds.
Per-round drafter overhead is (17.9 − 16.8) / 76 ≈ 14ms, clearly larger than 26B A4B’s ~7ms.
Heavier target models also have heavier verification.
Even at 60% acceptance, the verify-included cost can’t be amortized, landing in slight regression.
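
The same arithmetic in a few lines of Python, using the table numbers above (this "overhead" is net extra wall-clock per round, so it already folds in whatever the verify pass costs):

```python
tokens = 256
baseline_tps, mtp_tps = 15.2, 14.3        # 31B Dense 4bit, from the table above
rounds = 76

baseline_time = tokens / baseline_tps      # ~16.8 s
mtp_time = tokens / mtp_tps                # ~17.9 s
extra_ms_per_round = (mtp_time - baseline_time) / rounds * 1000
print(f"{extra_ms_per_round:.1f} ms net overhead per round")   # ~14 ms
```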

The vLLM recipe recommends num_speculative_tokens 4–8 and tensor parallel 2 for 31B precisely to hide this verify cost behind GPU parallelism.
Apple Silicon single-request leaves little room for that kind of parallelization.

Where the official prediction and the measurement diverge

All three models, side by side:

| Model | baseline tok/s | + MTP tok/s | Delta | Acceptance | Official prediction |
| --- | --- | --- | --- | --- | --- |
| E4B bf16 | 29.6 | 18.1 | −39% | 0.74 / round | drafter speeds it up |
| 26B A4B 4bit | 57.4 | 65.1 | +13% | 2.41 / round | won’t gain at batch 1 |
| 31B 4bit | 15.2 | 14.3 | −6% | 2.37 / round | benefits the most |

All three predictions inverted.
This isn’t a “the official docs are wrong” story — the test environment they assume (vLLM recipe = data-center GPU + tensor parallel + multi-request batching) and what I’m running (Apple Silicon + single request + 4bit quantization) are different enough that the priority order flips entirely.
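
As a side note, the acceptance numbers are internally consistent with the 256-token budget if each round yields the accepted draft tokens plus one token from the verify pass (that's my reading of how mlx-vlm counts rounds, not something it documents):

```python
# (accepted per round, rounds) from the table above
runs = {
    "E4B bf16":     (0.74, 147),
    "26B A4B 4bit": (2.41, 75),
    "31B 4bit":     (2.37, 76),
}
for name, (accepted, rounds) in runs.items():
    total = accepted * rounds + rounds     # draft tokens + one verify token per round
    print(f"{name}: ~{total:.0f} tokens")  # all three land near 256
```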

```mermaid
flowchart LR
  A[Official assumption] --> B[H100/A100 + TP2 + multi-batch]
  B --> C[31B Dense gets max speedup]
  D[Actual setup] --> E[M1 Max + batch 1 + 4bit quantization]
  E --> F[26B A4B speeds up<br/>31B and E4B slow down]
```

Why only 26B A4B speeds up

Unified memory bandwidth fits 4bit-quantized MoE well.
26B A4B activates only 4B parameters per token by design, and 4bit quantization makes per-token weight reads small.
With MTP reading ahead, expert weights for verification can be loaded together, so memory I/O coalesces somewhat.

The drafter (0.4B bf16) to base (4B active in 4bit) size ratio also works here.
The drafter’s bf16 footprint isn’t too heavy relative to the active path of the base, so proposal cost stays within budget.
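
A back-of-envelope on the weight traffic behind that size ratio. This is deliberately crude: it counts only parameter bytes read once per forward pass and ignores KV cache, activations, embeddings, and any expert re-reads during verify:

```python
def weight_read_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate GB of parameter data read per forward pass."""
    return params_billion * bits_per_param / 8   # 1e9 params and 1e9 bytes per GB cancel out

print(weight_read_gb(4.0, 4))    # 26B A4B active path, 4-bit:  ~2.0 GB per token
print(weight_read_gb(0.4, 16))   # its drafter, bf16:           ~0.8 GB per draft token
print(weight_read_gb(30.7, 4))   # 31B Dense, 4-bit:            ~15.4 GB per token / verify pass
```

On these rough numbers the drafter step is not free next to the 26B A4B active path, but a single 31B Dense verify pass reads roughly 15 GB, which lines up with why its verify cost is so hard to amortize.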

Then there’s the task characteristic of code generation.
def, return, ), newline + indent — many of these tokens are predictable patterns that the drafter can land.
The 60% acceptance rate is largely from this effect.

E4B fails because the base is too light for the drafter overhead to pay off.
31B fails because the target’s verify computation is heavy even at 4bit, and Apple Silicon single-request can’t distribute it the way TP2 would.
Both ends collapse, leaving 26B A4B in the middle as the only fit.

Short haiku prompts make even MoE slower

The smoke test with 20–50 token short haiku prompts shows all three models slowing down with MTP:

| Model | baseline tok/s | + MTP tok/s | block | Acceptance |
| --- | --- | --- | --- | --- |
| E4B bf16 | 30.4 | 12.9 | 6 | 0.62 / 13 |
| 26B A4B 4bit | 55.9 | 36.2 | 4 | 1.27 / 11 |
| 31B 4bit | 15.6 | 9.2 | 4 | 1.33 / 9 |

Short outputs don’t generate enough tokens to amortize drafter overhead, and creative prompts drop acceptance rate further.
Code generation at 256 tokens and short haiku at 20 tokens flip the conclusion entirely.
Both “MTP makes it faster” and “MTP makes it slower” are true — strongly dependent on prompt length and content.

For search visitors asking “is Gemma 4 MTP faster or slower,” the answer is prompt-dependent. For short chat-style use, the regression risk dominates.

Things to confirm before trying this locally

For anyone wanting to repeat this on real hardware:

  • mlx-vlm 0.5.0 or later is required (check that --draft-kind mtp is present)
  • 6 models in total, around 45GB on disk (3 drafters in bf16 + 3 base models, two of them quantized). Watch free space
  • mlx-vlm prints Speculative decoding: X accepted tokens over Y rounds at the end of generation, so there is no need to instrument acceptance rate manually (see the parsing sketch after this list)
  • Optimal block_size differs per model. Reasonable starting points: 2 or 3 for E4B, 4 for 26B A4B, 4–6 for 31B
  • 26B A4B at 4bit takes about 18GB, around 20GB total with the drafter. 64GB is comfortable, 32GB is tight
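
If you want to turn that printed line into an acceptance-per-round figure, here is a minimal parsing sketch. The example string is reconstructed to match the 26B A4B code-gen run above, and the exact wording should be treated as version-dependent:

```python
import re

# Reconstructed example line consistent with the 26B A4B run (2.41 accepted/round over 75 rounds)
line = "Speculative decoding: 181 accepted tokens over 75 rounds"
m = re.search(r"Speculative decoding: (\d+) accepted tokens over (\d+) rounds", line)
if m:
    accepted, rounds = map(int, m.groups())
    print(f"{accepted / rounds:.2f} accepted per round")  # 2.41
```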

The 26B A4B result alone is worth trying on real hardware.
If you trust the official docs (“won’t gain at batch 1”), you’d skip the test — but on real hardware, for code generation, it’s the only source of speedup of the three.

References