
Gemma 4 MTP drafter: 3x speedup for Dense, limited gains on 26B MoE at batch 1


Update (2026-05-07): I tested this on M1 Max 64GB. Only the 26B A4B (MoE) gained +13%, while the 31B Dense and E4B got slower — all three models inverted the official prediction. → Gemma 4 MTP drafter on M1 Max 64GB: 26B A4B +13%, 31B Dense and E4B got slower

Google released Multi-Token Prediction drafters for Gemma 4.
The official blog says “up to 3x faster with no quality loss,” but actual speedup varies by model size and execution environment.
This is not hands-on benchmarking — it’s notes from reading the official blog, Google AI for Developers MTP documentation, and vLLM recipes.

The Gemma 4 model family article covered the 26B A4B MoE architecture, 31B Dense, and E2B/E4B edge models.
The MTP drafter is not a re-release of the base model. It’s an auxiliary model that attaches to existing Gemma 4 for inference.

An auxiliary model that reads ahead instead of waiting one token at a time

Standard LLM generation is autoregressive: produce one token, feed it back, produce the next.
Simple, but each token requires reading the full model weights from memory. GPU compute often sits idle while memory bandwidth becomes the bottleneck.
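
To see why, a rough ceiling estimate helps: if every generated token has to stream the full weights once, memory bandwidth alone caps tokens per second. The sketch below uses the 31B Dense parameter count mentioned later in this article; the precision and bandwidth figures are illustrative assumptions, not measurements.

```python
# Rough decode-speed ceiling from memory bandwidth alone.
# Precision and bandwidth values are illustrative assumptions.
def decode_ceiling(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s when every token must stream all weights once."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# 31B Dense at 4-bit (~0.5 bytes/param) on ~400 GB/s unified memory:
print(decode_ceiling(30.7, 0.5, 400))   # ~26 tok/s ceiling, ignoring KV cache and activations
```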

The MTP drafter is a lightweight auxiliary model that predicts several tokens ahead at once.
The Gemma 4 base model then verifies those candidates in parallel. Accepted tokens are emitted together.
Where the draft is wrong, the base model produces the correct token. Output quality stays identical to standard generation — that’s Google’s claim.

This is speculative decoding, but the Gemma 4 drafter is not a fully independent small model.
According to Google’s documentation, it shares input embeddings with the base model, uses the base model’s final-layer activations, and shares the KV cache.
It’s closer to an auxiliary head riding on the base model’s intermediate results than a standalone context-reading assistant.

```mermaid
flowchart LR
  A[Input + past context] --> B[Gemma 4 base]
  B --> C[Final-layer activations]
  C --> D[MTP drafter]
  D --> E[Propose next N tokens]
  B --> F[Verify proposals in parallel]
  E --> F
  F --> G[Emit accepted tokens]
```

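As a structural sketch of that loop, here is generic speculative decoding in plain Python. This is not Google's actual MTP implementation; `base_next` and `draft_next` are hypothetical stand-ins for greedy next-token prediction by the base model and the drafter.

```python
from typing import Callable, List

# Minimal draft-then-verify loop. `base_next` and `draft_next` are hypothetical
# stand-ins, not a real API; a real engine batches the verification step.
def speculative_decode(base_next: Callable[[List[int]], int],
                       draft_next: Callable[[List[int]], int],
                       tokens: List[int], max_new: int, k: int = 4) -> List[int]:
    start = len(tokens)
    while len(tokens) - start < max_new:
        # 1. Drafter proposes k tokens cheaply.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            ctx.append(draft_next(ctx))
            draft.append(ctx[-1])
        # 2. Base model checks the proposals. In a real engine this is a single
        #    batched forward pass over all k positions; that is where the speedup comes from.
        for proposed in draft:
            correct = base_next(tokens)
            if proposed == correct:
                tokens.append(proposed)      # accepted, no extra base-model pass needed
            else:
                tokens.append(correct)       # rejected: the base model's token wins
                break                        # everything after the miss is discarded
    return tokens

# Toy run: the "base model" counts upward, the drafter is right 3 times out of 4.
base = lambda ctx: ctx[-1] + 1
drafter = lambda ctx: ctx[-1] + 1 if ctx[-1] % 4 else ctx[-1] + 2
print(speculative_decode(base, drafter, [0], max_new=12))
```
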
For E2B and E4B, instead of computing logits over the full 262K vocabulary every time, centroids masking narrows candidates to roughly 4K.
The vLLM recipe says this makes lm_head computation about 45x lighter.
Smaller models spend a larger fraction of compute on the final logit layer, so a dedicated optimization for E2B/E4B makes sense.
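
A rough sketch of what the masking buys, using the 262K vocabulary and ~4K candidate figures from above. The hidden size is shrunk to keep the toy small, and the candidate selection is a random stand-in for the real centroid lookup.

```python
import numpy as np

# Sizes: vocab and candidate counts from the text; hidden_dim shrunk for the toy.
hidden_dim, vocab, candidates = 256, 262_144, 4_096
rng = np.random.default_rng(0)
lm_head = rng.standard_normal((vocab, hidden_dim), dtype=np.float32)
h = rng.standard_normal(hidden_dim, dtype=np.float32)

# Full lm_head: one matmul against every vocabulary row per decoding step.
full_logits = lm_head @ h                              # 262K rows touched

# Centroids masking (conceptually): a cheap lookup picks ~4K plausible rows,
# and logits are computed only for those. Random selection stands in here
# for the real centroid search.
cand_ids = rng.choice(vocab, size=candidates, replace=False)
masked_logits = lm_head[cand_ids] @ h                  # ~4K rows touched

print(full_logits.shape, masked_logits.shape, round(vocab / candidates))  # ~64x fewer rows
```

The raw row count drops by roughly 64x; the recipe's ~45x figure presumably reflects the added cost of the centroid lookup itself.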

The 3x claim is not uniform across all models

Google’s blog shows tokens-per-second improvements across LiteRT-LM, MLX, Hugging Face Transformers, and vLLM.
The MTP drafter itself is Apache 2.0 licensed and available via Hugging Face, Kaggle, MLX conversions, vLLM, SGLang, Ollama, and Google AI Edge Gallery.

The 26B A4B has a catch, though.
Google AI for Developers’ MTP documentation explains that in MoE models, each token may hit different experts. When verifying multiple draft candidates, additional expert weight loads are triggered, which can cancel out the drafter’s speed gains.
At batch size 1, expert overlap between candidates is low, so speed barely improves in environments with limited parallelism.

The official blog adds that 26B MoE routing is difficult on Apple Silicon at batch size 1. Increasing batch size to 4–8 brings local speedup to roughly 2.2x.
NVIDIA A100 shows similar improvement with larger batch sizes, suggesting server deployments that batch multiple requests will benefit more than interactive single-request use.

Compare this with Qwen3.5-35B-A3B, which maintained speed as context length grew.
Qwen3.5’s SSM+Attention hybrid shrinks the KV cache itself.
Gemma 4 MTP reduces how often the heavy base model runs per token.
Both reduce inference bottlenecks, but at different levels.

The vLLM recipe lists MTP assistant models for E2B, E4B, 26B A4B, and 31B.
Recommended num_speculative_tokens is 2 for E2B, 4 for E4B and 26B A4B, and 4–8 for 31B.
Tensor parallel 2 is also recommended for 26B A4B and 31B.
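
As a sketch of what the vLLM side might look like: the model IDs below are placeholders, and the speculative-decoding argument names differ between vLLM versions, so treat the official recipe as the authority for the exact flags.

```python
from vllm import LLM, SamplingParams

# Sketch only: model IDs are placeholders, and speculative-decoding argument
# names vary across vLLM versions -- check the official recipe for your version.
llm = LLM(
    model="google/gemma-4-26b-a4b",                  # placeholder target model ID
    tensor_parallel_size=2,                          # recipe recommendation for 26B A4B / 31B
    speculative_config={
        "model": "google/gemma-4-26b-a4b-mtp",       # placeholder MTP assistant ID
        "num_speculative_tokens": 4,                 # recipe value for 26B A4B
    },
)
out = llm.generate(["Explain speculative decoding in two sentences."],
                   SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```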

Slower target models make the drafter’s read-ahead more worthwhile.
31B Dense reads 30.7B parameters per token, so the drafter’s overhead is easily recovered even at 4–8 speculative tokens.
E2B’s base model is light enough that too many speculative tokens make the drafter’s overhead noticeable.

26B A4B sits in between on paper, but MoE routing complicates things.
vLLM’s recommended values were measured on A100/H100. The same numbers won’t necessarily apply on a Mac, a consumer GPU, or Ollama with quantized models.
More num_speculative_tokens doesn’t automatically mean faster — the sweet spot depends on acceptance rate, drafter compute cost, and expert weight loading.
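
A toy expected-speedup model makes that trade-off visible. It assumes every draft token is accepted independently with a fixed probability, which real drafters don't satisfy, and the acceptance rate and drafter cost below are made up.

```python
# Toy speculative-decoding speedup model. Assumes each draft token is accepted
# independently with probability `accept`, and one drafter step costs `draft_cost`
# of a base-model forward pass. All numbers are illustrative.
def expected_speedup(accept: float, k: int, draft_cost: float) -> float:
    # Expected tokens emitted per verification step: the accepted prefix plus the
    # base model's own token (the standard speculative-decoding bound).
    tokens_per_step = (1 - accept ** (k + 1)) / (1 - accept)
    cost_per_step = 1 + k * draft_cost        # one base pass + k drafter passes
    return tokens_per_step / cost_per_step

for k in (2, 4, 8):
    print(k, round(expected_speedup(accept=0.8, k=k, draft_cost=0.05), 2))
# ~2.22, ~2.80, ~3.09: higher k helps only while acceptance holds up and the drafter stays cheap.
```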

Long output is where local users see the difference

The MTP drafter doesn’t magically cut Time to First Token.
It speeds up the decode phase after the first response token is produced.
For short chats returning 20 tokens, the difference is barely noticeable. For code generation, long summaries, or multi-step local agent output, the saved decode time adds up.
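
Rough arithmetic with made-up throughput numbers shows why output length is the deciding factor.

```python
# Made-up numbers: 20 tok/s baseline, 2x with the drafter.
baseline_tps, mtp_tps = 20, 40
for n_tokens in (20, 2_000):
    saved = n_tokens / baseline_tps - n_tokens / mtp_tps
    print(f"{n_tokens} tokens: save {saved:.1f} s")
# 20 tokens: save 0.5 s   -> invisible in a chat turn
# 2000 tokens: save 50.0 s -> very visible in code generation or agent output
```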

This also applies to local LLMs running MCP or tool calling.
As covered in Ollama + local LLMs need a bridge for MCP servers, tool definitions and tool results inflate context and output length.
With Gemma 4’s tool calling and thinking mode enabled, generated token counts grow well beyond a typical chat turn.

MTP is not a zero-cost addition to memory, though.
The assistant model loads alongside the target model, and vLLM recommends tensor parallel 2 for both 31B and 26B.
On a single 24GB VRAM GPU or Apple Silicon unified memory, something has to give — model quantization, context length, or batch size.

Why Gemma went MoE

Gemma 4’s 26B A4B going MoE wasn’t a sudden pivot.
Google has been a MoE pioneer since the 2017 Sparsely-Gated MoE Layer paper (Shazeer et al.), followed by GShard (2020), Switch Transformer (2021), and GLaM (2022 — 1.2T parameters matching GPT-3 quality at 1/3 the compute).
The Gemini series is also widely understood to use MoE architectures.

If anything, Gemma 1 through 3 being Dense-only for three generations was the oddity. Google likely held off until their open-weight MoE design was mature enough.
The Gemma 4 model family article details the parallel Dense FFN + MoE execution that absorbs routing instability — a design unique to Gemma among MoE models.
That design reflects lessons from GLaM and Switch Transformer on how to handle MoE’s weaknesses.

Mistral’s Mixtral, Qwen’s MoE series, and Kimi K2.5’s 1T MoE appeared from late 2023 onward, each favoring fewer active parameters to cut inference cost.
Gemma joined with Gemma 4.

MoE and autoregression are orthogonal

When discussing MTP it’s easy to collapse this into “MoE is slow, Dense is faster,” but autoregressive generation and MoE are independent concepts.

Autoregression is the token generation method: look at previous tokens, emit the next one, feed it back.
Gemma 4, Qwen3.5, Kimi K2.5, Llama, GPT — all autoregressive during generation.
The MTP drafter targets the bottleneck of running the full base model for every single token. Whether that model is MoE or Dense is a separate question.

MoE is about compute distribution within each forward pass.
Processing one token either activates all parameters (Dense) or only router-selected experts (MoE). The autoregressive generate-then-feed-back loop is the same either way.
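
A toy forward pass makes the distinction concrete. The dimensions, expert count, and top-k here are arbitrary, chosen only to show the routing step.

```python
import numpy as np

# Toy sizes -- arbitrary, just to show where routing happens.
rng = np.random.default_rng(0)
d, n_experts, top_k = 64, 8, 2
experts = rng.standard_normal((n_experts, d, d)).astype(np.float32)   # MoE expert weights
dense_w = rng.standard_normal((d, d)).astype(np.float32)              # single dense FFN weight
router = rng.standard_normal((d, n_experts)).astype(np.float32)

def dense_ffn(x):
    # Dense: every token multiplies against the same full weight matrix.
    return x @ dense_w

def moe_ffn(x):
    # MoE: a router scores the experts, only the top-k are actually computed.
    scores = x @ router
    chosen = np.argsort(scores)[-top_k:]
    gate = np.exp(scores[chosen]); gate /= gate.sum()
    return sum(g * (x @ experts[i]) for g, i in zip(gate, chosen)), chosen

x = rng.standard_normal(d).astype(np.float32)
out_dense = dense_ffn(x)
out_moe, used = moe_ffn(x)
print("experts touched for this token:", sorted(int(i) for i in used))
# The generate-then-feed-back loop around this forward pass is identical either way.
```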

So why does MoE blunt MTP’s speedup?
As discussed earlier, the issue is in the speculative decoding verification phase.
When the base model verifies multiple draft candidates in parallel, each candidate may activate different experts.
A Dense model uses the same weights for all candidates — one memory read suffices. MoE may need to load different expert weights per candidate.
It’s not “MoE makes autoregression slower.” It’s “MoE increases memory I/O during speculative parallel verification.”
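
A back-of-envelope model shows how verification widens the expert footprint. It assumes each candidate picks its experts uniformly and independently, which is an oversimplification, and the expert counts are illustrative rather than Gemma 4's actual configuration.

```python
# Expected number of distinct experts touched when verifying k draft candidates,
# assuming each candidate independently activates `active` of `total` experts
# uniformly at random -- an oversimplification, but it shows the trend.
def expected_unique_experts(total: int, active: int, k: int) -> float:
    return total * (1 - (1 - active / total) ** k)

for k in (1, 4, 8):
    print(k, round(expected_unique_experts(total=32, active=2, k=k), 1))
# 1 -> 2.0, 4 -> 7.3, 8 -> 12.9: more candidates means more expert weights to stream.
```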

Dense vs MoE gap on reasoning tasks

Gemma 4 ships both 31B Dense and 26B A4B MoE, making this a rare same-generation architecture comparison.
Focusing on reasoning benchmarks from the model family article:

AIME 2026 (math reasoning): 31B scores 89.2%, 26B A4B scores 88.3%. Only 0.9 points apart.
Math problems have clear structure where each reasoning step uses similar computation types, keeping MoE routing stable.
The parallel Dense FFN layer absorbs routing fluctuations.

LiveCodeBench v6 (coding): 80.0% vs 77.1%, a 2.9-point gap.
Coding tasks span a wider range than math, making routing variance more likely to show. Still, 77% from only 3.8B active parameters is strong.

BigBench Extra Hard (diverse complex reasoning): 74.4% vs 64.8%, a 9.6-point gap.
The task diversity here means “picking the right expert” matters much more. The Dense FFN can’t absorb all routing misses across this range.

Long context (MRCR v2 128K): 66.4% vs 44.1%, a 22+ point gap.
This is less about MoE and more about global attention layers — 31B has 60 layers vs 26B A4B’s 30. Long-range information retrieval depends directly on layer depth and global attention coverage.

Single-topic deep reasoning (math, code) lets MoE approach Dense quality.
Diverse reasoning and long-context tasks favor Dense.
Adding the MTP drafter doesn’t change this quality gap — when drafts are rejected, the base model produces the correct token, keeping output identical to standard generation.
What changes is speed: MoE’s lower draft acceptance rate reduces the throughput gain. The batch-1 MoE stall is this structural issue at work.

References