Google's Gemma 4 launches in four sizes (E2B to 31B), bringing Gemini 3–derived technology to open weights under Apache 2.0
Google has released Gemma 4, an open‑weights model family built with Gemini 3 technology, in four sizes at once. It ships under the Apache 2.0 license. On Hacker News it has drawn over 1,300 points and more than 390 comments.
The previous generation, Gemma 3, came only in a single 27B size. In the same period, Qwen rolled out task‑specific MoE models—a general‑purpose text model (Qwen3.5-35B-A3B), a coding‑focused model (Qwen3-Coder-Next), and an omni‑modal model (Qwen3-Omni). Kimi K2.5 chased the top spot with 1T parameters, and Sarvam covered the Indian market with 30B/105B. Google alone stuck to a single 27B, leaving the lineup looking thin.
With Gemma 4, Google fills the size gap in one shot, spanning from the 2B class for edge devices to 31B for desktops—four models at once. More interesting than the quantity, though, are the distinctive architectural choices they made for MoE.
The four‑model lineup
| Model | Total parameters | Active parameters | Layers | Context length |
|---|---|---|---|---|
| E2B | 5.1B (effective 2.3B) | 2.3B | 35 | 128K |
| E4B | 8B (effective 4.5B) | 4.5B | 42 | 128K |
| 26B A4B (MoE) | 25.2B | 3.8B | 30 | 256K |
| 31B Dense | 30.7B | 30.7B | 60 | 256K |
The “E” in E2B and E4B stands for Edge; these models target inference on low‑power devices such as smartphones, Raspberry Pi, and Jetson Nano. E4B also supports speech input, with an audio encoder of about 300M parameters.
Different lineup strategies
This sizing strategy clearly diverges from competitors.
Qwen’s approach is “a single MoE framework hosting task‑specific models.” There are separate models for general text (Qwen3.5-35B-A3B), coding (Qwen3-Coder-Next), and omni‑modal (Qwen3-Omni). They keep active parameters around 3B while differentiating via training data and architectural details.
Gemma 4’s approach is “scale the same architecture across sizes.” A single design (Dense FFN in parallel with MoE + hybrid attention) runs from E2B up to 31B. Rather than task specialization, all models support vision and tool calling.
Which is better depends on use case, but the developer impact is clear. With Qwen you must pick a model for each task. With Gemma 4 you pick a size based on device compute. The latter lowers model‑selection overhead.
26B A4B’s competitive position
The most noteworthy model is 26B A4B. It’s an MoE (Mixture of Experts) design that activates only 3.8B parameters per token out of 25.2B total—aiming for 26B‑class accuracy at roughly 4B‑class inference cost.
That 3.8B active figure places it among peers as follows:
| Model | Total parameters | Active parameters | Active fraction |
|---|---|---|---|
| Gemma 4 26B A4B | 25.2B | 3.8B | 15.1% |
| Qwen3-Omni | 30B | 3.3B | 11.0% |
| Qwen3-Coder-Next | 80B | 3B | 3.8% |
| Sarvam 30B | 30B | 2.4B | 8.0% |
| Kimi K2.5 | 1T | 32B | 3.2% |
Where Qwen3‑Omni and Qwen3‑Coder‑Next cluster around 3B active, Gemma 4 is a bit higher at 3.8B. This stems from an architectural choice (parallel Dense FFN; see below), which shifts the inference‑accuracy tradeoff relative to other models.
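The active fractions in the table follow directly from active ÷ total parameters; a quick sanity check using the figures from the comparison table:

```python
# Active-parameter fraction = active params / total params, in percent.
# (total_B, active_B) pairs are taken from the comparison table above.
models = {
    "Gemma 4 26B A4B":  (25.2, 3.8),
    "Qwen3-Omni":       (30.0, 3.3),
    "Qwen3-Coder-Next": (80.0, 3.0),
    "Sarvam 30B":       (30.0, 2.4),
    "Kimi K2.5":        (1000.0, 32.0),
}

fractions = {name: round(active / total * 100, 1)
             for name, (total, active) in models.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac}%")
```

The results reproduce the table exactly: 15.1%, 11.0%, 3.8%, 8.0%, and 3.2%.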
MoE architecture in 26B A4B
MoE (“Mixture of Experts”) places multiple “experts” in the intermediate layers of a neural network, with a router choosing which experts to use for each input token. Because only part of the total parameters are used per token, inference is light for the model’s scale. Qwen3-Omni and Qwen3.5‑397B with Flash‑MoE also use MoE.
Gemma 4’s implementation departs from the common pattern in this area.
The “Dense‑FFN‑in‑parallel” design
Typical MoE models replace each Transformer layer’s FFN (feed‑forward network) with a bank of experts—“MoE instead of FFN.” Kimi K2.5 (384 experts + 1 shared), Sarvam 30B/105B (128 experts), and Qwen3‑Coder‑Next (10 active out of 512 experts) all follow this “replace FFN with MoE” pattern.
Gemma 4’s 26B A4B instead runs a Dense FFN (GeGLU, hidden=2,112) and a 128‑expert MoE (top‑8, each hidden=704) in parallel and sums their outputs. There is also one “shared expert” that every token always passes through.
```mermaid
graph LR
A[Input token] --> B[Dense FFN<br/>GeGLU h=2112]
A --> C[Router]
C --> D[Top-8 experts<br/>8 of 128 selected<br/>each h=704]
A --> E[Shared expert<br/>always active]
B --> F[Sum]
D --> F
E --> F
F --> G[Output]
```
Each layer thus has both the “always‑available general expressivity” of a Dense FFN and the “input‑dependent specialization” of MoE. The structure is redundant: the Dense FFN can carry the layer by itself, with MoE adding accuracy on top.
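The structure can be sketched in a few dozen lines of PyTorch. This is a minimal toy version under stated assumptions, not Gemma 4's actual implementation: the module names, the softmax-over-selected gating, and the scaled-down dimensions (the article's real figures are dense hidden=2,112, 128 experts of hidden=704, top-8, one shared expert) are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    """GeGLU feed-forward: gelu(x W_gate) * (x W_up), projected back down."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden, bias=False)
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.gelu(self.gate(x)) * self.up(x))

class ParallelDenseMoE(nn.Module):
    """Dense FFN + top-k MoE + shared expert, with outputs summed.

    Toy sizes; gating details are an assumption, not Gemma 4's spec.
    """
    def __init__(self, d_model=64, n_experts=16, top_k=4,
                 d_dense=128, d_expert=32):
        super().__init__()
        self.dense = GeGLU(d_model, d_dense)      # always-on dense path
        self.shared = GeGLU(d_model, d_expert)    # always-on shared expert
        self.experts = nn.ModuleList(
            GeGLU(d_model, d_expert) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, d_model)
        logits = self.router(x)                    # (tokens, n_experts)
        gate_vals, idx = logits.topk(self.top_k, dim=-1)
        gate_vals = gate_vals.softmax(dim=-1)      # normalize over selected
        moe_out = torch.zeros_like(x)
        for slot in range(self.top_k):             # naive loop, for clarity
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    w = gate_vals[mask, slot].unsqueeze(-1)
                    moe_out[mask] += w * self.experts[e](x[mask])
        # Key design point: the dense path runs for EVERY token and is
        # simply added, so a routing miss cannot zero out the layer.
        return self.dense(x) + self.shared(x) + moe_out

layer = ParallelDenseMoE()
out = layer(torch.randn(10, 64))
print(out.shape)  # torch.Size([10, 64])
```

Note that removing `self.dense(x)` from the final sum recovers the conventional “replace FFN with MoE” pattern used by Kimi K2.5 and Qwen3‑Coder‑Next.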
This contrasts with Transformer‑Mamba hybrids like Nemotron Nano 9B and Holotron. Nemotron/Holotron combine “Transformer accuracy + SSM memory efficiency” (e.g., Holotron‑12B compresses long action histories with SSM while using Transformer for recent decisions). Gemma 4 instead combines “Dense FFN stability + MoE specialization.” Different architectures, same “take the best of two worlds” mindset.
Comparing MoE designs across models
| Aspect | Gemma 4 26B A4B | Kimi K2.5 | Qwen3‑Coder‑Next | Sarvam 30B |
|---|---|---|---|---|
| Experts | 128 | 384 | 512 | 128 |
| Top‑K | 8 | 8 | 10 | – |
| Shared expert | 1 | 1 | 1 | None |
| Dense FFN in parallel | Yes | No | No | No |
| Routing | – | – | – | sigmoid |
Both Gemma 4 and Kimi K2.5 use Top‑8 routing. As noted in the Kimi K2.5 article, Top‑8 is a choice that favors expressivity compared to the more common Top‑2 (e.g., Mixtral).
Qwen3‑Coder‑Next (10 of 512 experts active) bets on “breadth of choice,” a different direction from Gemma 4’s 8 of 128. Sarvam pursues stability via sigmoid routing and expert load balancing.
Gemma 4’s Dense‑FFN‑in‑parallel acts as a fallback when MoE routing misses. In other models, routing accuracy directly drives output quality; in Gemma 4 the Dense FFN always guarantees a baseline of expressivity. In exchange for that stability you pay with more active parameters. The gap between Qwen3.5‑35B‑A3B’s 3B active and Gemma 4’s 3.8B is essentially this Dense‑FFN overhead.
So does the extra 0.8B pay off?
Benchmarks
From the model cards (31B‑it and 26B‑A4B‑it are instruction‑tuned variants):
| Benchmark | 31B | 26B A4B | E4B | E2B |
|---|---|---|---|---|
| MMLU Pro (knowledge) | 85.2% | 82.6% | 69.4% | 60.0% |
| AIME 2026 (math; no tools) | 89.2% | 88.3% | 42.5% | 37.5% |
| LiveCodeBench v6 (coding) | 80.0% | 77.1% | 52.0% | 44.0% |
| GPQA Diamond (scientific knowledge) | 84.3% | 82.3% | 58.6% | 43.4% |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% |
Vision benchmarks (models with vision encoders):
| Benchmark | 31B | 26B A4B | E4B |
|---|---|---|---|
| MMMU Pro (multimodal understanding) | 76.9% | 73.8% | 52.6% |
| MATH‑Vision | 85.6% | 82.4% | 59.5% |
Long‑context performance (MRCR v2 128K) shows a large gap: 31B at 66.4% vs. 26B A4B at 44.1%. Dense 31B has more global‑attention layers and thus an advantage in capturing long‑range dependencies.
From here, let’s compare by task.
Math: 88% with under 4B active
| Model | AIME score | Active parameters |
|---|---|---|
| Kimi K2.5 | 96.1% (AIME 2025) | 32B |
| Sarvam 105B | 88.3% (no tools) / 96.7% (with tools) | – |
| Gemma 4 31B | 89.2% (AIME 2026) | 30.7B |
| Gemma 4 26B A4B | 88.3% (AIME 2026) | 3.8B |
| Sarvam 30B | 80.0% | 2.4B |
| Qwen3‑Omni | 73.7% (Thinking) | 3.3B |
At 88.3%, Gemma 4 26B A4B is exceptional for 3.8B active—nearly 15 points above Qwen3‑Omni (73.7%) in the same ~3B band. Note, however, the AIME versions differ (Gemma 4: 2026; Qwen3‑Omni: 2025), so direct comparison needs caution.
Also notable: the gap to the 31B dense model is just 0.9 points. For math, the Dense‑FFN‑in‑parallel MoE pays off—3.8B active delivers nearly 31B‑class accuracy. Versus Sarvam 30B (2.4B active at 80.0%), the extra 1.4B active buys roughly +8.3 points; the Dense FFN cost clearly pays on math.
Coding: strong for a general model
| Model | Score | Notes |
|---|---|---|
| Kimi K2.5 | 85.0% (LiveCodeBench v6) | 32B active |
| Gemma 4 31B | 80.0% (LiveCodeBench v6) | 30.7B dense |
| Gemma 4 26B A4B | 77.1% (LiveCodeBench v6) | 3.8B active |
| Sarvam 105B | 71.7% (LiveCodeBench v6) | – |
| Sarvam 30B | 70.0% (LiveCodeBench v6) | 2.4B active |
| Qwen3‑Coder‑Next | 70.6% (SWE‑Bench) | 3B active |
77.1% for Gemma 4 26B A4B is very strong for a general model with 3.8B active. Considering that the coding‑specialized Qwen3‑Coder‑Next posts 70.6% (different benchmark, so not apples‑to‑apples), it’s impressive to get this far without specialization.
That said, coding‑agent usefulness isn’t captured by a single score. Cursor Composer 2 applies coding‑specific RL to Kimi K2.5 and reaches 61.3% on CursorBench, close to GPT‑5.4 Thinking (63.9%). For Gemma 4, a key question is how well it serves as a base for “RL‑boosted coding.” Given the Apache 2.0 license, third‑party RL‑tuned derivatives (à la K2.5) are very plausible.
Multimodal: similar accuracy, differentiated by breadth
| Model | Score | Vision encoder |
|---|---|---|
| Kimi K2.5 | 78.5% (MMMU Pro) | MoonViT 400M |
| Gemma 4 31B | 76.9% (MMMU Pro) | 550M |
| Gemma 4 26B A4B | 73.8% (MMMU Pro) | 550M |
| Qwen3‑Omni Thinking | 74.9% (MMStar) | SigLIP2 543M |
Vision accuracy is broadly similar. But Gemma 4’s 550M vision encoder targets text+image only, while Qwen3‑Omni uses a similarly sized SigLIP2 (543M) to cover image+audio+video. If accuracy is comparable, Qwen3‑Omni wins on modality breadth.
Gemma 4, however, standardizes vision across all four models—even E2B (5.1B) can understand images. Qwen3‑Omni is a single 30B model needing 69GB VRAM, so for image understanding on edge devices Gemma 4 faces little competition. “Vision on small models” is a clear strength.
Knowledge: raw parameter count matters
| Model | Score |
|---|---|
| Sarvam 105B | 90.6% (MMLU) |
| Gemma 4 31B | 85.2% (MMLU Pro) |
| Sarvam 30B | 85.1% (MMLU) |
| Gemma 4 26B A4B | 82.6% (MMLU Pro) |
MMLU Pro and MMLU differ (the former is harder), but the trend is readable. Knowledge tasks show a 2.6‑point gap between 31B and A4B—larger than the 0.9‑point math gap. Breadth of knowledge tends to depend on total parameters; active parameters alone don’t fully compensate.
Interestingly, Sarvam 30B (2.4B active) posts 85.1%, essentially tied with Gemma 4 31B dense (85.2%). Sarvam is trained on India‑specific corpora like Common Crawl India and Bharat‑Text; coverage differs, but parameter efficiency sometimes favors Sarvam.
Where the extra 0.8B from the Dense FFN pays off
Across benchmarks, the additional 0.8B active parameters in Gemma 4 26B A4B (the Dense‑FFN portion) help most in math and coding, less in pure knowledge tasks. The harder and less stable the routing (complex reasoning), the more the Dense‑FFN fallback seems to help.
Attention design
All Gemma 4 models alternate local sliding‑window attention with global full attention layer by layer.
| Attention setting | E2B / E4B | 26B / 31B |
|---|---|---|
| Local window size | 512 tokens | 1,024 tokens |
| Global-layer position encoding | Proportional RoPE | Proportional RoPE |
Sliding‑window attention restricts each token to attending only to the most recent N tokens, so compute scales linearly with context length. Because it can't capture far‑away information, global‑attention layers are interleaved to gather document‑level context.
Global layers use Proportional RoPE (p‑RoPE). Standard RoPE loses high‑frequency fidelity as positions grow distant; p‑RoPE scales frequency bands proportionally between global and local layers to preserve positional quality over 256K tokens.
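The local/global alternation is easy to see in mask form. A toy numpy sketch (window of 4 instead of the real 512/1,024; the alternation pattern itself is illustrative):

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Boolean mask: True where query i may attend to key j.
    window=None -> global causal attention over the whole sequence;
    window=W    -> sliding window, each token sees only its last W tokens
                   (itself included)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if window is None:
        return causal
    return causal & (j > i - window)

# Alternate local and global layers, as Gemma 4 does layer by layer.
layers = [attention_mask(8, window=4 if l % 2 == 0 else None)
          for l in range(4)]
local, glob = layers[0], layers[1]
print(int(local.sum()), int(glob.sum()))  # 26 36
```

Local layers attend to far fewer positions (26 vs. 36 even at sequence length 8), which is where the linear-compute saving comes from; the gap widens rapidly as context grows.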
Approaches to long‑context processing
In early 2026, major models diverge in their approaches to long context:
- Gemma 4: alternating local+global layers (256K). Every layer is attention (no SSM), alternating between local (last 1,024 tokens) and global (entire sequence). Simple, but global layers still incur full‑attention cost.
- Qwen3.5: SSM+Attention hybrid (128K). Qwen3.5‑35B‑A3B uses 30 of 40 layers as SSM (Mamba‑family), so only 10 attention layers consume KV cache. Expanding context from 4K to 64K adds only ~800MB of VRAM; KV growth with context is inherently smaller.
- Mamba‑3: pure SSM (16K+). Mamba‑3 removes attention entirely and runs with a fixed‑size state vector, reaching ~6.9× Transformer speed at 16,384 tokens, at the cost of “forgetting details,” which hampers precise long‑range retrieval.
- Proprietary: brute‑forcing 1M tokens. Claude's 1M context reaches frontier scores on MRCR v2 and is integrated into the standard API with no surcharge. GPT‑5.4 also supports 1M tokens (2× pricing beyond 272K). Cloud‑scale infrastructure makes 1M possible.
Gemma 4’s 256K is ample for open‑weights models but shy of proprietary 1M. Claude/GPT‑5.4 assume cloud inference, whereas Gemma 4’s 256K can run locally. At MRCR v2 128K, 31B scores 66.4% and 26B A4B 44.1%—strong results among open models for long context.
Gemma 4 also reuses Key/Value tensors across the final N layers (KV‑cache sharing) to cut inference memory. Whereas SwiftLM’s TurboQuant and Hypura’s KV‑cache compression reduce bits per entry, Gemma 4 reduces the number of entries by “sharing KV across layers.” Different approach, same acknowledgement that KV cache is the memory bottleneck.
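A back-of-envelope calculation shows why sharing entries helps. The KV-head count, head dimension, and number of shared layers below are illustrative assumptions, not Gemma 4's published configuration:

```python
# Rough KV-cache size: 2 (K and V) * layers * tokens * kv_heads * head_dim
# * bytes per element. All shape numbers here are illustrative assumptions.
def kv_cache_gb(n_layers, tokens, kv_heads=8, head_dim=128, bytes_per=2):
    return 2 * n_layers * tokens * kv_heads * head_dim * bytes_per / 1e9

full = kv_cache_gb(n_layers=30, tokens=256_000)
# If, say, the final 6 layers shared one K/V set, only 25 distinct
# layer caches would remain.
shared = kv_cache_gb(n_layers=25, tokens=256_000)
print(f"{full:.1f} GB -> {shared:.1f} GB")  # 31.5 GB -> 26.2 GB
```

Sharing K/V across layers scales the cache down linearly in the number of deduplicated layers, which is orthogonal to (and stackable with) per-entry quantization approaches like TurboQuant.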
Vision and audio encoders
All models accept vision input. The vision encoder has ~150M parameters in E2B/E4B and ~550M in 26B/31B. Only E4B supports audio input, with an ~300M audio encoder.
Multimodal support compared
| Model | Text | Image | Audio in | Audio out | Video | Design philosophy |
|---|---|---|---|---|---|---|
| Gemma 4 (all) | Yes | Yes | E4B only | No | No | Vision standard at all sizes |
| Qwen3‑Omni | Yes | Yes | Yes | Yes (real time) | Yes | Single model that covers all |
| Kimi K2.5 | Yes | Yes | No | No | Yes | Emphasis on text+vision |
| Nemotron Nano 9B | Yes | No | No | No | No | Text‑first + tools |
Qwen3‑Omni even handles audio output (Talker) and video input, easily out‑scoping Gemma 4 in multimodal breadth. Its audio encoder (AuT) has 650M parameters trained on 20M hours of supervised data, far larger than Gemma 4 E4B’s ~300M.
Nemotron Nano 9B goes the other way, dropping multimodality to focus on text and tool calling. It tops the 10B‑and‑under category on Nejumi Leaderboard 4 by concentrating resources on text.
Gemma 4 sits between them. It’s not as broad as Qwen3‑Omni, nor as text‑only as Nemotron. “Vision standard across all models” is sensible in that even a small model like E2B (5.1B) can understand images. Given Qwen3‑Omni’s 69GB VRAM needs and Kimi K2.5’s 32B active, Gemma 4 is the only realistic choice for edge‑device image understanding.
Gemma 4 supports 140 languages with a shared vocabulary of 262K tokens across all models (slightly up from Gemma 3’s 256K). In contrast to Sarvam, which targets 22 Indian languages and chases a low fertility score with a custom tokenizer, Gemma 4 takes the general multilingual route. Language‑specialized paths (e.g., Nemotron Nano 9B for Japanese) exist, but Gemma 4 competes on breadth: 140 languages.
Real‑world performance
NVIDIA GPU environment
HN comments include some measured throughput reports. On an RTX 4090 (24GB VRAM), 26B A4B is reported at around 150 tokens/s, roughly 50% faster than Qwen3.5‑35B‑A3B's ~100 tokens/s under the same conditions.
Don’t read that gap too quickly as superiority. In the Qwen3.5‑35B‑A3B article, we saw 53 t/s (Q6_K) on Vulkan—but on an AMD Radeon 8060S, not an RTX 4090. Also, Qwen3.5‑35B‑A3B’s SSM+Attention hybrid maintains VRAM and speed very well as context grows; it held 53.6 t/s even at 64K context. We don’t yet have data on how Gemma 4 26B A4B degrades with context.
The 31B dense model outperforms 26B A4B on accuracy but is slower, as expected. Choose 31B for accuracy, or 26B A4B for a speed/accuracy balance.
Apple Silicon environment
Data are still sparse, but some references exist. With Ollama 0.19’s MLX backend, Qwen3.5‑35B‑A3B reaches 112 tokens/s decode on an M5 chip (NVFP4 quantization). Gemma 4 26B A4B is a similar MoE with comparable active parameters (3.8B vs. 3B), so once MLX support lands, we can expect a similar ballpark.
However, Ollama 0.19’s MLX backend is a preview focused mainly on Qwen3.5, and Gemma 4’s new architecture (Dense‑FFN‑parallel MoE) may take time to support. Flash‑MoE got a 397B model onto a 48GB MacBook via SSD streaming, but that also required per‑model Metal shader support.
Maturity of inference frameworks
Some early‑stage issues have been reported. LM Studio and llama.cpp had chat‑template parsing bugs that caused the 31B model to output `---\n` regardless of input. Details like the `<turn|>` EOS token and the recommended sampling settings (`temperature=1.0`, `top_p=0.95`, `top_k=64`) show that existing inference frameworks still need updates to fully support these models.
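Those sampling settings are standard filters, and their effect is easy to illustrate. A generic numpy sketch of temperature, top-k, and top-p (nucleus) filtering, not Gemma 4's actual sampler:

```python
import numpy as np

def sample_filter(logits, temperature=1.0, top_p=0.95, top_k=64):
    """Return (token_ids, probabilities) after applying the reported
    recommended settings: temperature scaling, then top-k, then the
    top-p (nucleus) cutoff. Generic implementation, not Gemma code."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    order = np.argsort(logits)[::-1]            # most likely tokens first
    keep = order[:top_k]                        # top-k: keep best 64
    probs = np.exp(logits[keep] - logits[keep].max())
    probs /= probs.sum()
    cum = np.cumsum(probs)
    cutoff = np.searchsorted(cum, top_p) + 1    # smallest nucleus >= top_p
    keep, probs = keep[:cutoff], probs[:cutoff]
    probs /= probs.sum()                        # renormalize survivors
    return keep, probs

tokens, probs = sample_filter(np.random.randn(1000))
print(len(tokens), round(float(probs.sum()), 6))
```

With `temperature=1.0` the logits pass through unscaled, so in practice the recommendation amounts to "rely on top-k=64 and top-p=0.95 to trim the tail."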
This happens with every new architecture. Sarvam 30B likewise waited on a llama.cpp merge for its `sarvam_moe` architecture. Tools that use llama.cpp under the hood, like AMD Lemonade, end up waiting on upstream. Unsloth (Daniel Han) published quantized variants within hours, but broader framework stability will take somewhat longer.
Edge deployment with E2B/E4B
E2B and E4B target fully offline inference on phones and single‑board computers; Raspberry Pi and Jetson Nano support is explicitly called out.
Edge‑model options
| Model | Parameters | Vision | Audio | Strengths |
|---|---|---|---|---|
| Gemma 4 E2B | 5.1B (effective 2.3B) | Yes | No | Smallest across modalities; includes vision |
| Gemma 4 E4B | 8B (effective 4.5B) | Yes | Yes (input only) | Offline speech+vision processing |
| Nemotron Nano 9B | 9B | No | No | Best Japanese; strong tool use |
| Holotron‑12B | 12B | Yes | No | PC‑operation specialist; 8,900 t/s on H100 |
Gemma 4 E2B/E4B’s edge differentiator is “vision at the smallest size.” Nemotron Nano 9B focuses on text+tools and leads the ≤10B class on Nejumi Leaderboard 4, but lacks vision. If you don’t need images (pure text agents), Nemotron wins; if you need to handle phone‑camera input, Gemma 4 is the only option.
Holotron‑12B is a Nemotron‑based Transformer‑Mamba hybrid specialized for PC control. It’s very fast (8,900 tokens/s on H100) but, at 12B parameters, not a smartphone target—think edge/server GPUs instead.
Memory constraints in practice
Memory is the main constraint for edge LLMs. SwiftLM ran Qwen3 1.7B on an iPhone 13 Pro (6GB), but Gemma 4 E2B is 5.1B (2.3B active) with vision, which demands more memory.
A mini‑PC with 64GB unified memory like EVO‑X2 (Strix Halo) is comfortable, but 6–8GB phones are tight; practicality will depend on quantization and inference‑framework optimizations. As AMD Lemonade showed—its NPU inference (up to 60 TOPS) only made small models truly practical—edge inference still runs into hard compute ceilings.
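The weight footprint alone shows how tight this gets. A back-of-envelope estimate (weights only; KV cache, activations, and the vision encoder's working memory come on top):

```python
# Back-of-envelope weight footprint: params * bits / 8 bits-per-byte.
# Ignores KV cache and activations, which add several hundred MB or more.
def weights_gb(params_billions, bits):
    return params_billions * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"E2B (5.1B) at {bits}-bit: {weights_gb(5.1, bits):.2f} GB")
```

Even a 4-bit quant of E2B's 5.1B total parameters is about 2.6GB of weights, which is why a 6GB phone is tight once the OS, KV cache, and vision encoder are accounted for.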
E4B’s audio encoder enables on‑device speech recognition + text responses. HN commenters note that complex tasks like SVG generation hit the limits of the 2B/4B classes; these are best used for “understand vision+audio inputs,” while richer text generation still favors Nemotron 9B or Qwen3‑8B.
Tool calling and reasoning modes
All models support Function Calling (tool use) and a Thinking mode (chain of reasoning). Thinking consumes tokens but can raise math and coding scores substantially.
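Function calling generally works by handing the model JSON tool schemas and parsing its structured call back out, with the runtime (not the model) executing the call. A generic sketch; the schema format and tool name here are illustrative, not Gemma 4's exact prompt format:

```python
import json
from datetime import datetime, timezone

# An OpenAI-style JSON tool schema, a format many open models accept.
# The exact schema Gemma 4 expects may differ; this is illustrative.
get_time_tool = {
    "name": "get_unix_time",
    "description": "Return the Unix timestamp for an ISO-8601 datetime (UTC).",
    "parameters": {
        "type": "object",
        "properties": {"iso": {"type": "string"}},
        "required": ["iso"],
    },
}

def dispatch(model_output: str) -> int:
    """Parse a model's JSON tool call and run the matching function.
    Executing the call in the runtime is what prevents the
    hallucinated-result failure mode: the model only ever sees real
    return values, never its own guesses."""
    call = json.loads(model_output)
    if call["name"] == "get_unix_time":
        dt = datetime.fromisoformat(call["arguments"]["iso"])
        return int(dt.replace(tzinfo=timezone.utc).timestamp())
    raise ValueError(f"unknown tool: {call['name']}")

# Simulated model output requesting a tool call:
result = dispatch(
    '{"name": "get_unix_time", "arguments": {"iso": "2026-01-01T00:00:00"}}')
print(result)  # 1767225600
```

The failure mode reported below is the model skipping this loop: emitting a call and then inventing the return value instead of waiting for the runtime to supply it.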
Agent capabilities compared
HN users tested timestamp calculations with 26B A4B and saw it attempt tool calls and then hallucinate the results. On the same task, Qwen3.5 computed the correct answer manually; tool‑calling reliability is still maturing.
| Model | Agent results |
|---|---|
| Kimi K2.5 | PARL (parallel‑agent RL) hits 78.4% on BrowseComp; Agent Swarm coordinates 100 sub‑agents |
| Cursor Composer 2 | Kimi K2.5 + coding‑RL; 61.3% on CursorBench, close to GPT‑5.4 Thinking (63.9%) |
| Sarvam 105B | 68.3% on the Tau2 agent benchmark |
| Nemotron Nano 9B | Best ≤10B in Nejumi tool‑use |
| Gemma 4 | Tool‑calling capable; reliability TBD |
Kimi K2.5’s PARL (Parallel‑Agent RL) coordinates 100 sub‑agents to parallelize complex browsing tasks. Cursor Composer 2 stacks coding‑specific RL on Kimi K2.5 to approach GPT‑5.4 Thinking on coding.
Gemma 4 supports tool use as a base model, but these specialized agent systems are further along. The potential is clear from benchmarks, yet agent‑use evaluations are still to come. Given Apache 2.0, Gemma 4 may realistically serve as a base for third‑party RL (like Composer) rather than shipping a fully tuned agent itself.
Distribution and licensing
Distribution channels include Hugging Face, Ollama, Kaggle, LM Studio, and Docker. Training frameworks include JAX, Keras, Vertex AI, and Google Kubernetes Engine. The license is Apache 2.0—a major shift from Gemma 3’s custom license. Commercial use is unrestricted and publishing derivatives is allowed.
| Model | License | Derivative restrictions |
|---|---|---|
| Gemma 4 | Apache 2.0 | None |
| Kimi K2.5 | MIT | None |
| Qwen3‑Coder‑Next | Apache 2.0 | None |
| Sarvam 30B/105B | Open source | None |
| Qwen3.5 | Apache 2.0 | None |
| Nemotron Nano 9B | NVIDIA Open Model License | Some restrictions |
Moving from Gemma 3’s custom license to Apache 2.0 is a clear signal lowering the barrier to commercial use. Kimi K2.5 is the most permissive (MIT). Qwen models and Gemma 4 align on Apache 2.0. NVIDIA’s Nemotron uses NVIDIA’s own license, which is less free than Apache/MIT.
Cursor Composer 2 shows how permissive licensing fosters derivative ecosystems: it applied RL to Kimi K2.5 (MIT) as a base. With Gemma 4 under Apache 2.0, expect derivatives specialized for coding or Japanese to emerge.
Training data cutoff: January 2025.
The open‑weight MoE landscape
In early 2026, the open‑weights MoE market is crowded, and design philosophies stand out:
Qwen: task‑specific MoE lineup. General text (Qwen3.5‑35B‑A3B), coding (Qwen3‑Coder‑Next), and omni‑modal (Qwen3‑Omni)—specialized models built on a shared MoE framework. An SSM+Attention hybrid pursues memory efficiency for long context. The Ollama MLX backend also targeted Qwen3.5 first, giving it an inference‑ecosystem head start.
Kimi: brute‑force giant MoE. K2.5 aims for benchmark leadership with 32B active out of 1T total. AIME 96.1% and BrowseComp 60.2% show that piling on parameters works. As the base for Cursor Composer 2, it’s now the “best starting point for RL.”
Sarvam: region focus + inference efficiency. 30B/105B specialize in 22 Indian languages and stabilize training with sigmoid routing. On H100, measured throughput hits 3–6× the Qwen3 baseline—aggressively optimized for inference.
NVIDIA: text focus + language specialization. Nemotron Nano 9B is a Transformer‑Mamba hybrid that leads Japanese performance, with up to 6× the throughput of similar‑size open models. The “9B is enough” philosophy competes directly with Gemma 4 E2B/E4B for the edge.
Gemma 4: stability via Dense FFN alongside MoE. Running Dense FFN in parallel with MoE makes routing failures less harmful. Active parameters are somewhat higher (3.8B), but Gemma 4 tops Qwen3‑Omni in math (AIME 88.3%) and beats Sarvam 30B in coding. Launching four sizes at once, it covers edge to desktop—a clear counter to Qwen’s task‑specific lineup.
No one beats Kimi K2.5 on math, and no one beats Gemma 4 E2B on edge vision. For raw inference efficiency look to Sarvam; for Japanese to Nemotron; for multimodal breadth to Qwen3‑Omni. Gemma 4 differentiates by balanced strength across axes and a coherent lineup that carries one design from E2B up to 31B.