
Gemma 4 12B Unified: 35M linear projection replaces 150M 16-layer Vision Encoder

Ikesan

Gemma 4 12B Unified, unlike E2B/E4B/31B in the same family, has neither a Vision Encoder nor an Audio Encoder.
The image processing that required a 150M-parameter 16-layer Transformer in E4B is reduced to a single 35M linear projection in 12B Unified.
Patch-level feature extraction is absorbed by bidirectional attention across the LLM body’s 48 layers.

This is an encoder-free VLM design explored by Adept Fuyu (2023) and EVE (NeurIPS 2024), though each hit different problems.
This post traces what Gemma 4 12B Unified kept and what it changed, using prior encoder-free research as reference points.

E4B’s image processing

E4B passes images through a dedicated Vision Encoder before feeding them to the LLM, the same two-stage architecture as CLIP/SigLIP.

// E4B Vision Encoder (config.json excerpt)
{
  "hidden_size": 768,
  "num_hidden_layers": 16,
  "num_attention_heads": 12,
  "num_key_value_heads": 12,
  "head_dim": 64,
  "patch_size": 16,
  "pooling_kernel_size": 3,
  "hidden_activation": "gelu_pytorch_tanh"
}

Images are first split into 16x16-pixel patches and downsampled by a 3x3 pooling kernel.
The resulting patch sequence passes through a 16-layer Transformer (hidden_size 768).

What those 16 layers do is self-attention-based feature extraction across patches.
Early layers capture local edges and textures; deeper layers build part-level and spatial relationships spanning multiple patches.
As analyses of CLIP ViT-L/14 have shown, Vision Transformers build low-level visual features in earlier layers and high-level semantic features in later layers.
2D RoPE gives each patch XY coordinate information, letting patches maintain spatial relationships with their neighbors.

This processing uses 150M parameters. The output is projected into the LLM’s embedding space (2,560 dimensions) via a projector layer and concatenated with text tokens as LLM input.
For 31B Dense, the same architecture inflates the Vision Encoder to roughly 550M parameters.

12B Unified’s image processing

12B Unified has no 16-layer Transformer.
The architecture class is Gemma4UnifiedForConditionalGeneration, distinct from E4B’s Gemma4ForConditionalGeneration.

// 12B Unified Vision Embedder (config.json excerpt)
{
  "model_type": "gemma4_unified_vision",
  "mm_embed_dim": 3840,
  "output_proj_dims": 3840,
  "patch_size": 16,
  "pooling_kernel_size": 3,
  "mm_posemb_size": 1120,
  "num_soft_tokens": 280
}

The entry point is the same as E4B: split into 16x16 patches and apply 3x3 pooling, producing effective 48x48-pixel super-patches.
From here, things diverge.

Each super-patch is flattened into a 48x48x3 = 6,912-dimensional vector.
This is linearly projected by a single [6,912, 3,840] weight matrix. Parameter count is roughly 26.5M.
No Transformer layers, no attention. Each patch is independently mapped from raw pixel values to the LLM’s hidden space (3,840 dimensions).
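The patch-to-embedding step described above can be sketched in NumPy. This is a minimal illustration, not the Gemma 4 implementation: the function name `embed_patches`, the random weights, and the reading of 3x3 pooling as "group neighboring 16x16 patches into one 48x48 super-patch" are assumptions made for the example.

```python
import numpy as np

SUPER_PATCH = 48                          # 16-pixel patches grouped 3x3 (assumed)
IN_DIM = SUPER_PATCH * SUPER_PATCH * 3    # 6,912 raw pixel values per super-patch
OUT_DIM = 3840                            # LLM hidden size

def embed_patches(image: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Flatten non-overlapping 48x48 super-patches and project each independently."""
    h, w, c = image.shape
    assert h % SUPER_PATCH == 0 and w % SUPER_PATCH == 0 and c == 3
    rows, cols = h // SUPER_PATCH, w // SUPER_PATCH
    # (rows, 48, cols, 48, 3) -> (rows, cols, 48, 48, 3) -> (rows*cols, 6912)
    patches = image.reshape(rows, SUPER_PATCH, cols, SUPER_PATCH, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, IN_DIM)
    # One matmul, no attention: no patch sees its neighbors at this stage.
    return patches @ weight

rng = np.random.default_rng(0)
weight = rng.standard_normal((IN_DIM, OUT_DIM), dtype=np.float32)  # hypothetical weights
image = rng.random((96, 144, 3), dtype=np.float32)                 # 2 x 3 super-patches
tokens = embed_patches(image, weight)
print(tokens.shape)   # (6, 3840)
```

Because each row of `patches` is multiplied by the same matrix, changing the pixels of one super-patch changes only that super-patch's output token.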

After projection, a factorized positional embedding is added.
Two lookup tables for X-axis [1,120, 3,840] and Y-axis [1,120, 3,840] provide vectors corresponding to each patch’s XY coordinates.
Combined parameter count is roughly 8.6M.
This is similar in concept to the factorized RoPE adopted by Qwen2-VL, decomposing 2D spatial position information into X and Y axes for efficient representation.
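A factorized lookup of this kind might look like the sketch below. The table shapes come from the config excerpt; combining the two axes by summation, the random table contents, and the name `positional_embedding` are assumptions, since the actual combination rule is not stated here.

```python
import numpy as np

POS_SIZE = 1120   # mm_posemb_size
HIDDEN = 3840     # LLM hidden size

rng = np.random.default_rng(0)
x_table = rng.standard_normal((POS_SIZE, HIDDEN), dtype=np.float32)  # hypothetical values
y_table = rng.standard_normal((POS_SIZE, HIDDEN), dtype=np.float32)

def positional_embedding(xs: np.ndarray, ys: np.ndarray) -> np.ndarray:
    """Look up one vector per axis and sum them (the sum is an assumption)."""
    return x_table[xs] + y_table[ys]

# Coordinates for a 2-row x 3-column patch grid in raster-scan order.
ys, xs = np.divmod(np.arange(6), 3)
pos = positional_embedding(xs, ys)
print(pos.shape)                       # (6, 3840)
print(x_table.size + y_table.size)     # 8601600
# A full 2D table over the same coordinate range would need this many entries:
print(POS_SIZE * POS_SIZE * HIDDEN)    # 4816896000
```

The last two numbers show why the factorization matters: two axis tables cost about 8.6M parameters, while one table indexed by (x, y) pairs would cost about 4.8B.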

LayerNorm is applied, and the result is directly mixed into the LLM’s input token sequence.
Total parameter count is roughly 35M: 26.5M for the projection matrix + 8.6M for positional embeddings + LayerNorm and others.
That is 4.3x smaller than E4B’s 150M and 8.6x smaller than CLIP ViT-L/14 (303M).
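The arithmetic behind these figures can be checked in a few lines. The LayerNorm term assumes one scale and one bias vector of size 3,840, which is a guess at what "LayerNorm and others" contains.

```python
IN_DIM = 48 * 48 * 3      # 6,912 pixel values per super-patch
HIDDEN = 3840
POS_SIZE = 1120

projection = IN_DIM * HIDDEN          # 26,542,080
positional = 2 * POS_SIZE * HIDDEN    # 8,601,600 (X table + Y table)
layernorm = 2 * HIDDEN                # scale + bias (assumed)
total = projection + positional + layernorm

print(f"{total / 1e6:.1f}M")                      # 35.2M
print(f"{150e6 / total:.1f}x vs E4B")             # 4.3x vs E4B
print(f"{303e6 / total:.1f}x vs CLIP ViT-L/14")   # 8.6x vs CLIP ViT-L/14
```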

graph LR
    subgraph encoder["E4B / 31B Dense"]
        A1["16x16 patches<br/>+ 3x3 pooling"] --> B1["Vision Encoder<br/>16-layer Transformer<br/>150M params"]
        B1 --> C1["Project to LLM"]
    end
    subgraph free["12B Unified"]
        A2["16x16 patches<br/>+ 3x3 pooling"] --> B2["Linear projection<br/>single matmul<br/>35M params"]
        B2 --> C2["Direct LLM input"]
    end

Image token budget

The number of image tokens is controlled by num_soft_tokens.
The default is 280 tokens, selectable from five levels: 70/140/280/560/1,120.

| Soft tokens | Pre-pooling patches | Approximate image area |
|---|---|---|
| 70 | 630 | ~161K pixels |
| 140 | 1,260 | ~323K pixels |
| 280 (default) | 2,520 | ~645K pixels |
| 560 | 5,040 | ~1.3M pixels |
| 1,120 | 10,080 | ~2.6M pixels |

Higher resolution means more token consumption.
Removing the encoder does not make image input free; KV cache and compute scale with token count.
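The token budget figures follow directly from the patch size and pooling kernel. A small sketch (the helper name `budget` is made up for this example):

```python
PATCH = 16    # patch_size
POOL = 3      # pooling_kernel_size

def budget(soft_tokens: int) -> tuple[int, int]:
    """Return (pre-pooling patches, approximate pixel area) for a token budget."""
    patches = soft_tokens * POOL * POOL
    pixels = patches * PATCH * PATCH
    return patches, pixels

for n in (70, 140, 280, 560, 1120):
    patches, pixels = budget(n)
    print(n, patches, pixels)
```

Each soft token covers 9 patches of 256 pixels, so doubling the budget doubles both the covered image area and the number of tokens the LLM has to attend over.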

What linear projection loses, what bidirectional attention compensates

When a 16-layer Transformer is replaced by a single linear projection, what is lost?

One thing is inter-patch self-attention.
E4B’s encoder spends 16 layers exchanging information between patches, building spatial relationships like “the patch to the right is an eye, the one below is a mouth.”
With linear projection, each patch is completely independent, mapped from 6,912 raw pixel dimensions to 3,840 LLM dimensions.
No information from neighboring patches enters.

Another is hierarchical visual feature extraction.
Vision Transformers progressively increase abstraction across layers: edges, textures, object parts, whole objects.
A linear projection is a single-step linear function with no nonlinearity and no residual connections.
It handles the conversion from raw pixel patterns to LLM-understandable representations in one linear map.

The mechanism that compensates for this loss in 12B Unified is bidirectional attention, enabled by use_bidirectional_attention: "vision".

Decoder-only Transformers normally compute attention with a causal mask (left-to-right only).
This is correct for text generation, but image patches have no causal relationship.
When laid out in raster-scan order, the top-left patch cannot see any other patches, while only the bottom-right patch can reference all patches, creating an asymmetric situation.

In 12B Unified, the causal mask is removed for image tokens during the attention computation in the LLM body’s 48 layers, allowing all patches to reference all patches.
Text tokens retain the normal causal mask.
In the Hugging Face Transformers implementation, this is the behavior selected by use_bidirectional_attention: "vision".
Setting it to "all" makes all tokens bidirectional, and sliding_window changes to (sliding_window // 2) + 1.
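The "vision" masking rule can be illustrated with a toy mask builder. This is a sketch of the idea, not the Transformers code: `build_mask` is a hypothetical helper, it assumes a single image in the sequence, and it ignores sliding-window layers.

```python
def build_mask(is_image: list[bool]) -> list[list[bool]]:
    """mask[i][j] is True when query token i may attend to key token j."""
    n = len(is_image)
    mask = [[j <= i for j in range(n)] for i in range(n)]   # causal baseline
    for i in range(n):
        for j in range(n):
            # Image tokens additionally see every other image token.
            if is_image[i] and is_image[j]:
                mask[i][j] = True
    return mask

# Two text tokens, three image tokens, one text token.
layout = [False, False, True, True, True, False]
for row in build_mask(layout):
    print("".join("x" if ok else "." for ok in row))
# x.....
# xx....
# xxxxx.
# xxxxx.
# xxxxx.
# xxxxxx
```

The three image rows are identical, which is the point: the first image patch sees the same context as the last one, while text rows keep the usual lower-triangular shape.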

The patch-level attention that the Vision Encoder handled across its dedicated 16 layers is offloaded to the LLM body’s 48 layers.
However, the LLM’s attention layers apply the same weights to image patches that they use for language, so this is not fully equivalent to a dedicated Vision Encoder.
EVEv2 (discussed below) addressed this by separating Q/K/V projection matrices per modality, but Gemma 4 12B Unified does not go that far.

The benchmark impact is visible in MMMU Pro.
12B Unified scores 69.1%, while the encoder-equipped 31B Dense scores 76.9%. The gap is about 8 points.
On the text-side MMLU Pro, the scores are 77.2% vs 85.2%, also about 8 points apart, suggesting the gap is dominated by parameter scale rather than encoder-free-specific vision degradation.

For VLM spatial reasoning in general, “Spatial Blindspot of VLMs” (2025, arXiv:2601.09954) reports accuracy around 50-60% regardless of architecture, indicating this is not an encoder-free-only problem.

Audio is also encoder-free

E4B’s audio processing uses a 12-layer conformer encoder (hidden_size 1,024, 8 heads) with roughly 300M parameters.
It extracts 128-dimensional mel spectrograms, processes them through conformer layers, and compresses the time axis by 4x via subsampling convolutions before feeding the result to the LLM.

12B Unified slices raw 16kHz waveforms into 40-millisecond frames (640 samples) and linearly projects them.

// 12B Unified Audio Embedder (config.json excerpt)
{
  "model_type": "gemma4_unified_audio",
  "audio_embed_dim": 640,
  "audio_samples_per_token": 640,
  "hidden_size": 640,
  "output_proj_dims": 640
}

No mel spectrogram extraction, no conformer layers. Raw amplitude values are projected directly.
Maximum supported audio length is 30 seconds.
Positional information is handled by the LLM body’s RoPE (since this is a 1D time series, no two-axis factorized decomposition is needed).
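The framing arithmetic can be sketched as follows. The helper `frame_waveform` and the zero-padding of the final frame are assumptions for illustration; how the real preprocessor handles a partial last frame is not stated.

```python
SAMPLE_RATE = 16_000
SAMPLES_PER_TOKEN = 640      # audio_samples_per_token

def frame_waveform(samples: list[float]) -> list[list[float]]:
    """Split a mono waveform into 640-sample frames, zero-padding the last one."""
    frames = []
    for start in range(0, len(samples), SAMPLES_PER_TOKEN):
        frame = samples[start:start + SAMPLES_PER_TOKEN]
        frame += [0.0] * (SAMPLES_PER_TOKEN - len(frame))
        frames.append(frame)
    return frames

frame_ms = 1000 * SAMPLES_PER_TOKEN / SAMPLE_RATE
max_tokens = 30 * SAMPLE_RATE // SAMPLES_PER_TOKEN
print(frame_ms, max_tokens)                        # 40.0 750
print(len(frame_waveform([0.0] * SAMPLE_RATE)))    # 25
```

At 25 frames per second, the 30-second limit corresponds to 750 audio tokens entering the LLM.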

31B Dense does not support audio, so 12B Unified is the only Gemma 4 model that handles all three modalities (text, image, and audio) without dedicated encoders; E4B also covers all three, but through its Vision and Audio Encoders.

Encoder-free VLM lineage

Encoder-free VLM design did not start with Gemma 4.
Multiple research efforts since 2023 have tried this approach, each hitting different problems and arriving at different solutions.

Fuyu-8B (Adept, October 2023)

The first practical encoder-free VLM.
Based on Persimmon-8B (a 36-layer decoder-only Transformer), it flattens 30x30-pixel patches and projects them to the LLM embedding space with a single [2,700, 4,096] linear projection.
The projection layer has roughly 11M parameters.

Fuyu’s distinguishing feature was handling arbitrary resolutions directly.
Patches are laid out in raster-scan order with a special image-newline token inserted at the end of each row to mark line breaks.
No image-specific positional encoding was used; the LLM’s standard text positional encoding was applied as-is.
Inter-patch attention was also left causal (left-to-right only).
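Fuyu's layout makes the image sequence length easy to compute. The sketch below assumes images are padded up to a multiple of the patch size, and `fuyu_sequence_length` is a name invented for this example.

```python
import math

PATCH = 30   # Fuyu patch edge in pixels

def fuyu_sequence_length(height: int, width: int) -> int:
    """Patches in raster order plus one image-newline token per patch row."""
    rows = math.ceil(height / PATCH)
    cols = math.ceil(width / PATCH)
    return rows * (cols + 1)

print(PATCH * PATCH * 3)                 # 2700 inputs per patch
print(PATCH * PATCH * 3 * 4096)          # 11059200 projection weights
print(fuyu_sequence_length(300, 600))    # 210 = 10 rows x (20 patches + 1 newline)
```

The newline token is the only 2D signal the model gets: row boundaries are marked, but a patch's column has to be inferred from its distance to the previous newline.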

On benchmarks, Fuyu scored 74.2% on VQA-v2, but collapsed to 10.7% on MMBench for compositional reasoning (LLaVA-1.5 scored 64.3% at the same time).
A design relying on text positional embeddings for spatial information with only causal inter-patch attention had clear limits in spatial visual reasoning.

EVE (NeurIPS 2024) and training collapse

This work improved on Fuyu’s approach and showed how to stabilize encoder-free training.
Based on Vicuna-7B, it adds two components: PEL (Patch Embedding Layer) and PAL (Patch Aligning Layer).

PEL consists of stride-14 convolution, stride-2 average pooling, two-stage cross-attention (local receptive field + global CLS token), and a 2-layer FFN.
Learnable <SPL> tokens are inserted at each row’s end to maintain the 2D layout, similar to Fuyu’s image-newline approach.

EVE’s central finding is that encoder-free training collapses without staged learning.

A randomly initialized projection layer sends image patches as near-noise to the LLM.
Since the LLM’s parameters are tuned for text processing, noise from images causes image-loss gradients to corrupt text performance while text-loss gradients hinder image adaptation.
This gradient interference accumulates across training steps, collapsing the entire model.

In Stage 1 (LLM-guided pre-alignment), only PEL and PAL are pre-trained on 16M image-text pairs with the LLM weights frozen.
Learning rate 4e-4, batch size 512, roughly 2 days on 16 A100s.
Only after the projection layer is reasonably aligned with the LLM’s embedding space does Stage 2 unfreeze the LLM weights.

The EVE paper’s ablation shows what happens without this stage.
Doubling the training data from 4M to 8M while skipping Stage 1 caused VQA-v2 to collapse from 64.6% to 50.2%.
Accuracy dropped 14 points despite doubling the data.
More data means more gradient steps, amplifying gradient interference.
Sweeping learning rates from 2e-5 to 1e-3 did not stabilize training without the freezing stage.

Another key finding is PAL (Patch Aligning Layer) distillation.
PAL applies an MSE loss during training to push the projection layer’s output toward the output of a frozen CLIP ViT-L/14.
Features are extracted from every L/4th layer of EVE’s internal Transformer, reshaped into 2D spatial arrangement after removing CLS and padding tokens, and the distance to the CLIP output is minimized.
CLIP ViT-L/14 is not used at inference. Encoder visual knowledge is only indirectly absorbed during training.
This improved VQA-v2 by roughly 3%.
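The core of the distillation signal is an ordinary mean squared error between two feature maps. The sketch below is heavily simplified: it omits PAL's own learned layers and the multi-layer feature selection, and the 24x24 grid with 1,024 channels is a stand-in shape, not a value taken from the paper.

```python
import numpy as np

def pal_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Mean squared error between aligned patch features of equal shape."""
    assert student.shape == teacher.shape
    return float(np.mean((student - teacher) ** 2))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((24 * 24, 1024))   # stand-in for frozen CLIP patch features
student = teacher + 0.1 * rng.standard_normal(teacher.shape)

print(pal_loss(teacher, teacher))   # 0.0
print(pal_loss(student, teacher))   # small positive value
```

Only the student side receives gradients in training; the teacher features are treated as fixed targets, which is why CLIP can be dropped at inference.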

Training uses three stages in total: Stage 1 (pre-alignment, 16M, PEL+PAL only, LLM frozen), then Stage 2 (generative pre-training, 33M, all parameters), then Stage 3 (SFT, 665K-1.8M, all parameters).
Roughly 9 days total on 2x 8-A100 (40GB) nodes.
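The schedule is compact enough to write down as data. The `Stage` class and field names below are illustrative only, and the SFT entry uses the upper end of the 665K-1.8M range.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    images_m: float     # training samples, in millions
    llm_frozen: bool    # True when only PEL and PAL are updated

EVE_STAGES = (
    Stage("pre-alignment", 16.0, llm_frozen=True),
    Stage("generative pre-training", 33.0, llm_frozen=False),
    Stage("SFT", 1.8, llm_frozen=False),
)

for s in EVE_STAGES:
    print(f"{s.name}: {s.images_m}M samples, LLM frozen={s.llm_frozen}")

total = sum(s.images_m for s in EVE_STAGES)
print(f"total: about {total:.0f}M")   # total: about 51M
```

The key property is the ordering: the LLM is unfrozen only after the projection side has been trained against it.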

Final results: VQA-v2 75.4% (LLaVA-1.5: 78.5%), MMBench 49.5% (LLaVA-1.5: 64.3%).
A massive improvement from Fuyu’s 10.7% on MMBench, but a 15-point gap to encoder-equipped models remained.

EVEv2 (ICCV 2025 Highlight)

EVEv2 introduced a Divide-and-Conquer design that separates Transformer layer internals per modality.

Q/K/V projection matrices, output projection matrices, LayerNorm, and FFN all have separate weights for vision tokens and text tokens.
In equation form, given token $x_i$ with modality $u_i \in \{v, t\}$, the projections are $Q_i = x_i W_Q^{(u_i)}$, $K_i = x_i W_K^{(u_i)}$, $V_i = x_i W_V^{(u_i)}$.
Attention itself is computed once over the combined sequence of vision and text tokens, so cross-modal information exchange is preserved.
No MoE-style routing learning is needed; the branching is static by modality type.
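A toy version of modality-specific projections with shared attention might look like this. It is a sketch of the idea rather than EVEv2's code: the hidden size of 8, the random weights, and the function names are made up, and masking, multiple heads, and the separate output projection, LayerNorm, and FFN are left out.

```python
import numpy as np

D = 8                                     # toy hidden size
rng = np.random.default_rng(0)
# One weight matrix per modality: index 0 = vision, 1 = text.
W_Q = rng.standard_normal((2, D, D))
W_K = rng.standard_normal((2, D, D))
W_V = rng.standard_normal((2, D, D))

def project(x: np.ndarray, modality: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Apply W[modality[i]] to token x[i]; routing is static, not learned."""
    return np.einsum("nd,nde->ne", x, W[modality])

def attention(x: np.ndarray, modality: np.ndarray) -> np.ndarray:
    q, k, v = (project(x, modality, W) for W in (W_Q, W_K, W_V))
    scores = q @ k.T / np.sqrt(D)         # one score matrix over all tokens
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = rng.standard_normal((5, D))
modality = np.array([0, 0, 0, 1, 1])      # three vision tokens, two text tokens
out = attention(x, modality)
print(out.shape)                           # (5, 8)
```

The weights differ per modality, but `scores` is a single 5x5 matrix, so a text query can still attend to vision keys and vice versa.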

Patch embedding uses a two-stage design: stride-16 convolution, GELU, stride-2 convolution, handling up to 2.5 million pixels (~2,500 patch tokens).
The distillation approach also changed from EVEv1: MSE distillation from CLIP was dropped in favor of using high-quality synthetic captions generated by DenseFusion++ as training data.

Training expanded to four stages: Stage 1 (patch embedding only, EVE-recap-10M), then Stage 2.1 (patch embedding + vision-specific layers, 77M), then Stage 2.2 (all layers, 5M), then Stage 3 (SFT, 7M).
The LLM freeze period is longer than in EVEv1, with only vision-side parameters updated through Stage 2.1.
Total data volume is roughly 100 million images.

At 7B scale: MMBench 66.3%, OCRBench 702.
The encoder-equipped LLaVA-1.6 scores MMBench 67.4% and OCRBench 532, so the MMBench gap narrowed to 1.1 points and OCR performance significantly exceeded the encoder-equipped model.

Mono-InternVL (CVPR 2025)

Mono-InternVL embeds vision-specific FFN experts inside the Transformer layers.
Vision tokens are statically routed to FFN_v, text tokens to FFN_t.
The concept is close to EVEv2’s Divide-and-Conquer, but Mono-InternVL separates only the FFN while sharing the attention side.

The training scale is an order of magnitude larger. EViP (Endogenous Visual Pre-training), a three-stage pre-training process, uses roughly 1.3 billion images total, trained on 256 A100s for 16 days.
Compared to EVE’s total of roughly 50 million images, that is about 26x the data volume.
Throughout all pre-training phases, the LLM body is frozen, training only the projection layer and modality-specific FFN. All parameters are unfrozen only at the SFT stage.
In their experiments, training all LLM parameters from the start yielded the worst performance.
Freeze + delta-tuning outperformed by +18.8% on SQA-I and +16.1% on AI2D, confirming EVE’s finding that early LLM unfreezing causes collapse even with massive data.

Even at 1.8B scale, accuracy matched the encoder-equipped InternVL-1.5-2B, making it the first case where an encoder-free model matched an encoder-equipped model of similar size on accuracy.

What 12B Unified chose

Gemma 4 12B Unified’s design takes Fuyu’s linear projection as a base, reinforced with bidirectional attention and factorized positional embedding.

E4B has 42 layers with hidden_size 2,560, while 12B Unified expands to 48 layers with hidden_size 3,840.
The Transformer side is larger to absorb the visual feature extraction that the Vision Encoder used to handle.
As EVEv2 and Mono-InternVL demonstrated, encoder-free designs need extra LLM capacity to match encoder-equipped accuracy.

EVEv2’s Divide-and-Conquer (per-modality Q/K/V separation) and Mono-InternVL’s vision-specific FFN experts are not adopted.
The existing LLM layers are shared as-is, with only bidirectional attention as compensation.
Architecturally, the design stays closer to Fuyu’s simplicity.

The training pipeline is unpublished.
As of June 2026, Google has not released a Technical Report for Gemma 4. Training data, whether pre-alignment was performed, whether encoder distillation was used, and ablation results are all unknown.
Gemma 3 had a report (arXiv:2503.19786) documenting training via knowledge distillation from Gemini 2 (sampling 256 logits per token, weighted by teacher probabilities for cross-entropy loss), but Gemma 4 lacks even that.
Since encoder-equipped variants (E4B, 31B) were released simultaneously, Google does have trained Vision Encoder weights in-house. Those weights could serve as a teacher signal for the projection layer, in the same role the frozen CLIP plays in EVE’s PAL, but whether they were used that way is not public.

TTFT and memory

Going encoder-free shrinks the model’s loaded size.
E4B’s Vision Encoder (150M) and Audio Encoder (300M), 450M parameters in total, disappear and are replaced by roughly 35M in projection layers.
The net saving of about 415M parameters is about 800MB in bfloat16 or about 200MB in 4-bit quantization.

With a 12B LLM body, the impact on total loaded size is a few percent, but on memory-constrained edge devices like Raspberry Pi or Jetson Nano, 800MB is not negligible.
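The memory figures can be reproduced from the parameter counts. This sketch counts weights only and ignores quantization metadata such as per-block scales, so real 4-bit files will be somewhat larger.

```python
removed = 150e6 + 300e6    # E4B vision + audio encoder parameters
added = 35e6               # 12B Unified projection layers
net = removed - added      # 415M parameters saved

for name, bytes_per_param in (("bfloat16", 2.0), ("4-bit", 0.5)):
    print(f"{name}: {net * bytes_per_param / 1e6:.0f} MB")
print(f"share of a 12B model: {net / 12e9:.1%}")   # share of a 12B model: 3.5%
```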

EVE’s paper provides measured TTFT (Time-to-First-Token) data.

| Model | Vision FLOPs | Vision processing time | Total inference time |
|---|---|---|---|
| LLaVA-1.5 (CLIP ViT-L) | 372G | 0.033s | 0.48s |
| EVE-7B (encoder-free) | 42G | 0.003s | 0.48s |
| LLaVA-1.6 HD | 1,860G | 0.13s | 2.07s |
| EVE-7B HD | 170G | 0.013s | 1.52s |

At standard resolution, Vision FLOPs drop by 9x, but total inference time is 0.48s for both.
The LLM body’s forward pass dominates, making the 0.033s-to-0.003s Vision Encoder savings negligible in the total.

The difference appears at high resolution.
At HD resolution, LLaVA-1.6’s Vision FLOPs balloon to 1,860G versus EVE-7B HD’s 170G, an 11x gap.
As patch count increases, Vision Encoder processing cost grows nonlinearly, giving encoder-free an advantage at high resolutions.
Total inference time also drops from 2.07s to 1.52s, a 27% reduction.
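The ratios quoted here come straight from the table. A short check, with the table transcribed by hand and `compare` as a made-up helper:

```python
rows = {
    # model: (vision GFLOPs, vision seconds, total seconds)
    "LLaVA-1.5": (372, 0.033, 0.48),
    "EVE-7B": (42, 0.003, 0.48),
    "LLaVA-1.6 HD": (1860, 0.13, 2.07),
    "EVE-7B HD": (170, 0.013, 1.52),
}

def compare(encoder: str, encoder_free: str) -> tuple[float, float]:
    """Return (vision FLOPs ratio, fractional reduction in total time)."""
    enc_flops, _, enc_total = rows[encoder]
    free_flops, _, free_total = rows[encoder_free]
    return enc_flops / free_flops, (enc_total - free_total) / enc_total

for pair in (("LLaVA-1.5", "EVE-7B"), ("LLaVA-1.6 HD", "EVE-7B HD")):
    ratio, saved = compare(*pair)
    print(f"{pair[0]} vs {pair[1]}: {ratio:.1f}x FLOPs, {saved:.0%} less total time")
# LLaVA-1.5 vs EVE-7B: 8.9x FLOPs, 0% less total time
# LLaVA-1.6 HD vs EVE-7B HD: 10.9x FLOPs, 27% less total time
```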

Mono-InternVL reports TTFT improving from 0.45s to 0.15s at 1.8B scale, a 67% improvement.
Smaller models have relatively larger encoder overhead, so edge-targeted small encoder-free models benefit the most.

Google’s official claim for 12B Unified is “up to 40% TTFT improvement,” but resolution conditions and batch size details are not published.

GGUF/llama.cpp support status

12B Unified was released on June 2-3, 2026, so it is still early.
GGUF files are published by unsloth, covering 19 quantization levels from 2-bit at 4.21GB to BF16 at 23.8GB.
Q4_K_M runs at 7.12GB (8GB RAM), Q8_0 at 12.7GB (14GB RAM).

Text inference and image input are working.
Image input loads mmproj-BF16.gguf (946MB) separately and processes through llama-mtmd-cli.
Audio input is not yet stable.

Gemma4UnifiedForConditionalGeneration is a different architecture class from E4B’s Gemma4ForConditionalGeneration, requiring per-backend support.
Bidirectional attention for image tokens, raw-waveform audio input, and variable token budgets all need individual implementation on the backend side.
As covered in the Gemma 4 family article, E2B/E4B/26B A4B/31B have fundamentally different multimodal input paths from 12B Unified, so E4B working does not guarantee 12B Unified works too.
As written in the local RAG setup article with FastAPI, Chroma, and Open WebUI, a model understanding images and actually getting image results from a local UI are separate matters. Encoder-free eliminates encoder file management, but patching, pooling, projection, and bidirectional attention processing must be built into the backend.

References