2026.04 · Pebblous Data Communication Team
Reading time: ~18 min
Executive Summary
Google DeepMind's Gemma 4, released April 2, 2026, is more than a generational upgrade. The four-model family — E2B, E4B, 26B MoE, and 31B Dense — spans a single architectural lineage from smartphone chips to datacenter GPUs. Most significantly, it ships under a fully open Apache 2.0 license. Where previous Gemma releases carried custom licensing terms that required legal review before enterprise deployment, Gemma 4 imposes no restrictions on commercial use, redistribution, or fine-tuning derivatives.
Two architectural innovations deserve attention. First, Per-Layer Embeddings (PLE) inject a dedicated per-token signal into every decoder layer, giving small models the representational capacity of larger ones without proportional compute overhead. Second, the 26B MoE model activates only 3.8B of its 25.2B parameters during inference — delivering 26B-class reasoning quality at roughly 4B-class throughput. The result: Arena AI open leaderboard rank #6 performance on a consumer RTX GPU.
VentureBeat assessed the license change as "more significant than the benchmarks." For the first time, enterprises have a clear legal path to running frontier-capable open models entirely on-premises, with no data leaving the organization. For teams designing sovereign AI infrastructure and on-premises Data Greenhouses, Gemma 4 represents a meaningfully different set of options than anything available before.
Model Family Overview
According to Google's official blog, Gemma 4 was designed to deliver "frontier-class reasoning wherever you need it." The four sizes aren't simply points on a parameter scale — each is optimized for a specific deployment environment.
| Model | Effective Params | Total Params | Context | Input Modalities | Target Hardware |
|---|---|---|---|---|---|
| E2B | 2.3B | 5.1B | 128K | Text, Image, Audio | Smartphones, Raspberry Pi, NVIDIA Jetson |
| E4B | 4.5B | 8B | 128K | Text, Image, Audio | Laptop GPU, edge servers |
| 26B A4B MoE | 3.8B (active) | 25.2B | 256K | Text, Image | Consumer GPU, workstation |
| 31B Dense | 31B | 31B | 256K | Text, Image | NVIDIA H100 80GB (single GPU) |
The "E" prefix stands for Effective Parameters. The E2B behaves like a 2.3B model at inference but occupies 5.1B on disk. This gap is explained by Per-Layer Embeddings (PLE): the per-layer embedding tables add storage weight but minimal compute overhead during inference.
The "A" in 26B A4B stands for Active Parameters — only 3.8B of the model's 25.2B total parameters are activated per inference pass. Hugging Face noted that this achieves an Arena AI text score of 1,441 on just 3.8B active parameters — a ratio they described as "mind-blowing."
Two Deployment Tiers
Edge Tier (E2B, E4B): Mobile-first design. Built in collaboration with Qualcomm and MediaTek for fully offline operation. Native audio support (ASR, speech-to-translated-text). Available today in the Android AICore Developer Preview.
Workstation Tier (26B MoE, 31B Dense): Unquantized bfloat16 fits on a single H100 80GB. Quantized versions run on consumer GPUs. Both available serverless on Cloud Run with NVIDIA RTX Pro 6000, scaling to zero when idle.
Architecture Deep Dive
Hugging Face's technical analysis reveals that Gemma 4's architecture is built around a carefully chosen combination of proven techniques — optimized for compatibility, efficiency, and long-context support simultaneously.
Per-Layer Embeddings (PLE)
In a standard transformer, each token receives a single embedding vector at input, and that same vector forms the basis of the residual stream across all layers. This forces the embedding to front-load everything all layers might need — a compressive bottleneck.
PLE approaches this differently. For each token, it generates a small dedicated vector for every decoder layer, injected alongside the main residual stream. Two signals are combined: a token-identity component (via embedding lookup) and a context-aware component (a learned projection of the main embedding). Each layer receives a signal that activates only when token-specific information is relevant at that depth, removing the pressure to compress everything into one upfront embedding.
Why PLE is especially effective for small models
The PLE dimension is much smaller than the main hidden size, adding meaningful per-layer specialization at modest parameter cost. Storage and compute are decoupled — E2B occupies 5.1B on disk but runs like 2.3B. For multimodal inputs, PLE is computed before soft tokens replace the placeholders, so image and audio positions receive a neutral per-layer signal.
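The PLE mechanism described above can be sketched in a few lines of numpy. All shapes, scales, and the up-projection back into the hidden size are illustrative assumptions — the release doesn't publish the exact dimensions — but the structure follows the description: a per-layer token-identity lookup plus a context-aware projection of the main embedding, injected at each layer.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, HIDDEN, PLE_DIM, LAYERS = 1000, 64, 16, 4  # toy sizes, not Gemma 4's

# Main token embedding table and per-layer PLE tables (shapes assumed).
tok_emb = rng.standard_normal((VOCAB, HIDDEN)) * 0.02
ple_tables = rng.standard_normal((LAYERS, VOCAB, PLE_DIM)) * 0.02
# Learned projections: context-aware component from the main embedding,
# plus an up-projection back into the hidden size for injection.
ctx_proj = rng.standard_normal((LAYERS, HIDDEN, PLE_DIM)) * 0.02
up_proj = rng.standard_normal((LAYERS, PLE_DIM, HIDDEN)) * 0.02

def ple_signal(token_id: int, layer: int) -> np.ndarray:
    """Per-layer signal: token-identity lookup + context-aware projection."""
    h = tok_emb[token_id]                         # main embedding
    identity = ple_tables[layer, token_id]        # per-layer table lookup
    context = h @ ctx_proj[layer]                 # projection of main embedding
    return (identity + context) @ up_proj[layer]  # injected into the residual

# Storage vs compute decoupling: the per-layer tables dominate parameters
# on disk, but a forward pass touches only one table row per layer.
stored_ple_params = ple_tables.size
signal = ple_signal(42, layer=0)
```

Note how the parameter gap arises: the tables scale with VOCAB × PLE_DIM × LAYERS on disk, while per-token compute is just LAYERS row lookups and two small matmuls — the E2B's 5.1B-stored / 2.3B-effective split in miniature.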
Shared KV Cache
The last num_kv_shared_layers layers of the model skip computing their own key and value projections. Instead, they reuse the K and V tensors from the last non-shared layer of the same attention type (sliding or global).
Hugging Face reports minimal quality impact with significant efficiency gains in both memory and compute — particularly valuable for long-context generation and on-device use. It's a core mechanism enabling the 256K context window within practical memory budgets.
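The sharing scheme can be illustrated with a minimal cache-construction sketch. This version simplifies by ignoring the sliding/global split (the real model reuses K/V from the last non-shared layer of the *same* attention type); layer counts and shapes are toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
HIDDEN, HEAD = 32, 32
NUM_LAYERS, NUM_KV_SHARED = 6, 2  # last 2 layers reuse K/V (toy values)

wk = rng.standard_normal((NUM_LAYERS, HIDDEN, HEAD)) * 0.02
wv = rng.standard_normal((NUM_LAYERS, HIDDEN, HEAD)) * 0.02

def build_kv_cache(x: np.ndarray) -> list:
    """x: (seq, HIDDEN). Layers past the share boundary reuse the last
    owning layer's K/V tensors instead of projecting their own — no new
    projection compute, and no new cache memory for those layers."""
    cache = []
    last_owned = NUM_LAYERS - NUM_KV_SHARED - 1
    for layer in range(NUM_LAYERS):
        if layer <= last_owned:
            cache.append((x @ wk[layer], x @ wv[layer]))
        else:
            cache.append(cache[last_owned])  # shared reference, not a copy
    return cache

x = rng.standard_normal((5, HIDDEN))
cache = build_kv_cache(x)
# Only NUM_LAYERS - NUM_KV_SHARED distinct K/V entries are materialized.
unique_entries = len({id(kv) for kv in cache})
```

The memory saving is exactly the point: at 256K context, K/V tensors dominate the footprint, so every layer that shares rather than owns its cache shrinks the budget proportionally.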
Hybrid Attention: Sliding + Global
Gemma 4 alternates local sliding-window attention layers with global full-context attention layers. Smaller models (E2B, E4B) use 512-token sliding windows; larger models use 1,024 tokens. The final layer is always global attention. RoPE is also dual-configured: standard RoPE for sliding layers, Proportional RoPE for global layers to handle longer positions.
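The two mask types behind this hybrid can be sketched directly. The 1:1 alternation pattern below is an assumption for illustration — the source states only that layer types alternate and that the final layer is always global.

```python
import numpy as np

def sliding_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: token i attends to the last `window`
    positions, i.e. j in (i - window, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def global_mask(seq_len: int) -> np.ndarray:
    """Causal full-context mask: token i attends to every j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return j <= i

def layer_types(num_layers: int) -> list:
    """Assumed 1:1 alternation; the final layer is forced to global,
    matching the stated design."""
    types = ["sliding" if k % 2 == 0 else "global" for k in range(num_layers)]
    types[-1] = "global"
    return types

m = sliding_mask(8, window=3)
g = global_mask(8)
```

The efficiency argument falls out of the mask shapes: a sliding layer's attended positions are bounded by the window (512 or 1,024) regardless of sequence length, so only the global layers pay the full 256K-context cost.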
26B MoE: 128 Small Experts
VentureBeat specifically highlighted the architectural choice in the 26B MoE. While recent large MoE models typically use a handful of large experts, Gemma 4's 26B MoE uses 128 small experts, activating 8 per token plus one always-on shared expert.
The practical consequence is inference economics. The model benchmarks competitively with 27B–31B dense models while running at roughly 4B throughput. For production workloads — coding assistants, document processing pipelines, multi-turn agentic flows — fewer GPUs, lower latency, and cheaper per-token inference are direct outcomes.
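A minimal routing sketch makes the active/total split concrete: 128 experts, top-8 selection, plus one always-on shared expert, so only 9 of 129 expert FFNs run per token. Expert internals here are single toy matrices, and the gating details (softmax over the selected logits) are assumptions — the release doesn't specify the router.

```python
import numpy as np

rng = np.random.default_rng(2)
HIDDEN, NUM_EXPERTS, TOP_K = 64, 128, 8  # expert count/top-k from the release

router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02
experts_w = rng.standard_normal((NUM_EXPERTS, HIDDEN, HIDDEN)) * 0.02
shared_w = rng.standard_normal((HIDDEN, HIDDEN)) * 0.02

def moe_forward(h: np.ndarray) -> np.ndarray:
    """Route one token to its top-8 of 128 experts, plus the shared expert.

    Only the selected experts' weights are touched, which is where the
    ~3.8B-active / 25.2B-total parameter split comes from."""
    logits = h @ router_w
    top = np.argsort(logits)[-TOP_K:]           # indices of the top-8 experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                        # softmax over selected experts
    out = sum(g * (h @ experts_w[e]) for g, e in zip(gates, top))
    return out + h @ shared_w                   # always-on shared expert

h = rng.standard_normal(HIDDEN)
out = moe_forward(h)
active_expert_frac = (TOP_K + 1) / (NUM_EXPERTS + 1)  # 9 of 129 run per token
```

Many small experts rather than a few large ones means the router can specialize at finer granularity while the per-token FLOP budget stays fixed — the design choice VentureBeat singled out.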
Quantization-Aware Training (QAT) Checkpoints
Google ships QAT checkpoints alongside the bfloat16 originals, enabling quality-preserving quantization for consumer GPU deployment. Unlike standard post-training quantization, resilience to lower precision is built into the training process itself.
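The core QAT idea is a quantize-dequantize ("fake quant") step in the forward pass during training, so the weights learn to tolerate the rounding the deployed low-precision model will see. The sketch below uses symmetric per-tensor int8 as an illustration; Gemma 4's actual QAT recipe and bit widths are not specified in the source.

```python
import numpy as np

def fake_quant_int8(w: np.ndarray) -> np.ndarray:
    """Quantize-dequantize (symmetric, per-tensor int8).

    During QAT this runs in the forward pass; gradients flow through the
    rounding via a straight-through estimator (not shown here)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(3)
w = rng.standard_normal((4, 4))
w_q = fake_quant_int8(w)
max_err = np.abs(w - w_q).max()  # bounded by half the quantization step
```

Because the loss is computed on the fake-quantized weights throughout training, the checkpoint that ships already sits in a region of weight space where int8 rounding costs little quality — which is why QAT checkpoints beat post-hoc quantization of the bfloat16 originals.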
Multimodal Capabilities & Agentic Workflows
Previous generations of open models bolted multimodal capabilities onto text backbones. Vision encoders were added post-hoc; audio required external ASR pipelines like Whisper. Gemma 4 integrates these at the architecture level across all four models.
Vision Encoder: Variable Resolution + Configurable Token Budget
Gemma 4's vision encoder makes two key improvements over Gemma 3n. First, variable aspect-ratio support preserves original image proportions. Second, a configurable visual token budget (70, 140, 280, 560, or 1,120 tokens per image) lets developers trade off speed against detail quality.
Low budgets (70 tokens) work for classification and captioning; high budgets (1,120) handle OCR, document parsing, and fine-grained visual analysis. Multi-image and video inputs (processed as frame sequences) are supported natively.
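The token budget is also a context-window budget. A quick back-of-envelope calculation, using the release's budget tiers against the 256K window (the text reserve below is an arbitrary assumption), shows how the speed/detail trade-off translates into how many images fit in one pass:

```python
# Visual token budgets from the release, against the 256K context window.
BUDGETS = [70, 140, 280, 560, 1120]   # tokens per image (from the release)
CONTEXT = 256 * 1024                  # 256K context window

def images_that_fit(tokens_per_image: int, reserved_for_text: int = 32_000) -> int:
    """How many images fit alongside a text reserve (reserve is an assumption)."""
    return (CONTEXT - reserved_for_text) // tokens_per_image

capacity = {b: images_that_fit(b) for b in BUDGETS}
```

At the 70-token budget, thousands of frames fit in context — which is what makes frame-sequence video processing practical — while the 1,120-token OCR budget still leaves room for a couple of hundred document pages.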
Audio Encoder: Edge Models Only
The two edge models include a USM-style conformer audio encoder. The encoder was compressed from 681M to 305M parameters compared to Gemma 3n, while frame duration dropped from 160ms to 40ms for more responsive transcription. ASR and speech-to-translated-text run fully on-device.
For use cases that must keep data local — healthcare, field service, multilingual customer interaction — running ASR, translation, reasoning, and function calling in a single on-device model is a genuine architectural simplification.
Native Function Calling: The Agentic Foundation
Function calling is native across all four models, drawing on Google's FunctionGemma research. Unlike approaches that rely on instruction-following to coax models into structured tool use, Gemma 4's function calling was trained in from the ground up — optimized for multi-turn agentic flows with multiple tools.
Structured JSON output and native system instructions are also supported. The practical implication: less prompt engineering overhead when building tool-using agents, and more reliable tool invocation in production.
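On the consuming side, a tool-using agent loop reduces to parsing the model's structured output and dispatching it. The JSON shape and tool names below are illustrative assumptions, not Gemma 4's actual wire format — the point is that with trained-in structured output, the application code stays this small:

```python
import json

# Hypothetical tool registry -- names and return values are invented
# for illustration.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def dispatch(model_output: str) -> dict:
    """Parse a structured tool call emitted by the model and invoke it.

    With reliable JSON output, no regex extraction or retry-on-malformed
    prompting layer is needed."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# A structured call of the assumed shape the model might emit:
result = dispatch('{"name": "get_weather", "arguments": {"city": "Seoul"}}')
```

In a multi-turn agentic flow, the tool's return value would be serialized back into the conversation for the next model turn — the loop the FunctionGemma-derived training is optimized for.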
Multimodal Agentic Capabilities at a Glance
- GUI detection & pointing — outputs bounding boxes as JSON natively, no special prompting needed
- OCR & document parsing — high token budgets enable precise text extraction from complex layouts
- Video understanding — frame sequence processing with audio (edge models) or without (workstation models)
- Code generation — offline local AI coding assistant use case
- Multilingual — natively trained across more than 140 languages
Fine-tuning & Domain Adaptation
Google's official blog cited two direct examples of successful domain adaptation on Gemma. INSAIT used it to build BgGPT, a Bulgarian-first language model. Yale University applied it to Cell2Sentence-Scale to discover new pathways for cancer therapy. These are not hypothetical applications — they shipped before Gemma 4 even launched.
Fine-tuning Options
TRL (Transformers Reinforcement Learning)
Hugging Face upgraded TRL alongside the Gemma 4 launch to support multimodal tool responses during training — models can now receive images back from tools during environment interaction, not just text. The example script trains the E2B to drive in a CARLA simulator by learning from camera input and rewards. The same approach applies to robotics, web browsing, or any interactive visual environment.
Unsloth Studio
A UI-based fine-tuning platform. Installable on macOS, Linux, WSL, and Windows, running locally or on Google Colab. Works on gaming GPUs — no specialized hardware required to get started.
Vertex AI + Custom Docker
Hugging Face published a complete example for Vertex AI Serverless Training Jobs, including how to freeze the vision and audio towers while extending only the function-calling capability via SFT. It's the reference implementation for cloud-based enterprise fine-tuning workflows.
Fine-tuned derivatives are fully free to deploy commercially under Apache 2.0. The legal ambiguity that existed with previous Gemma custom licensing around derived models is gone.
Deployment Ecosystem
Gemma 4 launched with day-one support across the major inference and fine-tuning tools. Per Google's official announcement:
Local Inference
- Ollama
- LM Studio
- llama.cpp
- MLX (Apple Silicon)
- Mistral.rs
- Transformers.js (WebGPU)
Production Serving
- vLLM
- NVIDIA NIM & NeMo
- SGLang
- Baseten
- Docker
- Google Cloud Run (serverless)
Fine-tuning & Training
- Hugging Face Transformers + TRL
- Unsloth Studio
- Keras, MaxText, Tunix
- Vertex AI
- Google Colab
Hardware Optimization
- NVIDIA (Jetson Orin → Blackwell)
- AMD ROCm™
- Google Trillium & Ironwood TPU
- Qualcomm (mobile)
- MediaTek (mobile)
VentureBeat highlighted the Cloud Run serverless deployment as particularly notable. Running on NVIDIA RTX Pro 6000 GPUs, the configuration scales to zero when idle — you pay only for actual inference compute. For internal tools and lower-traffic applications, this significantly changes the economics of deploying open models in production.
Model weights are available for download from Hugging Face, Kaggle, and Ollama. Both pre-trained base models and instruction-tuned (IT) variants are released.
Benchmark Analysis
Google's published numbers show a clear generational leap. VentureBeat noted that these scores "would have been frontier-class from proprietary models not long ago."
| Benchmark | 31B Dense | 26B MoE | E4B | E2B | Gemma 3 27B (ref.) |
|---|---|---|---|---|---|
| AIME 2026 | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| GPQA Diamond | — | 82.3% | — | — | — |
| MMMU Pro (vision) | 76.9% | — | — | — | — |
| MATH-Vision | 85.6% | — | — | — | — |
| Codeforces ELO | 2,150 | — | — | — | — |
| Arena AI (text) | 1,452 (#3) | 1,441 (#6) | — | — | — |
Sources: Google DeepMind official release (2026.04.02), Arena AI leaderboard (as of 2026.04.01)
The jump from Gemma 3 27B's 20.8% to Gemma 4 31B's 89.2% on AIME 2026 is not incremental improvement — it reflects a qualitative shift in reasoning capability. That the 26B MoE achieves 88.3% with only 3.8B active parameters is the more operationally significant number.
The edge models are equally striking. The E4B hits 42.5% on AIME and 52.0% on LiveCodeBench — exceeding Gemma 3 27B on most benchmarks at roughly one-sixth the effective parameter count. Google's claim of "unprecedented intelligence-per-parameter" holds up to scrutiny.
Benchmark context
VentureBeat cautions that benchmarks need to be read against a competitive landscape that includes Qwen 3.5, GLM-5, and Kimi K2.5 — all aggressive competitors in this parameter range. What distinguishes Gemma 4 is less any single score and more the combination: strong reasoning, native multimodality across text, vision, and audio, built-in function calling, 256K context, and a genuinely permissive license — in a single model family.
The Apache 2.0 Shift
VentureBeat's Sam Witteveen wrote a dedicated piece on the license change: "Google releases Gemma 4 under Apache 2.0 — and that license change may matter more than benchmarks." It's not hyperbole.
Previous Gemma releases used a custom license with usage restrictions, terms Google could modify at will, and provisions requiring legal interpretation before commercial deployment. Enterprise teams that might have preferred Gemma's performance routinely chose Mistral or Alibaba's Qwen instead — both Apache 2.0 — because the legal review overhead was too high.
Gemma 4 eliminates that friction entirely. No custom clauses. No Harmful Use carve-outs requiring interpretation. No restrictions on redistribution or commercial deployment. Google's official announcement explicitly invokes "digital sovereignty" and "complete control over your data, infrastructure, and models."
The timing is notable: as some Chinese AI labs have begun pulling back from fully open releases for their latest models, Google is moving in the opposite direction.
What Apache 2.0 Changes in Practice
- Start evaluation without legal review — no procurement friction before testing
- Deploy fine-tuned derivatives commercially — no licensing ambiguity around derived models in production
- Run fully on-premises — no data leaves the organization, no API dependency
- Redistribute and wrap freely — containerized images, packaged services, SaaS offerings all permitted
Why Pebblous watches this research
Designing a sovereign on-premises Data Greenhouse requires a foundation model that can be domain-adapted, deployed entirely within organizational boundaries, and integrated into agentic data pipelines — simultaneously. Until now, meeting all three conditions in a single open model required real compromises.
Gemma 4's Apache 2.0 provides the legal foundation. The 26B MoE's inference economics provide the hardware foundation. Native function calling and 256K context windows connect directly to the Agentic AI Data Scientist architecture — where models need to interact with data pipelines, call APIs, and reason over long document contexts in a single pass. The architecture-level multimodal integration maps to the multi-layer structure of a Data Greenhouse, where structured and unstructured data coexist. Pebblous is evaluating Gemma 4 as a candidate foundation layer for Data Greenhouse infrastructure.