Gemma 4 is a big deal
Google just released Gemma 4, and it deserves more attention than a typical model drop. This isn't an incremental update. It's a family of four open models, released under Apache 2.0, that redefine what's possible at every parameter scale, from phones to single-GPU workstations. The 31B dense model currently sits as the #3 open model on the Arena AI text leaderboard. The smaller E4B edge model beats the previous generation's 27B model on most benchmarks at roughly one-sixth the size. And all of it is genuinely open source. Here's why this matters.
The lineup
Gemma 4 ships in four sizes, each designed for a specific hardware tier:
- Gemma 4 E2B: 2.3B effective parameters (5.1B total with embeddings), 128K context window. Built for phones and IoT devices.
- Gemma 4 E4B: 4.5B effective parameters (8B total), 128K context. The sweet spot for edge devices with native audio, video, and image processing.
- Gemma 4 26B A4B: A 26B-parameter Mixture of Experts model that activates only 3.8B parameters per token. 256K context. Designed for fast inference on consumer GPUs.
- Gemma 4 31B: A 31B dense model with 256K context. The flagship for raw quality and fine-tuning.
The naming convention alone signals a shift in how we talk about models. The "E" prefix means effective parameters: total parameters minus the large but compute-cheap embedding tables introduced by Per-Layer Embeddings. The "A" prefix means active parameters: the fraction of a Mixture of Experts model that actually fires per token. Both exist because total parameter count has become a misleading metric.
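To make the two prefixes concrete, here is the arithmetic in miniature, using only the numbers from the lineup above. This is a back-of-the-envelope illustration, not anything from the model cards:

```python
# Parameter bookkeeping using the lineup's own numbers (billions of params).
e2b_total, e2b_effective = 5.1, 2.3
print(f"E2B PLE embedding tables: ~{e2b_total - e2b_effective:.1f}B params")
# -> ~2.8B params: large on disk, but nearly free in per-token compute

moe_total, moe_active = 26.0, 3.8
print(f"26B A4B fires {moe_active / moe_total:.0%} of its weights per token")
# -> 15% of the total weights touched for any given token
```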
The benchmarks are not subtle
The generational leap here is striking. On the AIME 2026 math competition benchmark, the 31B model scores 89.2%. The previous generation Gemma 3 27B scored 20.8%. That's not a marginal improvement; it's more than a four-fold increase on a challenging reasoning benchmark from a model of roughly the same size.

The 31B model ranks #3 among all open models on the Arena AI text leaderboard, and the 26B MoE secures #6, outcompeting models 20x its size. On Hugging Face, the estimated LMArena scores put the 31B at 1452 and the 26B MoE at 1441 with just 4B active parameters. For context, the 26B model achieves near-identical arena scores to the dense 31B while doing roughly one-eighth the compute per token.

The smaller models tell a similar story. The E4B exceeds Gemma 3 27B on most benchmarks despite being a fraction of the size. These aren't cherry-picked results. The improvements span math, reasoning, coding, instruction following, and multimodal understanding.
Architecture choices that matter
What makes Gemma 4 interesting isn't just the numbers. It's the engineering decisions behind them.

Per-Layer Embeddings (PLE) give every decoder layer its own small embedding lookup for each token, rather than relying on a single shared embedding at input. This lets each layer receive token-specific information only when it becomes relevant. The embedding tables are large in storage but cheap in compute, which is why the effective parameter count is much lower than the total. For the smaller models, this is the key innovation that lets a 5.1B-parameter file perform like a 2.3B model in terms of compute cost.

Shared KV cache means the last N layers of the model reuse key-value states from earlier layers instead of computing their own. This reduces both memory and compute during inference with minimal quality impact, which is critical for long-context and on-device use cases.

Hybrid attention alternates between local sliding-window attention (512 tokens for smaller models, 1024 for larger ones) and full global attention layers. Combined with dual RoPE configurations (standard for sliding layers, proportional for global layers), this enables the 128K and 256K context windows without the quadratic cost of full attention everywhere.

Native multimodality spans the entire family. All four models process images and video with variable aspect ratio support. The two edge models add native audio input. The vision encoder uses learned 2D positions and can encode images at different token budgets (70, 140, 280, 560, 1120), letting developers trade between speed, memory, and quality.

These aren't flashy features. They're practical engineering trade-offs that make the models deployable in real environments.
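To see what the PLE idea looks like in practice, here is a minimal PyTorch sketch of a decoder layer with its own per-token embedding table. The dimensions, projection, and wiring are illustrative assumptions, not Gemma 4's actual implementation:

```python
# Minimal sketch of Per-Layer Embeddings (PLE): each layer owns a small
# embedding table, so token-specific information can be injected at the
# layer where it becomes relevant. Sizes here are toy values.
import torch
import torch.nn as nn

class PLEDecoderLayer(nn.Module):
    def __init__(self, vocab_size: int, hidden_dim: int, ple_dim: int):
        super().__init__()
        # Storage scales with vocab_size, but per-token compute is just
        # one lookup plus one small projection.
        self.ple = nn.Embedding(vocab_size, ple_dim)
        self.ple_proj = nn.Linear(ple_dim, hidden_dim, bias=False)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # Inject token-specific information at THIS layer, rather than
        # relying only on the shared input embedding.
        hidden = hidden + self.ple_proj(self.ple(token_ids))
        attn_out, _ = self.attn(hidden, hidden, hidden)
        hidden = hidden + attn_out
        return hidden + self.mlp(hidden)

tokens = torch.randint(0, 32_000, (1, 16))              # (batch, seq)
layer = PLEDecoderLayer(vocab_size=32_000, hidden_dim=512, ple_dim=64)
out = layer(torch.randn(1, 16, 512), tokens)
print(out.shape)  # torch.Size([1, 16, 512])
```

The trade-off is visible in the sketch: the `nn.Embedding` tables dominate the parameter count, but each forward pass only touches one row per token, which is why they count against total size and not effective size.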
Apache 2.0 changes everything
Previous Gemma models shipped with Google's custom terms of service. Llama models have their own restrictive license. These licensing constraints have been a real barrier to adoption, especially at mid-sized and large companies where legal review of custom AI licenses can take weeks or months.

Gemma 4 ships under Apache 2.0, one of the most permissive open-source licenses available. No usage restrictions. No revenue thresholds. No special conditions for commercial use. This is a genuine open-source release, not "open weights with strings attached."

The timing matters. Chinese open model labs like Qwen and DeepSeek have been shipping with permissive licenses for over a year, and their adoption has surged partly because of it. Google is matching that standard, and for enterprises that care about model provenance and want a U.S.-built alternative, the combination of Apache 2.0 plus Google's name is compelling. The developer community has already downloaded Gemma models over 400 million times, building more than 100,000 variants in the broader Gemmaverse. Removing the licensing friction should accelerate that significantly.
On-device AI gets real
The edge models are arguably the most important part of this release. Running AI locally, without cloud dependency, has been a goal for years. Gemma 4 makes it practical.

The E2B and E4B models were engineered specifically for phones, Raspberry Pi, and NVIDIA Jetson Orin Nano. They run completely offline with near-zero latency. Google worked directly with Qualcomm and MediaTek to optimize for their mobile hardware, and Android developers can prototype agentic flows in the AICore Developer Preview today.

This isn't about running a chatbot on your phone. It's about building autonomous AI workflows (function calling, multi-step planning, structured JSON output) that execute entirely on-device. No API calls. No data leaving the device. No inference costs.

The E4B model processes text, images, video, and audio natively. You can build a local assistant that watches your screen, listens to a conversation, reads a document, and takes structured action, all running on a phone chipset. The 128K context window means it can handle substantial input without truncation.

Gartner predicts that by 2027, organizations will use small, task-specific AI models at least three times more than general-purpose LLMs. Gemma 4's edge models are positioned squarely for that shift.
The agentic angle
All Gemma 4 models support native function calling and structured JSON output. This is table stakes for agentic AI, but Google has gone further here than most open models. Native system prompt support means developers can set persistent behavioral instructions without prompt-engineering workarounds. The models are trained for multi-step planning and can interact with external tools and APIs as part of autonomous workflows. Combined with the context windows (128K for edge, 256K for the larger models), you can pass entire codebases or document sets in a single prompt and have the model reason over them while calling tools.

The 26B MoE model is particularly interesting for agents. With only 3.8B active parameters per token, it generates tokens fast, which matters when your agent is making multiple sequential tool calls. But it draws on the full 26B parameter space for quality, which means it doesn't sacrifice intelligence for speed.
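The loop at the heart of this pattern is simple enough to sketch. The following is a minimal, model-agnostic illustration of a function-calling agent; the `generate()` stub stands in for any Gemma 4 runtime, and the tool schema and names are assumptions for the example, not an official API:

```python
# Minimal function-calling loop: the model either emits a JSON tool call
# or a plain-text final answer; tool results are fed back as messages.
import json

TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

SYSTEM = (
    "You can call tools by replying with JSON: "
    '{"tool": "<name>", "args": {...}}. Otherwise answer in plain text.'
)

def generate(system: str, messages: list[dict]) -> str:
    # Placeholder: swap in a real Gemma 4 call (Transformers, Ollama, vLLM...).
    if any(m["role"] == "tool" for m in messages):
        return "It's 21 degrees C in Oslo."
    return '{"tool": "get_weather", "args": {"city": "Oslo"}}'

def run_agent(user_msg: str, max_steps: int = 4) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = generate(SYSTEM, messages)
        try:
            call = json.loads(reply)          # structured-output path
        except json.JSONDecodeError:
            return reply                      # plain-text final answer
        result = TOOLS[call["tool"]](**call["args"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "step limit reached"

print(run_agent("What's the weather in Oslo?"))
```

Fast token generation matters precisely because an agent like this round-trips through the model once per tool call, which is where the 26B MoE's 3.8B active parameters pay off.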
The competitive picture
Gemma 4's main competitor in the ~30B class is Qwen 3.5 27B, which has been the default choice for researchers and enterprises in this size range. Early benchmarks suggest the two are very close in performance, with Gemma 4 having an edge in some areas and Qwen in others. But as Nathan Lambert at Interconnects AI pointed out, Gemma 4's success will be "entirely determined by ease of use, to a point where a 5-10% swing on benchmarks wouldn't matter at all." Previous Gemma models were plagued by tooling issues and degraded performance when fine-tuned. If Google has fixed those problems, and the Apache 2.0 license suggests they're serious about ecosystem adoption, then Gemma 4 could become the default open model for a lot of use cases.

Day-one support is already broad: Hugging Face Transformers, vLLM, llama.cpp, MLX, Ollama, NVIDIA NIM, LM Studio, Keras, and many others. That's a strong start, but the real test will be how well these integrations hold up in production over the coming weeks.
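For a sense of what "ease of use" looks like through one of those day-one integrations, here is a quick-start sketch via Hugging Face Transformers. The checkpoint name `google/gemma-4-e4b-it` is a guess at the naming scheme, not a confirmed model ID; check the Gemma 4 collection on the Hub for the real identifiers:

```python
# Quick-start via the Transformers text-generation pipeline with a chat
# prompt. Requires `transformers` and `accelerate`; model ID is assumed.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-4-e4b-it",  # hypothetical checkpoint name
    device_map="auto",
)
out = pipe(
    [{"role": "user", "content": "Summarize the Gemma 4 lineup in one sentence."}],
    max_new_tokens=128,
)
print(out[0]["generated_text"][-1]["content"])
```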
What this means
Gemma 4 is significant because it compresses frontier-level capabilities into sizes that are actually deployable. A 31B model that ranks #3 globally can run on a single 80GB GPU. A 26B MoE that ranks #6 can run on consumer hardware in quantized form. An edge model that beats last generation's flagship can run on your phone.

This is the trajectory that matters in AI right now. Not who can build the biggest model, but who can deliver the most intelligence per parameter, per dollar, per watt. Google has been investing heavily in efficiency research, from TurboQuant compression to the architectural innovations in Gemma 4, and the results are showing.

For developers and builders, the practical takeaway is simple: if you haven't evaluated open models recently, now is the time. The gap between open and closed models has narrowed dramatically, and for many specific tasks, a fine-tuned Gemma 4 may outperform a general-purpose frontier model at a fraction of the cost. The era of needing a massive GPU cluster to run competitive AI is ending. Gemma 4 is one of the clearest signs yet.
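A quick sanity check on those hardware claims, counting weights only and assuming bf16 for the dense model and 4-bit quantization for the others. KV cache, activations, and runtime overhead push real requirements somewhat higher:

```python
# Rough weight-memory arithmetic behind the deployability claims above.
def weight_gib(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 2**30

print(f"31B dense @ bf16:  {weight_gib(31, 2):.0f} GiB")   # ~58 GiB -> one 80GB GPU
print(f"26B MoE  @ 4-bit:  {weight_gib(26, 0.5):.0f} GiB") # ~12 GiB -> consumer GPU
print(f"E4B (8B) @ 4-bit:  {weight_gib(8, 0.5):.0f} GiB")  # ~4 GiB  -> phone-class
```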
References
- Gemma 4: Byte for byte, the most capable open models, Google Blog, April 2026
- Welcome Gemma 4: Frontier multimodal intelligence on device, Hugging Face Blog, April 2026
- Google's Gemma 4 Runs Frontier AI On A Single GPU, Forbes, April 2026
- Gemma 4 and what makes an open model succeed, Interconnects AI, April 2026
- Google's Gemma 4: Is it the Best Open-Source Model of 2026?, Analytics Vidhya, April 2026
- Bring state-of-the-art agentic skills to the edge with Gemma 4, Google Developers Blog, April 2026
- Gemma 4 on Google Cloud, Google Cloud Blog, April 2026
- Gemma 4 model overview, Google AI for Developers