Effective vs. Active Parameters
Every time a new model drops, you see names like E2B, A4B, 30B-A3B. Most people gloss over these suffixes, but if you actually want to understand what you're running on your hardware, the letters matter. Let's break down what E and A mean, because they're not the same thing, and they subtract different things from the total.
Total parameters
This is the raw number. Every single weight in the model file. If a model has 25.2 billion weights, that's 25.2B total parameters. Simple. But total parameters alone doesn't tell you how the model actually behaves at inference time. Two models with the same total can perform very differently depending on their architecture.
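Counting the total is something you can check yourself. Here's a minimal PyTorch sketch; the toy model is a stand-in for whatever checkpoint you actually load:

```python
import torch.nn as nn

# Stand-in model; substitute whatever checkpoint you're inspecting.
model = nn.Sequential(nn.Embedding(32_000, 2048), nn.Linear(2048, 2048))

# Total parameters: every weight in the file, regardless of when it runs.
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total / 1e9:.3f}B")
```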
E: Effective parameters
Google introduced this with Gemma 4's smaller models (E2B, E4B). The "E" stands for effective. These models use a technique called Per-Layer Embeddings (PLE). Instead of one shared embedding table, PLE gives every decoder layer its own small embedding lookup for each token. This improves quality, but the extra tables are heavy in storage while cheap in compute: they're lookup tables, not matrix multiplications. So Google separates them out:
| | Gemma 4 E2B | Gemma 4 E4B |
|---|---|---|
| Total parameters | 5.1B | ~8B |
| PLE embeddings | ~2.8B | ~4B |
| Effective parameters | ~2.3B | ~4B |
The "E" prefix is Google saying: the model file is bigger than you'd expect, but the compute-relevant part is only this much. It's subtracting the storage-heavy but compute-cheap embedding layers. These are dense models, so all effective parameters are active on every token. E = A here.
A: Active parameters
The "A" prefix comes from Mixture-of-Experts (MoE) architecture. A MoE model has many specialized sub-networks called experts, but only a few are activated per token. A router decides which experts handle each piece of input. Take Gemma 4's 26B-A4B:
- Total parameters: 25.2B
- Total experts: 128 + 1 shared
- Active experts per token: 8 + 1 shared
- Active parameters per token: 3.8B
So 25.2B parameters exist in the model, but only 3.8B are doing work for any given token. The rest sit idle. Which experts activate can change from token to token, which is why you still need all 25.2B in memory for best performance. The "A" prefix is saying: this is how much compute actually fires per forward pass.
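Here's a minimal PyTorch sketch of that routing pattern. The sizes are toy numbers rather than Gemma's real dimensions, and the per-token loop is deliberately naive (real implementations batch tokens by expert), but the "only top-k experts fire" idea is the same:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, hidden: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(hidden, n_experts)   # scores every expert per token
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_experts))
        self.shared = nn.Linear(hidden, hidden)      # always-on shared expert
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # pick k experts per token
        weights = weights.softmax(dim=-1)
        out = self.shared(x)                          # the shared expert always fires
        for t in range(x.size(0)):                    # naive per-token loop, for clarity
            for w, e in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.experts[int(e)](x[t])  # only chosen experts run
        return out

moe = TinyMoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64]); only 2 of 8 experts ran per token
```

All eight expert weight matrices live in memory the whole time; per token, six of them do nothing.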
The key difference
E subtracts parameters that are cheap to use (lookup tables from PLE). They exist, they contribute to quality, but they barely cost compute. A subtracts parameters that are dormant (inactive MoE experts). They exist, they'll be used for other tokens, but they're not active right now. Both prefixes exist because "total parameters" has become a misleading headline number. A 5.1B model that computes like a 2.3B model and a 25.2B model that computes like a 3.8B model are fundamentally different from dense 5B and 25B models.
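The arithmetic behind those two sentences, using this post's rounded numbers:

```python
# E: subtract storage-heavy, compute-cheap lookup tables.
total_e2b, ple_e2b = 5.1, 2.8           # billions, rounded
print(f"E2B effective: {total_e2b - ple_e2b:.1f}B")                 # ~2.3B

# A: subtract experts that sit dormant for the current token.
total_moe, active_moe = 25.2, 3.8       # billions, rounded
print(f"26B-A4B dormant per token: {total_moe - active_moe:.1f}B")  # ~21.4B
```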
The full picture
| Model | Total | E (Effective) | A (Active) | Architecture |
|---|---|---|---|---|
| Gemma 4 E2B | 5.1B | ~2.3B | ~2.3B | Dense + PLE |
| Gemma 4 E4B | ~8B | ~4B | ~4B | Dense + PLE |
| Gemma 4 31B | 31B | 31B | 31B | Dense |
| Gemma 4 26B-A4B | 25.2B | 25.2B | 3.8B | MoE (128 experts) |
For a plain dense model like the 31B, all three numbers are the same. E and A only diverge when PLE or MoE are in play.
Why this matters for you
- Memory: You need enough RAM/VRAM for the total parameters (or close to it). Even in MoE, all experts should be in fast memory.
- Speed: Inference speed correlates with active parameters. Lower A = faster token generation, since your GPU processes less data per token.
- Quality: Scales more with total and effective parameters. More experts or richer embeddings mean better representations, even if they're not all active simultaneously.

So next time you see a model name with letters and numbers, you'll know exactly what trade-offs are being made. The letters aren't marketing fluff. They're telling you how the model actually runs.
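To put rough numbers on that memory/speed trade-off, here's a back-of-envelope sketch. The 400 GB/s bandwidth figure and fp16 (2 bytes per parameter) weights are illustrative assumptions, and the speed is a memory-bound ceiling, not a benchmark:

```python
# Back-of-envelope sizing: memory tracks total parameters, decode speed
# tracks active parameters. All inputs here are assumptions, not measurements.
def estimate(total_b: float, active_b: float,
             bytes_per_param: float = 2.0,       # fp16/bf16 weights (assumed)
             mem_bandwidth_gbs: float = 400.0):  # assumed GPU memory bandwidth
    mem_gb = total_b * bytes_per_param           # weights you must hold resident
    # Memory-bound decoding: each token streams the active weights once.
    tok_per_s = mem_bandwidth_gbs / (active_b * bytes_per_param)
    return mem_gb, tok_per_s

for name, total, active in [("26B-A4B (MoE)", 25.2, 3.8), ("31B (dense)", 31.0, 31.0)]:
    mem_gb, tok_per_s = estimate(total, active)
    print(f"{name}: ~{mem_gb:.0f} GB of weights, ~{tok_per_s:.0f} tok/s ceiling")
```

Under these assumptions the two models need comparable memory, but the MoE decodes roughly eight times faster: the A number, not the total, sets the speed ceiling.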