Meta Launches Open-Weight Llama 4 Scout and Maverick: Native MoE, Multimodal, 10-Million-Token Context
On April 5, 2026, Meta released Llama 4 Scout and Maverick — the first Llama models built on Mixture-of-Experts with native multimodality. Scout: 17B active params, 16 experts, 10M context. Maverick: 17B active, 128 experts, beats GPT-4o.

On April 5, 2026, Meta shipped Llama 4 Scout and Llama 4 Maverick, the first two models of a new family built from the ground up as natively multimodal on a Mixture-of-Experts architecture. Scout ships with a 10-million-token context window, the largest ever released as open weights. Maverick beats GPT-4o and Gemini 2.0 Flash on most mainstream benchmarks while activating only 17 billion parameters per forward pass. Both are downloadable on Hugging Face and llama.com. It's Meta's comeback after the perceived miss of Llama 3.3 and the $14 billion poured into Scale AI and its CEO Alexandr Wang.
Two Models, One MoE Architecture
Llama 4 closes the book on dense models. Scout and Maverick are both built around a Mixture-of-Experts (MoE) architecture, a technique in which the feed-forward layers are split into multiple specialized "experts" and only a small subset is activated for each token. That's what lets a giant model (400 billion total parameters for Maverick) run inference at the cost of a 17-billion-parameter model.
| Model | Active Params | Experts | Total Params | Context |
|---|---|---|---|---|
| Llama 4 Scout | 17B | 16 | ~109B | 10M tokens |
| Llama 4 Maverick | 17B | 128 | 400B | 1M tokens |
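The routing idea behind these numbers can be sketched in a few lines. This is a toy illustration, not Llama 4's actual router: a learned gate scores all experts per token, only the top-k experts actually run, and their outputs are mixed by the gate weights. Compute therefore scales with k, not with the total expert count.

```python
import numpy as np

def moe_forward(x, experts, router_w, k=1):
    """Toy top-k MoE layer: route each token to its k best-scoring experts.

    x:        (tokens, d) input activations
    experts:  list of (d, d) expert weight matrices
    router_w: (d, n_experts) router weights
    """
    logits = x @ router_w                        # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k best experts
    sel = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(sel - sel.max(-1, keepdims=True))  # softmax over selected only
    gates /= gates.sum(-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                  # only k experts run per token
        for j, e in enumerate(topk[t]):
            out[t] += gates[t, j] * (x[t] @ experts[e])
    return out, topk

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 16, 4                  # 16 experts, like Scout
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=(tokens, d))
y, routed = moe_forward(x, experts, router_w, k=1)
```

With k=1 and 16 experts, each token touches one sixteenth of the expert weights, which is the mechanism behind the "400B total, 17B active" headline figures.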
Scout is positioned as the efficient model — small on active compute, enormous on context capacity. Ten million tokens is roughly 7,500 pages. An entire codebase. A complete book with every reference. A full legal corpus for a contract. It's the model designed for agentic workflows that spend their lives juggling heavy knowledge bases.
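A 10-million-token window is not just a bigger number; the KV cache alone becomes the dominant memory cost. A back-of-envelope estimate, using hypothetical layer dimensions (NOT Scout's published config), shows the scale of the problem Meta's attention design has to tame:

```python
def kv_cache_gb(context_len, n_layers, n_kv_heads, head_dim, bytes_per_val=2):
    """Naive KV-cache size: 2 tensors (K and V) per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_val
    return context_len * per_token / 1e9

# Illustrative dimensions only, chosen for the sake of the arithmetic:
cache_gb = kv_cache_gb(10_000_000, n_layers=48, n_kv_heads=8, head_dim=128)
# ~1966 GB: roughly 2 TB of cache for a single full-length sequence
```

Whatever attention tricks the production model uses to compress this, the naive math explains why no one had shipped a 10M-token open-weight model before.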
Maverick is positioned as the performant model. 128 experts, 400 billion total parameters, benchmarks beating GPT-4o and Gemini 2.0 Flash while running at roughly 60% of DeepSeek V3's compute cost.
Native Multimodality, Not Bolt-On
This is what distinguishes Llama 4 from earlier generations. Text, images, and video are processed by the same layers — not through a visual encoder attached after the fact. Meta calls it "early fusion": text tokens and visual tokens share the same embedding space from the first transformer layer.
The consequences are practical. A question about a video frame can reference text in the same inference without a round-trip. A visual agent can reason on a screenshot and a log in parallel. The model can generate image descriptions consistent with the tone of an upstream document.
For developers building AI agents, the difference is massive: no more chaining GPT-4o vision + GPT-4o text. Maverick does both in the same pass, with the same contextual coherence.
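The "early fusion" idea reduces to this: project text tokens and image patches into one shared embedding space, concatenate, and feed the fused sequence through the same transformer stack from layer one. A minimal sketch, with toy dimensions and random weights standing in for the real embedding tables and vision tower:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Stand-ins for learned components (hypothetical shapes):
text_embed = rng.normal(size=(1000, d_model))  # toy vocab of 1000 tokens
patch_proj = rng.normal(size=(768, d_model))   # projects 768-dim patch features

text_ids = np.array([5, 42, 7])                # e.g. "describe this image"
img_patches = rng.normal(size=(16, 768))       # 16 patches from a vision tower

text_tokens = text_embed[text_ids]             # (3, d_model)
image_tokens = img_patches @ patch_proj        # (16, d_model)

# One fused sequence: the FIRST transformer layer already attends
# across both modalities, instead of a vision encoder bolted on later.
fused = np.concatenate([text_tokens, image_tokens], axis=0)  # (19, d_model)
```

The contrast with bolt-on designs is that there, cross-modal interaction only happens through a late adapter; here every attention layer sees both token types.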
Benchmarks: Maverick Beats GPT-4o, Scout Crushes Its Tier
The numbers published by Meta (to be taken with the usual caveat for self-reported benchmarks) show Maverick ahead on MMLU, MATH, HumanEval and ChartQA versus GPT-4o and Gemini 2.0 Flash. On MATH, Maverick hits 78.5% versus 76.6% for GPT-4o. On HumanEval (coding), 83% versus 80%.
| Benchmark | Llama 4 Maverick | GPT-4o | Gemini 2.0 Flash |
|---|---|---|---|
| MMLU | 85.2% | 85.7% | 83.9% |
| MATH | 78.5% | 76.6% | 76.8% |
| HumanEval | 83.0% | 80.0% | 79.5% |
| ChartQA | 90.0% | 85.7% | 85.5% |
Scout, in its tier (sub-20B active), crushes Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 on every public benchmark. Meta reports successful needle-in-a-haystack retrieval across the full 10-million-token window.
Keep in mind that Maverick's weights total 400 billion parameters. Even open-weight, it's hard to run: BF16 weights alone come to roughly 800 GB, which means a full 8×H200 node (about 1.1 TB of combined VRAM), or aggressive quantization to drop to fewer GPUs. Scout's ~109B total parameters are far more accessible; Meta's stated target is a single H100 with Int4 quantization.
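The back-of-envelope arithmetic is simple: weight memory is total parameters times bytes per parameter (2 for BF16, 0.5 for Int4), and real serving needs extra headroom on top for KV cache and activations:

```python
def weight_memory_gb(total_params_b, bytes_per_param):
    """Memory for the weights alone: no KV cache, no activation headroom."""
    return total_params_b * 1e9 * bytes_per_param / 1e9  # GB

maverick_bf16 = weight_memory_gb(400, 2)    # 800 GB  -> multi-GPU node
maverick_int4 = weight_memory_gb(400, 0.5)  # 200 GB
scout_bf16    = weight_memory_gb(109, 2)    # 218 GB
scout_int4    = weight_memory_gb(109, 0.5)  # 54.5 GB -> single 80 GB H100 territory
```

Note the MoE subtlety: ALL 400B weights must sit in memory even though only 17B fire per token, so memory is priced by total parameters while compute is priced by active ones.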
What's Next: Behemoth, the Real Monster
Scout and Maverick are just the first two releases. Meta confirmed Llama 4 Behemoth — a 2 trillion total parameter model, still training, that will serve as the "teacher model" for distilling smaller models. Announcement scheduled for LlamaCon on April 29.
If Behemoth delivers, it will be the largest open-weight model ever published, surpassing the 1-trillion-parameter DeepSeek V4 that dominated the conversation a few weeks ago. Behemoth's arrival is what's keeping OpenAI awake: a frontier model, open-weight, downloadable, with no commercial restriction below 700 million monthly active users (Meta's usual license threshold).
The Context: $14B to Alexandr Wang, Pressure on Zuckerberg
Llama 4 is the first major model since Meta spent $14 billion to acquire Scale AI and install its CEO Alexandr Wang as Chief AI Officer. The pressure to deliver was enormous. CNBC noted that the market was watching this release as a verdict on the Wang investment.
Early reactions are mixed. Benchmarks are solid. Scout's 10M context is impressive. But Maverick doesn't beat Claude Sonnet 4.6 or GPT-4.5 on complex reasoning tasks. And the fact that Meta is holding Behemoth for LlamaCon on April 29 suggests the real frontier response is still to come.
| Fact | Detail |
|---|---|
| Release date | April 5, 2026 |
| License | Llama 4 Community License (commercial OK < 700M MAU) |
| Available models | Scout (17B active, 10M context) + Maverick (17B active, 128 experts, 400B total) |
| Coming | Behemoth (2T params) — announcement at LlamaCon April 29 |
| Meta 2026 AI CapEx | $60-65 billion |
| Scale AI + Alexandr Wang acquisition | $14 billion |
Why It Matters for the Open-Weight Ecosystem
For sovereignty-conscious enterprises. A GPT-4o-grade model in open-weight changes the game. A bank can now run Maverick on its private infrastructure without sharing any data with OpenAI or Google.
For research. A 10M-token open-weight context enables experiments on long-range workflows that only Gemini 2.5 Pro and Claude Mythos allowed, and those are closed.
For agents. AI agent frameworks (LangChain, CrewAI, AutoGen) had no open-weight model capable of sustaining a long workflow without losing context. Scout unlocks that.
For inference economics. Native MoE reduces inference cost at equivalent total parameters. Inference providers (Together AI, Groq, Fireworks) can serve Maverick at a price close to GPT-4o Mini while offering GPT-4o quality.
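The economics argument comes straight from the MoE arithmetic. Using the standard rough estimate of ~2 FLOPs per active parameter per generated token (a simplification that ignores attention cost and the memory bandwidth needed to hold all 400B weights):

```python
def flops_per_token(active_params_b):
    """Rough decode cost: ~2 FLOPs per active parameter per token."""
    return 2 * active_params_b * 1e9

dense_400b = flops_per_token(400)  # a hypothetical dense 400B model
maverick   = flops_per_token(17)   # only 17B of Maverick's 400B fire per token

speedup = dense_400b / maverick    # ~23.5x less compute per token
```

That ~23x compute gap at equal total capacity is what lets inference providers price a 400B-class model near small-model rates.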
What's Missing
Not everything shines:
- No dedicated reasoning model: nothing equivalent to o1 or Claude Mythos for extended reasoning.
- No video generation: Llama 4 ingests video but doesn't generate it.
- Self-reported benchmarks: LMArena and independent evaluations still need to confirm the numbers.
- MoE latency: routing across experts creates branching that complicates batching and latency under load.
TL;DR:
- Meta released Llama 4 Scout and Maverick as open weights on April 5, 2026, on Hugging Face and llama.com
- Native Mixture-of-Experts architecture: Scout (17B active, 16 experts, 10M token context), Maverick (17B active, 128 experts, 400B total params)
- Multimodal by design: text + image + video on the same layers with early fusion
- Maverick beats GPT-4o and Gemini 2.0 Flash on MATH (78.5%), HumanEval (83%), ChartQA (90%)
- Llama 4 Behemoth (2 trillion params) announced for LlamaCon April 29
- First major model since the $14 billion Scale AI / Alexandr Wang acquisition
Llama 4 puts Meta back in the frontier race many saw them exiting after Llama 3.3. The architectural choice sends a signal: native MoE plus native multimodality plus massive context is exactly the template DeepSeek and Mistral are following. Frontier-grade open weights are becoming the standard, not the exception. What remains to be proven is whether Behemoth, on April 29, justifies the $14 billion invested in Scale AI and Alexandr Wang. If it does, the pressure on OpenAI and Anthropic becomes existential: why pay $20 per million tokens for Claude Opus when you can run a comparable model in-house? The answer will be the main story of H2 2026.
Sources: Meta AI — The Llama 4 herd, Hugging Face — Llama 4 release, IBM — watsonx.ai availability, CNBC — first major model since Wang.


