Meta Built the Open-Source AI Ecosystem. Then It Faked the Numbers and Left.
Yann LeCun — Meta's departing chief AI scientist — said on the record that the Llama 4 benchmarks were "fudged a little bit." What came next wasn't a correction but a full strategic exit: Behemoth shelved, Muse Spark shipped as a closed API product, and Llama 5 pointed at wearables instead of Hugging Face. The company that created the open-weight ecosystem has walked away from it, and the vacuum it's leaving is being contested by Chinese labs and Mistral — none of whom are immune to the same economics that bent Meta.
The four most damning words in AI this year came from the man who spent a decade building Meta's credibility as an AI research institution. Yann LeCun, in a Financial Times interview after his departure from Meta, said the Llama 4 benchmarks were "fudged a little bit." Not a leak, not a disgruntled engineer: the chief AI scientist, on the record.
What Meta actually shipped on April 5, 2025 was this: Scout (109B total parameters, 17B active, 10-million-token context, fits a single H100 with Int4 quantization) and Maverick (400B total, 17B active, 128 experts, 1M context). Capable models. Genuinely efficient MoE architectures. The first open-weight native multimodal MoE, which is a real architectural milestone.
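The single-H100 claim for Scout can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes 4 bits per weight and the 80 GB H100 variant, and it counts weights only; KV cache, activations, and quantization metadata are left out, so real deployments need more headroom than this suggests. The function name is illustrative, not taken from any library.

```python
# Back-of-envelope weight memory at Int4.
# Assumptions: 4 bits (0.5 bytes) per parameter, 1 GB = 1e9 bytes,
# and an 80 GB H100. KV cache, activations, and quantization
# scales/zero-points are deliberately ignored.

H100_MEMORY_GB = 80  # assumption: 80 GB variant


def weight_memory_gb(total_params: float, bits_per_param: int = 4) -> float:
    """Approximate memory needed for the weights alone, in gigabytes."""
    return total_params * bits_per_param / 8 / 1e9


# Total parameter counts as quoted in the article
models = {"Scout": 109e9, "Maverick": 400e9}

for name, params in models.items():
    gb = weight_memory_gb(params)
    fits = gb < H100_MEMORY_GB
    print(f"{name}: ~{gb:.1f} GB of Int4 weights, fits one H100: {fits}")
```

Under these assumptions Scout's weights land well under 80 GB while Maverick's do not, which is consistent with only Scout being pitched as a single-GPU model.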
But the 1417 LMArena ELO, the number Meta led with and the one it used to claim a win over GPT-4o and Gemini 2.0 Flash, belonged to Maverick-03-26-Experimental, a chat-tuned variant never publicly released. The actual public weights dropped to roughly 32nd on LMArena within days. LMArena issued a policy rebuke, saying Meta's interpretation of its policy "did not match what we expect from model providers." Aider Polyglot benchmarks put Maverick at 16% on real coding tasks. DeepSeek V3 and Qwen 3.6 Coder weren't trailing close behind; they were clearly ahead.
Meta submitted a different model to win a benchmark, promoted the win as if the public weights had earned it, and called it a launch.
Why This Matters More Than the Scandal
Benchmark manipulation is embarrassing. What happened next is structural.
Llama 4 Behemoth, the 2-trillion-parameter teacher model Meta had been teasing and the one that was supposed to justify the whole architecture direction, was quietly shelved; the specific cause was never officially confirmed. In April 2026, Meta shipped Muse Spark: its first closed-weight, API-only frontier model. The Meta Superintelligence Labs org, which Zuckerberg has been filling with Thinking Machines Lab hires at reported packages reaching $1.5B, is not building open-weight models. Llama 5 is targeting the Ray-Ban wearables ecosystem. Its Hugging Face future carries Polymarket odds under 20%.
The company that built the open-source AI ecosystem — the one whose Llama 3 weights set the baseline that every fine-tuner, researcher, and budget-constrained startup has been working from for the past year — concluded it could no longer compete at the open-weight frontier on honest terms. So it stopped competing.
That's the bigger story.
Who Fills the Vacuum
Meta's retreat creates a power vacuum that three competitors are positioned to claim — none of them American.
Qwen (Alibaba) had the most credible open-weight alternative at Llama 4's launch, with Qwen 3.6 Coder significantly outperforming Maverick on real coding tasks. Whether Alibaba sustains open-weight releases at the frontier is an open question — the economics that pushed Meta toward closed models apply equally to any lab spending at scale on pretraining.
DeepSeek V4 is MIT-licensed and currently near the top of open-weight benchmarks. It was trained — at least in part — on Huawei Ascend 910C chips, not NVIDIA hardware, which is either a geopolitical resilience story or a performance ceiling story, depending on whether Ascend-trained models can close the performance gap over time.
Mistral remains genuinely open, Apache 2.0, and European. Small 4 (6B active parameters from a 119B total MoE) is arguably the most underrated architectural efficiency story in open-source right now. But Mistral is also raising capital at valuations that will eventually confront the same build-cost pressure that pushed Meta toward closed weights and that now hangs over Alibaba.
The open-weight ecosystem isn't collapsing. It's changing hands — to labs operating under different geopolitical constraints, different capital structures, and very different relationships with the US export control apparatus.
What the Benchmark Crowd Gets Wrong
I want to complicate the easy read here. Maverick's MoE efficiency is genuinely valuable. Seventeen billion active parameters from a 400B total model is a real number with real consequences: enterprises self-hosting Maverick on IBM and Dell infrastructure are paying dramatically less per inference token than closed API alternatives. The multimodal capability works. The architecture functions even when the headline number was gamed.
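To make the efficiency point concrete, here is a minimal sketch of the sparsity arithmetic using the parameter counts quoted in this article. It leans on a common rule of thumb, roughly 2 forward-pass FLOPs per active parameter per token, which is an assumption that ignores attention cost and expert-routing overhead. The function names are illustrative only.

```python
# Illustrative MoE sparsity arithmetic from the article's parameter counts.
# Assumption: forward-pass compute is ~2 FLOPs per active parameter per
# token (a rule of thumb; attention and routing overhead are ignored).


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the weights that participate in each token's forward pass."""
    return active_params / total_params


def approx_forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward FLOPs per token (an estimate, not a measurement)."""
    return 2.0 * active_params


# (active, total) parameter counts as stated in the article
models = {
    "Maverick": (17e9, 400e9),
    "Scout": (17e9, 109e9),
    "Mistral Small 4": (6e9, 119e9),
}

for name, (active, total) in models.items():
    frac = active_fraction(active, total)
    # Compute ratio versus a hypothetical dense model of the same total size
    dense_ratio = (
        approx_forward_flops_per_token(total)
        / approx_forward_flops_per_token(active)
    )
    print(f"{name}: {frac:.1%} of weights active per token, "
          f"~{dense_ratio:.1f}x less compute than a same-size dense model")
```

Note the asymmetry this exposes: per-token compute tracks the active count, but memory still tracks the total count, so a self-hoster has to hold all 400B weights even though only 17B fire per token.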
If Meta's strategic goal was seeding the ecosystem with a capable model that enterprises would build on, Maverick accomplished that, despite the benchmark fraud and the credibility crater that followed. Fortune 500 companies are running it in production. Scout's 10M-token context, even if it degrades badly past 128K in practice (Fiction.LiveBench measured Scout at 15.6% accuracy versus Gemini 2.5 Pro's 90.6% on the same long-context tests), is still more context than most deployment stacks use.
Meta's failure wasn't the model. It was the panic that led to submitting a non-public variant to win a leaderboard it couldn't honestly top, and then abandoning the strategy that leaderboard was supposed to defend.
Three Things That Would Change My Read
First: if Llama 5 ships as open weights with genuine frontier capability, I'll update toward "temporary retreat, strategic recalibration." Current odds suggest I won't need to update.
Second: if DeepSeek V4's Ascend-trained weights converge on H100 performance as Huawei improves 910C utilization, the open-source vacuum gets a protagonist with staying power — one who is structurally insulated from the economics that bent Meta.
Third: if LMArena publishes transparent submission rules that distinguish public weights from experimental variants, the benchmark integrity problem gets a systemic fix instead of a scandal-by-scandal reckoning. Given that LMArena's authority depends on trust, this seems like their problem to solve as much as any lab's.
Meta built the open-weight table. It got up and left — on the way out, it tried to steal one last headline. The question isn't whether it's coming back. It's whether the table survives the exit.
- https://ai.meta.com/blog/llama-4-multimodal-intelligence/
- https://www.aicerts.ai/news/llama-4-maverick-rethinking-model-evaluation-fairness/
- https://www.fastcompany.com/91469583/yann-lecun-meta-llama-4-model-zuckerberg
- https://the-decoder.com/metas-llama-4-models-show-promise-on-standard-tests-but-struggle-with-long-context-tasks/
- https://rootly.com/blog/llama-4-underperforms-a-benchmark-against-coding-centric-models