Muse Spark Is Extraordinary at Medicine and Mediocre at Novel Reasoning. Meta Called It Personal Superintelligence.

When Meta launched Muse Spark on April 8, coverage led with one number: HealthBench Hard 42.8, more than double Gemini 3.1 Pro's 20.6 and nearly triple Claude Opus 4.6's 14.8. A frontier model that destroys the competition on healthcare reasoning. The coverage was not wrong to lead with that number. It just forgot to mention the other one.

Muse Spark scored 42.5 on ARC-AGI-2.

Gemini 3.1 Pro scored 76.5. GPT-5.4 scored 76.1. The gap is roughly 34 points against either one. ARC-AGI-2 is designed specifically to resist benchmark saturation: it measures the ability to solve novel problems using combinations of concepts never seen in training. It is, deliberately, the benchmark that is hardest to game with scale. A 34-point deficit means something real.
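The arithmetic behind those comparisons is easy to check. The sketch below is illustrative only: it takes the scores exactly as quoted in this piece (they are not independently verified here), and the helper names ratios_to and gaps_from are made up for the example.

```python
# Scores as quoted in this article; treat them as reported, not verified.
healthbench_hard = {
    "Muse Spark": 42.8,
    "Gemini 3.1 Pro": 20.6,
    "Claude Opus 4.6": 14.8,
}
arc_agi_2 = {
    "Muse Spark": 42.5,
    "Gemini 3.1 Pro": 76.5,
    "GPT-5.4": 76.1,
}


def ratios_to(scores, model):
    """How many times higher `model` scores than each other entry."""
    base = scores[model]
    return {name: round(base / s, 2) for name, s in scores.items() if name != model}


def gaps_from(scores, model):
    """How many points each other entry scores above `model`."""
    base = scores[model]
    return {name: round(s - base, 1) for name, s in scores.items() if name != model}


healthbench_ratios = ratios_to(healthbench_hard, "Muse Spark")
arc_gaps = gaps_from(arc_agi_2, "Muse Spark")

print(healthbench_ratios)  # Muse Spark's multiple over each rival on HealthBench Hard
print(arc_gaps)            # each rival's point lead over Muse Spark on ARC-AGI-2
```

On the quoted figures, the HealthBench Hard lead works out to about 2.1x over Gemini 3.1 Pro and about 2.9x over Claude Opus 4.6, while the ARC-AGI-2 deficit is 34.0 points to Gemini 3.1 Pro and 33.6 to GPT-5.4.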

Meta described Muse Spark as a step toward "personal superintelligence." If you're grading on the benchmark designed to measure whether a model can think outside its training, Muse Spark is the weakest of the frontier models compared here.

These two results, extraordinary at domain-specific healthcare and weak at novel generalization, are probably not independent. The physician data curation pipeline is the most notable operational detail in the Muse Spark launch: more than 1,000 physicians annotated and curated health training data. This approach is Alexandr Wang's import. Wang came from Scale AI, which operationalizes exactly this kind of expert data annotation at industrial scale. The 42.8 on HealthBench Hard is not an architecture story. It's a data story. You hire 1,000 doctors to curate your training data, and your model gets good at medicine. The mechanism is not mysterious.

The same mechanism may explain the ARC-AGI-2 weakness. When you weight training heavily toward domain expertise — structured clinical knowledge, physician-validated reasoning patterns, health data with explicit right answers — you may inadvertently crowd out the kind of flexible, novel-concept reasoning ARC-AGI-2 tests. It's a reasonable hypothesis. It hasn't been confirmed. But the 34-point gap is large enough that it demands a hypothesis.

The closed-weight decision is the strategic context. Meta built its entire AI identity on open-source: Llama 1 through Llama 4 Scout and Maverick, all released with model weights. Zuckerberg spent years explicitly positioning Meta as the alternative to the closed-model strategy of OpenAI and Anthropic. Muse Spark is the first model in Meta's history you cannot download and run yourself. It launched free on meta.ai for consumers; the API is private preview only, with no pricing announced.

The forcing function was Behemoth. Llama 4 Behemoth, the ~2-trillion-total-parameter flagship that was supposed to be Meta's answer to GPT-5, was effectively shelved mid-training after MoE-routing switch issues disrupted expert specialization and chunked-attention blind spots emerged at scale. It was never released. Never formally cancelled. It just stopped being mentioned. Behemoth's failure left Meta without a frontier open-weight model to ship, and rather than try again at 2T scale, Meta spent nine months building Muse Spark on a different approach: a rebuilt pretraining stack, closed weights, and a consumer-first release.

Meta claims the new stack runs at "over an order of magnitude" less compute than Llama 4 Maverick. This is an extraordinary claim. If accurate, it would be a compute efficiency achievement of DeepSeek-R1 scale — one of the most significant infrastructure advances of the past two years. Meta's blog states it as a fact with no methodology. No coverage verified it. I'd treat this as a placeholder for "the new architecture is significantly more efficient" until someone runs the numbers.

The HealthBench choice is worth examining on its own terms. HealthBench was designed by OpenAI. Meta decided to launch its first closed-weight model by leading with a benchmark its direct competitor built, on which it beat that competitor by 2.7 points (42.8 vs GPT-5.4's 40.1). That is either confidence or provocation, and it's probably both. None of the coverage noted the irony.

Meta Superintelligence Labs, the team behind Muse Spark, was formed in June 2025 as a reorganization of Meta's AI efforts under Wang. It's a different org from FAIR, the research division that produced most of Llama's underlying work. Muse Spark is MSL's first product. The institutional context matters: this is a new team, a new mandate, and a new model architecture, all shipping simultaneously. The nine-month timeline is fast.

What Muse Spark is: the best healthcare AI model on the current frontier, consumer-accessible, built with a physician-curation data strategy that smaller competitors can't easily replicate. That is a real and durable competitive position.

What it isn't: a general-purpose reasoning model that can compete with GPT and Gemini on novel generalization. The ARC-AGI-2 gap says so clearly. Whether that matters depends on what Meta is actually trying to build: a vertical health product, or the general-purpose "personal superintelligence" it described in the launch announcement.

Those two things aren't the same. Based on the benchmarks, Meta has the first one.