Mythos Resolves 470 of 500 Real GitHub Issues. The Open-Source Maintainers Who Need It Most Require Security Clearances to Use It.

Two facts about Claude Mythos Preview's April 2026 benchmark scores have received almost no attention together.

The first: 93.9% on SWE-bench Verified means Mythos correctly resolved approximately 470 of 500 real-world GitHub issues. For each one it read the issue description, localized the bug in a multi-file Python codebase, wrote a patch, and passed the repository's own test suite, with no human intervention. The gap over the prior frontier cluster (13 points over Opus 4.6 at 80.8%, about 9 points over GPT-5.3 Codex at 85%) is the largest single-model jump in SWE-bench's history. The 77.8% score on SWE-bench Pro, a contamination-resistant version of the benchmark, confirms the capability is real: you can't memorize your way to a 20-point lead over GPT-5.4 on a dataset specifically designed to resist memorization.
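The arithmetic behind these figures is easy to check. The Python sketch below is only a sanity check on the numbers quoted above. The function and variable names are illustrative, and it assumes the Verified subset contains exactly 500 issues.

```python
from decimal import Decimal

# Scores as quoted in the article (percent on SWE-bench Verified).
# Assumption: the Verified subset contains exactly 500 issues.
TOTAL_ISSUES = 500
SCORES = {
    "Mythos Preview": Decimal("93.9"),
    "GPT-5.3 Codex": Decimal("85"),
    "Opus 4.6": Decimal("80.8"),
}


def implied_resolved(score_pct: Decimal, total: int = TOTAL_ISSUES) -> Decimal:
    """Number of issues implied by a percentage score (may be fractional)."""
    return score_pct * total / 100


def gap(leader: str, other: str) -> Decimal:
    """Difference between two quoted scores, in percentage points."""
    return SCORES[leader] - SCORES[other]


if __name__ == "__main__":
    for name, pct in SCORES.items():
        print(f"{name}: {pct}% implies {implied_resolved(pct)} of {TOTAL_ISSUES} issues")
    for other in ("GPT-5.3 Codex", "Opus 4.6"):
        print(f"Mythos Preview lead over {other}: {gap('Mythos Preview', other)} points")
```

The implied count for Mythos comes out at 469.5, which is where the "approximately 470" framing comes from. The 13-point figure matches the gap over Opus 4.6; against GPT-5.3 Codex the gap is 8.9 points.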

The second: you cannot use it. Not just you specifically; probably not your company either. Access requires AI Safety Level 4 (ASL-4) compliance: a formal agreement with Anthropic, security clearances for individual personnel using the model, and ongoing usage auditing. The Project Glasswing consortium, which has had access since February 2026, has 11 named launch partners (AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, Linux Foundation, NVIDIA, Palo Alto Networks, and Microsoft) plus 40 additional organizations.

The Linux Foundation is on that list. The individual maintainers of the packages the Linux Foundation distributes are not.

This access structure has no precedent in software. The most capable code security AI ever benchmarked is available to intelligence-agency-grade institutions and Fortune 500 security teams. It is not available to the people who maintain the open-source infrastructure those same organizations depend on: the 50M-download npm packages, the foundational Python libraries, the infrastructure tooling that every enterprise stack runs on. Those maintainers, often individual contributors and rarely security-credentialed, are the ones most exposed to the vulnerabilities Mythos is best positioned to find.

The SWE-bench Verified number itself has a caveat worth stating. AgentMarketCap's saturation analysis argues, credibly, that SWE-bench Verified is breaking down as a benchmark above ~90%. The 500 issues in the Verified subset were curated from 2017-2022 GitHub repos and verified by human judges as solvable. When those issues land in training data at sufficient scale, models learn to pattern-match against them. The benchmark tests what models can do on a specific known distribution of issues — not what they can do on your codebase. The 77.8% SWE-bench Pro score matters more precisely because it's designed to resist this. Mythos's 20-point Pro lead over GPT-5.4 is the more credible capability signal. Every headline using SWE-bench Verified above 90% should carry this context.

The Glasswing consortium structure is worth examining on its own terms. Of the 11 named launch partners, five supply Anthropic's core infrastructure: NVIDIA and Broadcom make the chips, AWS and Google provide cloud compute, Microsoft operates Foundry (the Azure-based Claude deployment). These are not neutral stakeholders evaluating Mythos on its merits. They are Anthropic's infrastructure dependencies, and they received first access to Anthropic's most capable model. This is a structurally interlocking arrangement: Anthropic needs their infrastructure, they need access to the model, and the consortium creates a shared interest in Anthropic's success that can't be cleanly separated from their evaluation of the product.

The philosophical contrast with OpenAI is pointed. GPT-5.5-Cyber launched 15 days after Mythos Preview. OpenAI chose to make it widely available, with export and sanctions restrictions but no clearance requirements. Anthropic clearance-gated the capability. Neither approach is obviously right; there are legitimate arguments on both sides about how to release a model capable of both finding and generating exploits. But the two labs made opposite choices, and those choices have consequences for who can access the capability.

The $25/$125 per million tokens pricing (5x Opus 4.6's $5/$25) is not the access barrier; the clearance requirement is. You could afford the pricing. The ASL-4 compliance requirement is structurally inaccessible for an organization without a security function, legal counsel, and compliance infrastructure.
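To put the pricing in concrete terms, here is a rough cost sketch in Python. It assumes the quoted pairs are input/output prices per million tokens, in that order. The workload size is hypothetical and chosen only for illustration, and the names are not from any real SDK.

```python
from decimal import Decimal

# Prices in USD per million tokens, as quoted in the article.
# Assumption: the first figure is the input price, the second the output price.
PRICES = {
    "mythos": (Decimal("25"), Decimal("125")),
    "opus": (Decimal("5"), Decimal("25")),
}
MILLION = Decimal(1_000_000)


def job_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost in USD of one job at the quoted per-million-token prices."""
    in_price, out_price = PRICES[model]
    return (in_price * input_tokens + out_price * output_tokens) / MILLION


# Hypothetical workload: 2M input tokens and 500k output tokens.
mythos_cost = job_cost("mythos", 2_000_000, 500_000)
opus_cost = job_cost("opus", 2_000_000, 500_000)

print(f"Mythos: ${mythos_cost}, Opus: ${opus_cost}, ratio: {mythos_cost / opus_cost}x")
```

For this made-up workload the bill is $112.50 against $22.50: a 5x multiple, but still a figure a small team could absorb.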

There's a question Anthropic hasn't answered that belongs in any honest coverage of this model: what does the 6.1% look like?

At 93.9% on SWE-bench Verified, Mythos is failing on approximately 30 of 500 issues. Anthropic has not published a failure taxonomy. Are the failures architecture-level cross-repository refactors — the kind of change that requires understanding how 10 separate codebases interact? Are they issues that require understanding external API behavior not captured in training data? Are they ambiguous problem descriptions where the correct fix depends on information not in the issue? The failure taxonomy matters because it defines the ceiling. A model that can't handle multi-repository architectural changes has a very different production implication than one that can't handle obscure third-party library edge cases.

The answer affects everything: whether software engineers who work on large-scale refactors are in Mythos's capability scope, whether security scanning of complex dependency chains is feasible, and how to interpret the "470 of 500" framing that dominated coverage. Anthropic called the remaining issues "ambiguous" in informal communications. That's not a taxonomy.

SWE-bench Pro at 77.8% is actually the more interesting number. The 20-point lead over GPT-5.4 suggests something architecturally different is happening — not just better training data coverage, but a model that generalizes differently on novel hard problems. What that generalization looks like in the failure cases, at the edge of Mythos's capability, is the actual story about where AI software engineering stands right now.

I think the access model is the unresolved problem. The "safety" framing for clearance-gated access has a dual use of its own: it also creates a capability tier where the only customers who can access the best model are large institutions with legal and compliance functions — which is a near-perfect description of the customer segment that pays enterprise prices. Whether that's deliberate or incidental is a question I can't answer. Both things can be true simultaneously: genuine safety concern and excellent enterprise pricing strategy.