OpenAI Built a Better Cyber Model Than the One the Government Pulled Offline. BIS Hasn't Called.
GPT-5.5-Cyber scored 85.6% on CyberGym — the benchmark that helped trigger Anthropic's export control suspension when Mythos 5 hit 84.3%. OpenAI's model is more capable on the same test. No BIS directive has arrived. That gap is either a policy failure or proof that CyberGym wasn't the real reason Anthropic got hit.
Thirteen days ago, the Bureau of Industry and Security handed Anthropic a directive that took its two flagship models offline globally. One cited factor: Mythos 5's cybersecurity capabilities, as evidenced partly by its ~84.3% score on CyberGym — a 500-challenge benchmark developed by MITRE with CISA and NCSC input.
On June 22, OpenAI released GPT-5.5-Cyber. It scored 85.6% on CyberGym, the highest score ever recorded by a single model on that benchmark and 1.3 percentage points above the model that just got export-controlled.
BIS has not called.
What GPT-5.5-Cyber actually is
GPT-5.5-Cyber is the third component of Daybreak, OpenAI's invitation-only cybersecurity access tier. It is a fine-tuned version of GPT-5.5 calibrated for security tasks. The capability uplift over the base model is real: the ExploitGym score jumped from 25.95% (GPT-5.5) to 39.5%, a relative increase of about 52% (13.55 percentage points) in offensive benchmark performance. The UK's AI Safety Institute found that GPT-5.5, the base model, could escalate from web shell access to full domain admin in under 19 minutes, compared to 4.2 hours for human red teamers. GPT-5.5-Cyber adds capability on top of that baseline.
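For readers who want to check the arithmetic behind those figures, here is a minimal sketch. It uses only the numbers quoted above; the variable names are my own, and because the 19-minute figure is an upper bound ("under 19 minutes"), the speedup it yields is a floor rather than an exact ratio.

```python
# Figures quoted in the article; variable names are illustrative only.
base_exploitgym = 25.95   # GPT-5.5 ExploitGym score, percent
cyber_exploitgym = 39.5   # GPT-5.5-Cyber ExploitGym score, percent

# Absolute gain in percentage points vs. gain relative to the baseline.
point_gain = cyber_exploitgym - base_exploitgym
relative_gain = point_gain / base_exploitgym * 100

# Time to domain admin: human red teamers vs. the base model.
human_minutes = 4.2 * 60   # 4.2 hours converted to minutes
model_minutes = 19         # upper bound: the article says "under 19 minutes"
speedup_floor = human_minutes / model_minutes

print(f"ExploitGym gain: {point_gain:.2f} points ({relative_gain:.0f}% relative)")
print(f"Domain-admin speedup: at least {speedup_floor:.1f}x")
```

The 52% figure is a relative gain, not a 52-point jump, and the base model's time advantage over human red teamers works out to at least roughly 13x.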
The Trusted Access gate — requiring verified enrollment, phishing-resistant authentication, and Advanced Account Security — is meaningful. It is not a hard limit. AISI, in its evaluation of GPT-5.5's safeguards, found a universal jailbreak that unlocked violative content across all malicious cyber queries.
The Patch the Planet initiative is real
OpenAI partnered with Trail of Bits, one of the most credible security research firms in the industry, to run a five-day sprint across 14+ open-source projects, including cURL, Python, Go, and Sigstore. The model surfaced 64 pull requests and 51 issues. Trail of Bits engineers review every AI finding before it reaches project maintainers, so nothing arrives unvalidated. In week one, the initiative found real bugs in infrastructure that millions of organizations depend on.
I don't want to dismiss this. The human-in-the-loop design is genuine, the target list is consequential, and a merged patch to cURL or the Python standard library defends more machines than any single enterprise security product. If Patch the Planet scales to 100+ projects and sustains a meaningful patch merge rate, it will have done something measurable.
But Patch the Planet is not the interesting question. The interesting question is what BIS is doing about a model that scores higher on the benchmark it used — or appeared to use — to justify pulling Anthropic's models offline.
The CyberGym threshold question
Windows News analysis flagged this directly: CyberGym scores may now function as a de facto regulatory trigger for export controls on advanced AI cybersecurity models. The reasoning is circumstantial but structurally sound. Mythos 5 scored approximately 84.3% on CyberGym pre-release. Anthropic gave the US government advance access via Project Glasswing — exactly the cooperative posture the Trump AI executive order was designed to encourage. BIS served a directive anyway, citing cybersecurity capability concerns alongside the jailbreak. Mythos 5 is on day 13 offline.
GPT-5.5-Cyber scored 85.6%. Same benchmark. Higher score. No directive.
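It helps to see how narrow that gap is in concrete terms. The sketch below converts the two scores into challenge counts, using the 500-challenge size cited earlier. It assumes every challenge counts equally toward the headline score, which the published figures do not confirm, and the Mythos 5 score is itself approximate.

```python
TOTAL_CHALLENGES = 500   # CyberGym size as stated in the article

mythos_score = 84.3      # percent, approximate per the article
gpt_cyber_score = 85.6   # percent

# Assumption: each challenge carries equal weight in the headline score.
gap_points = gpt_cyber_score - mythos_score
gap_challenges = gap_points / 100 * TOTAL_CHALLENGES

print(f"Gap: {gap_points:.1f} points, "
      f"about {gap_challenges:.1f} of {TOTAL_CHALLENGES} challenges")
```

Under that assumption, the difference between the suspended model and the deployed one is on the order of six or seven challenges out of 500.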
There are three possible explanations.
One: CyberGym was not actually central to the BIS decision on Mythos 5. The jailbreak — a specific technique that unlocked Mythos 5's full Glasswing-tier capabilities through a commercial API — was the actual trigger, and CyberGym was a supporting data point. In that case, the benchmark comparison is misleading, but BIS has never clarified this publicly.
Two: CyberGym is a threshold, but GPT-5.5-Cyber's access controls are meaningfully stronger than Mythos 5's were. The Daybreak Trusted Access program is more restrictive than Anthropic's commercial API tier. In that case, BIS may have evaluated GPT-5.5-Cyber and decided the access model is sufficient. But BIS has not said this either.
Three: BIS applied its export control authority selectively — against Anthropic and not against OpenAI — for reasons that have nothing to do with CyberGym scores or access controls. This is the explanation that no one in government wants to say out loud, and the one that House lawmakers challenging the statutory basis of the Anthropic directive may find most useful.
Anthropic's dispute, and what it actually means
Anthropic's response to the CyberGym comparison deserves attention: the company claims CyberGym "overweights Windows-specific challenges — a design choice that advantages models trained on Microsoft's ecosystem." OpenAI has the deepest Microsoft relationship in the frontier AI industry: Azure, GitHub Copilot, Codex. If the benchmark that helped justify Mythos 5's suspension is structurally biased toward Windows environments where GPT-5.5's training data is richest, the 85.6% result may not generalize to Linux/cloud-native attack surfaces. It also means BIS used a benchmark that, if Anthropic's critique is right, was effectively pre-optimized for OpenAI's architecture.
I think the Anthropic critique is at least partially motivated reasoning. But "partially motivated" does not mean wrong.
What needs to happen
Someone needs to ask BIS directly, in writing: does GPT-5.5-Cyber require an export license for API access by foreign nationals under ECRA § 4817(b)(1) and EAR § 744.22(b), the same provisions invoked against Anthropic? The question has a binary answer. If yes, OpenAI is operating under the same constraints as Anthropic and should be required to disclose this. If no, BIS needs to explain what distinguishes GPT-5.5-Cyber at 85.6% CyberGym from Mythos 5 at 84.3% CyberGym in a way that justifies the asymmetric treatment.
There is a third outcome: BIS does not respond, or responds with a non-answer. That is also informative. It would suggest the agency is administering export control authority over frontier AI models without a coherent, articulable standard — which is its own story.
The NSA and CISA directors issued a joint statement on June 23 — one day after Patch the Planet launched — warning that AI-enabled attacks are "months, not years" away. Whether that was coordinated with OpenAI's announcement or coincidental is not confirmed. If coordinated, it suggests the national security community is comfortable with GPT-5.5-Cyber's deployment under current access controls. If coincidental, it is a remarkable piece of timing that nobody in press coverage has connected to the export control question.
Patch the Planet surfaced 64 pull requests and 51 issues in a week. GPT-5.5, the base model underneath GPT-5.5-Cyber, can escalate to domain admin in under 19 minutes. Both are true. The same model family. The government pulled the model with a lower CyberGym score offline. The higher-scoring one is patching cURL.