Benchmarking Anthropic's Secretive 'Mythos' Bug-Finder Against Rival Models

Anthropic has restricted access to Mythos, a model it claims is exceptionally good at finding security vulnerabilities, ostensibly because the capability is too dangerous to release broadly. The author is skeptical of that framing, suspects the real reason is operating cost, and built a benchmark to test whether the hype holds up. The method has four steps: collect bugs Mythos is documented to have found; locate the last commit before each fix; confirm that a top model (Opus 4.7) can understand each bug when pointed straight at it; then measure how well other models detect the same flaws with no hints. The corpus currently holds nine confirmed real-world bugs, all believed to post-date the models’ knowledge cutoffs. Each model is given the target file and basic tooling but is not told what to look for.
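
The distinction between the guided sanity check and the blind run can be sketched as follows. This is a minimal illustration, not the author's harness: the class, field names, prompt wording, and example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchCase:
    """One benchmark entry. All field names here are hypothetical."""
    repo_url: str        # repository containing the bug
    pre_fix_commit: str  # last commit before the fix landed
    target_file: str     # file handed to the model
    bug_summary: str     # ground truth; never shown in the blind run


def build_blind_prompt(case: BenchCase) -> str:
    """Blind prompt: names the file but says nothing about the bug."""
    return (
        f"Repository checked out at commit {case.pre_fix_commit}.\n"
        f"Audit the file {case.target_file} for security-relevant bugs.\n"
        "Report each finding with a location and a short explanation."
    )


def build_guided_prompt(case: BenchCase) -> str:
    """Sanity-check prompt: points the model straight at the known bug."""
    return (
        f"{build_blind_prompt(case)}\n"
        f"Hint: the known issue is: {case.bug_summary}\n"
        "Explain why this is a bug."
    )


# Placeholder values for illustration only; not a real case from the corpus.
example = BenchCase(
    repo_url="https://example.invalid/project.git",
    pre_fix_commit="abc1234",
    target_file="src/parser.c",
    bug_summary="length field is trusted before the bounds check",
)

blind = build_blind_prompt(example)
guided = build_guided_prompt(example)
```

Keeping the ground-truth description in a separate field that only the guided prompt reads makes it harder to leak a hint into the blind run by accident.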

The headline result is that every model performed worse than expected. These bugs, especially the multi-file ones that require cross-context reasoning, are genuinely hard to find. No model clearly replicated what Mythos reportedly does, which lends some credence to the claim that it has unusual tooling (fuzzing, debuggers) or capability. Surprisingly, open and local models held their own: Gemma 4 MoE detected 4 of 9 with perfect precision, beating Google’s flagship commercial offerings, though it benefited from repeated attempts after server crashes. The scores of apparent leaders like GPT 5.5 Pro and the Qwen models were inflated by incomplete runs (one exhausted a $100 budget after four cases). Running models inside full vendor agents didn’t improve results and usually cost far more, so only the Claude models stayed in an agent harness.
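
To make the scoring concrete, here is a small sketch of how precision and detection rate could be computed, and why an incomplete run can flatter a model. It assumes precision means true positives over all reported findings, which the summary does not spell out. The Gemma figures (4 hits, no false positives, 9 cases) come from the text above; the incomplete-run hit count is invented purely for illustration.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Share of reported findings that were real bugs."""
    reported = true_pos + false_pos
    return true_pos / reported if reported else 0.0


def detection_rate(true_pos: int, cases: int) -> float:
    """Share of benchmark cases in which the known bug was found."""
    if cases <= 0:
        raise ValueError("cases must be positive")
    return true_pos / cases


CORPUS_SIZE = 9

# Figures stated in the text: 4 of 9 detected, perfect precision.
gemma_precision = precision(true_pos=4, false_pos=0)
gemma_rate = detection_rate(true_pos=4, cases=CORPUS_SIZE)

# HYPOTHETICAL incomplete run: the hit count is made up; only the
# "stopped after four cases" part comes from the text.
hypothetical_hits = 2
attempted = 4
rate_over_attempted = detection_rate(hypothetical_hits, attempted)
rate_over_corpus = detection_rate(hypothetical_hits, CORPUS_SIZE)

print(f"Gemma 4 MoE: precision={gemma_precision:.2f}, rate={gemma_rate:.2f}")
print(f"Incomplete run: {rate_over_attempted:.2f} over attempted cases, "
      f"{rate_over_corpus:.2f} over the full corpus")
```

Scoring an incomplete run against only the cases it attempted gives a higher rate than scoring it against the full corpus, which is the kind of skew the author warns about. Counting unattempted cases as misses is itself a design choice, not something the source prescribes.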

The sharpest finding is operational rather than statistical: Google’s Antigravity CLI (agy) for Gemini refused to perform security analysis in eight of nine cases, flatly rejecting prompts even after the author softened terms like ‘exploitable’ and ‘vulnerable.’ The author had to pay separately for Google AI Studio API access despite holding a subscription that should have covered it, and concluded that Antigravity is simply not fit for security work. The author stresses that the data is sparse (a single run per bug per model) and proves nothing definitive, but it offers an early, useful signal that bug-finding ability varies widely across models and that guardrails can make some tools useless for legitimate security auditing.
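
One way to see how little nine cases can establish is to put a confidence interval around a detection rate. The sketch below uses the standard Wilson score interval; applying it here is an illustration of the sparsity point, not an analysis the author performed. It also treats the nine bugs as independent trials, which is an assumption, and it does not capture run-to-run variance within a single bug.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


# 4 detections out of 9 cases, the best open-model result quoted in the text.
low, high = wilson_interval(4, 9)
print(f"4/9 detected: 95% interval roughly {low:.2f} to {high:.2f}")
```

For 4 of 9 the interval runs from roughly 0.19 to 0.73, wide enough that most of the reported differences between models could plausibly be noise.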