Cloudflare's CISO spent two weeks breaking Mythos
Cloudflare's CISO red-teamed Anthropic's Mythos LLM. The findings on harness design, memory persistence, and tool allowlists matter more than the model itself.
Cloudflare’s CISO spent two weeks running Anthropic’s Mythos through a production-style red team. The writeup is short, technical, and not flattering in the places where it matters. It’s also not the disaster some headlines made it out to be. The interesting part is which controls held, which didn’t, and what that says about deploying frontier models inside a regulated network.
Below is what the review actually surfaced, translated out of vendor language, and what a defender should do about it before Mythos shows up in a procurement deck.
What Mythos Actually Is
Mythos is Anthropic’s first model marketed primarily on agentic workloads - long-horizon tool use, file system access, browser control, and persistent memory across sessions. It’s not a chatbot with extra steps. The threat surface is different because the model is allowed to take actions, not just generate text.
That matters for one reason. A prompt injection against a chatbot produces a bad sentence. A prompt injection against an agent produces a bad action. The Cloudflare review is mostly about that second category, and most defenders are still budgeting for the first.
The Tooling Setup Mattered More Than the Model
The review ran Mythos in three configurations: vanilla API, Anthropic’s reference agent harness, and a Cloudflare-internal harness with their own sandboxing. The results across the three were not close.
Vanilla API: behaves roughly as advertised, refusal rates within the published bands.
Reference harness: instruction hierarchy degrades when tool outputs contain adversarial content. Specifically, when a webpage Mythos fetches contains a hidden instruction block, the model treats it as roughly equivalent in priority to the developer system prompt about 8% of the time across the test set.
Internal harness: with the same model and the same prompts, the injection rate drops below 1%, because Cloudflare wrapped tool outputs in explicit untrusted-content tags and ran a smaller classifier in front of every tool result.
The lesson isn’t that Mythos is broken. The lesson is that the model is one component, and the harness around it is doing most of the security work. Most teams deploying Mythos will use something closer to the reference harness because building your own is expensive. They’ll inherit the 8% number and not know it.
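To make the internal-harness idea concrete, here is a minimal Python sketch of wrapping a tool result before it reaches the model. Everything in it is an assumption for illustration: the review does not publish Cloudflare's tag format or classifier, so the tag name `untrusted_tool_output` is hypothetical and the regex check is only a toy stand-in for a real classifier.

```python
import html
import re

# Hypothetical tag name; the review does not say what Cloudflare actually used.
TAG = "untrusted_tool_output"

# Toy stand-in for the "smaller classifier": a few regex heuristics only.
SUSPICIOUS = re.compile(
    r"ignore (all |any )?(previous|prior) instructions|system prompt|you are now",
    re.IGNORECASE,
)


def looks_like_injection(text: str) -> bool:
    """Return True if the text matches one of the toy injection heuristics."""
    return bool(SUSPICIOUS.search(text))


def wrap_tool_output(tool_name: str, raw: str) -> str:
    """Wrap a tool result in an explicit untrusted-content tag."""
    if looks_like_injection(raw):
        # Withhold flagged content instead of passing it to the model.
        body = "[tool output withheld: possible embedded instructions]"
    else:
        # Escape &, < and > so the content cannot close the wrapper tag itself.
        body = html.escape(raw, quote=False)
    safe_name = html.escape(tool_name, quote=True)
    return f'<{TAG} tool="{safe_name}">\n{body}\n</{TAG}>'
```

The escaping step matters as much as the tag: without it, a fetched page could include its own closing tag and place text outside the untrusted region.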
Memory Persistence Is the Quiet Problem
Mythos ships with a persistent memory feature where the model can write notes to itself across sessions. The review found three issues here that aren’t in the model card.
First, memory writes don’t pass through the same safety classifiers as user-facing outputs. An attacker who gets one successful injection can write content to memory that influences sessions days later, including sessions belonging to other users if the deployment shares a memory pool.
Second, there’s no built-in retention policy. Memory grows until the operator manually prunes it. Cloudflare’s test deployment accumulated 340MB of model-generated notes in twelve days, including verbatim copies of documents the model had been asked to summarize. That’s a data classification problem nobody is set up to handle.
Third, the memory contents are stored as plain text in whatever backend the operator configures. The default examples in Anthropic’s docs use a local SQLite file. If your threat model includes someone reading files on the host running the agent, your threat model now includes them reading every document the agent has ever processed.
Fix order: disable memory by default, enable per-tenant with explicit retention windows, encrypt at rest, run the same content filters on writes that you run on outputs.
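A rough sketch of that fix order in Python, using the standard library's sqlite3 module. The class name `TenantMemory`, the table schema, and the filter callback are all hypothetical, not anything from Anthropic's docs or the review. Encryption at rest is deliberately left out because it depends on your storage backend and key management.

```python
import sqlite3
import time


class TenantMemory:
    """Per-tenant memory: off by default, filtered on write, pruned on read."""

    def __init__(self, conn, content_filter, retention_seconds):
        self.conn = conn
        self.content_filter = content_filter  # same filter you run on outputs
        self.retention_seconds = retention_seconds
        self.enabled = set()  # memory is disabled until a tenant opts in
        conn.execute(
            "CREATE TABLE IF NOT EXISTS notes ("
            "tenant TEXT NOT NULL, created REAL NOT NULL, body TEXT NOT NULL)"
        )

    def enable(self, tenant):
        self.enabled.add(tenant)

    def write(self, tenant, body, now=None):
        # Reject writes for tenants without memory, or content the filter blocks.
        if tenant not in self.enabled or not self.content_filter(body):
            return False
        now = time.time() if now is None else now
        self.conn.execute("INSERT INTO notes VALUES (?, ?, ?)", (tenant, now, body))
        return True

    def read(self, tenant, now=None):
        if tenant not in self.enabled:
            return []
        now = time.time() if now is None else now
        # Enforce the retention window before returning anything.
        self.conn.execute(
            "DELETE FROM notes WHERE created < ?", (now - self.retention_seconds,)
        )
        rows = self.conn.execute(
            "SELECT body FROM notes WHERE tenant = ? ORDER BY created", (tenant,)
        ).fetchall()
        return [row[0] for row in rows]
```

Filtering reads by tenant is what prevents the shared-pool problem: a note written during one tenant's session is never returned in another's.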
The Tool Allowlist Is Doing Real Work
The review’s most useful section is on tool design. Mythos will call any tool you give it. The model is not the access control layer. Three patterns broke during testing:
A read_file tool with no path restriction. The model, when prompted by an injected webpage, read /etc/passwd and exfiltrated it via a subsequent http_get call. Neither tool was misbehaving. The composition was the vulnerability.
A send_email tool that accepted arbitrary recipients. After a multi-step injection chain, the model emailed a summary of an internal document to an attacker-controlled address. The model thought it was being helpful.
A run_shell tool gated by a regex allowlist. The regex was bypassed using shell features the allowlist author hadn’t considered (command substitution inside what looked like a grep call).
None of these are model failures. They are tool design failures that the model happily exposed. The defensive posture is the same posture you’d take with any untrusted automation: least privilege per tool, explicit allowlists on destinations, no shell execution without a sandbox, and assume the model will eventually be tricked into calling every tool you give it.
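The first two failures come down to argument checks that live in the tool, not in the model. A minimal Python sketch follows, assuming a hypothetical workspace root, host allowlist, and recipient domain; none of these values come from the review. It needs Python 3.9 or later for `Path.is_relative_to`. Shell execution is not covered here, since the regex case above is a reason to sandbox rather than to write a better regex.

```python
from pathlib import Path
from urllib.parse import urlsplit

# Hypothetical values for illustration only.
ALLOWED_ROOT = Path("/srv/agent-workspace")
ALLOWED_HOSTS = {"api.internal.example"}
ALLOWED_RECIPIENT_DOMAINS = {"example.com"}


class ToolDenied(Exception):
    """Raised when a tool argument falls outside its allowlist."""


def check_read_path(requested: str) -> Path:
    # Resolve ".." segments and symlinks first, then require the result to
    # stay under the workspace root. An absolute path replaces the root.
    resolved = (ALLOWED_ROOT / requested).resolve()
    if not resolved.is_relative_to(ALLOWED_ROOT.resolve()):
        raise ToolDenied(f"path outside workspace: {requested}")
    return resolved


def check_url(url: str) -> str:
    # Allow only https requests to explicitly listed hosts.
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.hostname not in ALLOWED_HOSTS:
        raise ToolDenied(f"destination not allowed: {url}")
    return url


def check_recipient(address: str) -> str:
    # Allow only recipients in explicitly listed domains.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or domain.lower() not in ALLOWED_RECIPIENT_DOMAINS:
        raise ToolDenied(f"recipient not allowed: {address}")
    return address
```

With checks like these, the /etc/passwd chain fails twice: once at the read and again at the outbound request, so the composition no longer depends on the model resisting the injection.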
Refusal Behavior Is Inconsistent Under Load
One finding that surprised the review team: Mythos’s refusal rate on dangerous-capability prompts drops when the context window is heavily loaded with legitimate work. With 4K tokens of prior conversation, the model refused 96% of red-team prompts in the standard set. With 180K tokens of legitimate prior conversation, it refused only 81% of the same prompts.
The model isn’t being jailbroken in the classic sense. It’s that safety behaviors appear to be more brittle when the model is deep into a complex task. This matches what’s been published about long-context degradation in other capabilities, but it’s the first time I’ve seen it documented for refusals specifically.
Operationally: if you’re running Mythos in long-horizon agent loops, run an out-of-band classifier on outputs. Don’t rely on the model’s own refusal as the last line of defense. The Cloudflare team added a small classifier that scored every tool call against a policy and rejected violations outright before the call executed. That caught about half of what got through the model’s own refusals.
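The shape of that control is simple even if the classifier itself is not. Here is a minimal Python sketch of a gate that scores a proposed tool call and rejects it before anything runs. The names `ToolCall`, `guarded_execute`, and `toy_policy_score` are hypothetical, the scoring rules are a toy stand-in for a trained classifier, and the 0.5 threshold is arbitrary.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class PolicyViolation(Exception):
    """Raised when a proposed tool call is rejected before execution."""


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: Dict[str, Any]


def toy_policy_score(call: ToolCall) -> float:
    """Return a risk score in [0, 1]. Toy rules, not a real classifier."""
    if call.name == "run_shell":
        return 1.0  # assumption: shell is never allowed outside a sandbox
    recipient = str(call.args.get("to", ""))
    if call.name == "send_email" and not recipient.endswith("@example.com"):
        return 0.9  # assumption: only one internal domain is permitted
    return 0.0


def guarded_execute(
    call: ToolCall,
    tools: Dict[str, Callable[..., Any]],
    scorer: Callable[[ToolCall], float] = toy_policy_score,
    threshold: float = 0.5,
) -> Any:
    """Score the call out-of-band; execute it only if it passes the policy."""
    score = scorer(call)
    if score >= threshold:
        raise PolicyViolation(f"{call.name} rejected (score={score:.2f})")
    return tools[call.name](**call.args)
```

The point of the design is ordering: the check sees the concrete tool name and arguments, and it runs before the side effect, so it does not depend on how much context the model is carrying.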
What the Review Got Right and What It Missed
What it got right: the focus on harness, tools, and memory rather than the model in isolation. Most LLM security writing is still about extracting bad sentences from chatbots. This review is about an agent that can take actions, and the analysis follows the actual blast radius.
What it missed: supply chain. Mythos was tested as a hosted API. No discussion of what happens when the model is fine-tuned, when the weights are pulled into a different runtime, or when third-party MCP servers are added to the tool list. The MCP ecosystem is now where most of the interesting attack surface lives, and the review doesn’t address it.
Also missing: cost-based denial of service. An attacker who can influence prompts can run up an enormous bill. The review notes this in passing but doesn’t quantify it. In a test I ran separately on a similar agent setup, a single injected webpage caused a 40x increase in tokens consumed for a routine task. That’s a budget problem, an availability problem, and in some pricing models, a billing fraud problem.
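One cheap mitigation is a hard per-task token budget set as a multiple of what the task normally costs. A minimal Python sketch, with a hypothetical `TokenBudget` class and an arbitrary 5x multiplier; how you obtain per-step token counts depends on your provider's usage reporting, which is not shown here.

```python
class BudgetExceeded(Exception):
    """Raised when a task would exceed its token budget."""


class TokenBudget:
    """Hard cap on tokens for one task, as a multiple of a known baseline."""

    def __init__(self, baseline_tokens: int, max_multiplier: float = 5.0):
        self.limit = int(baseline_tokens * max_multiplier)
        self.used = 0

    def charge(self, tokens: int) -> None:
        # Refuse the step before spending, so the cap is never overshot.
        if self.used + tokens > self.limit:
            raise BudgetExceeded(
                f"step of {tokens} tokens would exceed limit of {self.limit}"
            )
        self.used += tokens
```

Call `charge` before each model step in the agent loop and abort the task on `BudgetExceeded`. A 40x blowup then stops at the multiplier you chose instead of at the end of the billing cycle.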
What to Do Before Mythos Shows Up in Your Stack
Four things, in order:
One. Inventory every place an LLM agent could be deployed in your environment over the next year. Procurement, customer support, developer tooling, internal search. Each one is a different threat model. Don’t write a single AI policy. Write one per deployment pattern.
Two. Define the tool allowlist before you pick the model. The security properties of an agent are mostly determined by the tools it can call, not by which vendor’s model is behind it. If your team can’t articulate which tools the agent gets and what the destination allowlist looks like for each one, the model choice is premature.
Three. Plan for memory as a regulated data store from day one. Whatever the model writes to memory, treat it as if a user wrote it: classification, retention, encryption, access logs. The fact that it’s machine-generated doesn’t change the compliance posture if the contents include customer data.
Four. Budget for an out-of-band policy classifier. The model’s own refusals are a useful first layer. They are not a sufficient last layer, especially in long-context agent loops. A small, fast classifier in front of every tool call is the single highest-leverage control the Cloudflare review identified, and it’s the one most teams will skip because it costs money and adds latency.
The Mythos review is worth reading in full. The takeaway isn’t that Mythos is dangerous or safe. The takeaway is that the security of an LLM agent is determined almost entirely by the harness, the tools, and the memory model around it - and most organizations procuring these systems are still evaluating them like chatbots.
Contains a referral link.