RC RANDOM CHAOS

Willison's lethal trifecta exfiltrates Claude uploads

Technical analysis of indirect prompt injection against Claude AI agents - exfiltration mechanics, ATT&CK mapping, telemetry gaps, residual exposure.

· 7 min read
Willison's lethal trifecta exfiltrates Claude uploads

A live demo circulating on security research channels shows arbitrary file exfiltration from Claude AI sessions through indirect prompt injection against tool-augmented agents. No CVE assigned at time of writing. The class is not new. CWE-77 command injection logic applied to LLM tool routers, paired with CWE-79-style rendering of attacker-controlled output in the client. The novelty is that the user takes no action beyond uploading a file and asking a question. The exfil channel runs through the assistant’s own permitted tool surface.

The bug class is indirect prompt injection against a tool-using agent. The Claude product surface in scope is any deployment where the model has concurrent access to three capabilities - read access to user data such as uploaded files or connected drives, ingestion of untrusted content such as a fetched web page or a document containing attacker-controlled text, and an outbound communication primitive such as a rendered hyperlink, an image URL the client auto-fetches, or an MCP tool that performs HTTP. Simon Willison has catalogued this as the lethal trifecta. The arrangement is not a vulnerability in any single component. It is an emergent capability composition that defeats the implicit trust boundary between user instructions and document content.

The mechanism. A user uploads a document. The document contains adversary-controlled text positioned to look like a system note, a comment, or invisible whitespace-cloaked instructions. The model has no robust signal separating the user’s prompt from text it reads inside the tool result. The transformer sees a single token stream. When the injected instructions say read the contents of file X, base64-encode them, append them as a query string to https://attacker.tld/log.png, and render the result as a markdown image, the model complies because nothing in its alignment training treats document-origin instructions as lower trust than user-origin instructions. The compliance rate varies by model version and system prompt hardening, but it is non-zero, and a single successful generation is sufficient.

The exfil primitive depends on the client. In a chat surface that renders markdown, an image tag with a remote src triggers an outbound GET the moment the message paints. The user sees a broken image or a one-pixel transparent PNG. The attacker’s log line carries the file content in the URL. In agent deployments with MCP-connected tools, the model invokes a permitted HTTP tool directly. The payload travels in a tool call argument the user never inspects. In computer-use mode, the model can open a browser, paste data into a form, and submit. Each variant produces the same outcome - silent transfer of file material to an attacker endpoint inside what the user perceives as a normal conversational turn.

Mapping to MITRE ATT&CK is straightforward once the agent is treated as an execution endpoint. Initial access is T1566.001, spearphishing attachment, when the malicious document is mailed in. T1195.002 covers the supply chain variant where a compromised MCP server returns instructions inside an ordinary tool result. Execution is T1059, command and scripting interpreter, with the LLM as the interpreter and natural language as the language. Collection is T1005, data from local system, performed by the model on the user’s behalf using the file tools the user authorised. Exfiltration is T1567.002, exfiltration to web service, or T1041 when the channel is the model’s own API tool. C2 is T1071.001, web protocols, rendered through whatever HTTP-capable surface the agent holds.

The attacker reaches the condition by controlling any text the model will read during a task. PDF metadata fields. Comments in source files. HTML inside a webpage fetched through a browse tool. White-on-white text in a Word document. ASCII tag content in an SVG. Alt text on an image the model is asked to describe. Spreadsheet cells outside the visible range. Repository README files when the agent has GitHub MCP access. Calendar event descriptions when the connector pulls them. Email bodies when an inbox is connected. The trust assumption the product is making is that content inside a tool result is data. The transformer treats it as instructions. That mismatch is the bug.

The payload structure is mechanical. A directive section that overrides prior context with phrasing the model has learned to obey. A target specification that names a file path, a connector resource, or a context window region. An encoding step that converts the target to a transport-safe form, typically base64 or hex. A delivery template that emits a markdown image, a tool call, or a clickable link the user is socially engineered to follow. The full payload fits in a paragraph. It does not require knowing the user’s identity, prompt history, or system instructions. It only requires landing in a document the model will process.

Real-world exploitation status. Public proofs of concept exist against multiple frontier model deployments including Claude, ChatGPT, Gemini, and Copilot. Researchers including Rehberger, Willison, and Embrace The Red have demonstrated exfil against production surfaces with disclosure to vendors. Anthropic has shipped mitigations across releases - narrower tool permissions, markdown image rendering controls in Claude.ai, link safety warnings, MCP server prompts. No mitigation eliminates the class. The defence has moved from absent to partial. Threat actor adoption in named campaigns has not been publicly attributed at the level of a Lazarus or APT29 callout, but the technique is operationally trivial and is the obvious next step for any actor already running phishing infrastructure. Treat absence of attribution as a reporting gap, not a capability gap.

Telemetry. This is where defenders are blind. The conversation between user and model occurs inside a vendor-controlled boundary. Most enterprise deployments do not log the full token stream. They log API metadata - request count, token count, latency. The tool calls are visible if the deployment captures them, but the calls themselves look legitimate. The model fetched a URL. The model read a file. The model rendered an image. Each action is within the permission set the user granted. Sysmon does not see this. EDR does not see this. The browser fetch of an image src lands in a proxy log as a GET to an external host, indistinguishable from a thousand other image loads, with the file contents encoded into a path the proxy does not parse as suspicious. Network DLP that pattern-matches on plaintext PII will miss base64. Network DLP that decodes base64 may catch fragments depending on chunking. SIEM correlation that flags new external domains will catch one-off attacker infrastructure but misses anything routed through a CDN, a Discord webhook, a public paste service, or a compromised but reputable host.

What fires. CASB and SSE platforms that proxy outbound traffic from the workstation can log the image fetch with full URL, including the encoded payload. DNS query logs will show resolution of attacker domains if they are novel. Browser isolation products may surface the rendered link. What does not fire. Endpoint detection of process behaviour - no new process spawns, no LSASS access, no token manipulation. The exfiltration runs inside the browser process or the Claude desktop client, both of which are signed, trusted, and behaving as designed. The MITRE D3FEND mapping for detection points to D3-NTA, network traffic analysis, and D3-DLIC, document-level integrity checking on documents entering the pipeline, neither of which is widely deployed against LLM input surfaces.

The residual exposure after current mitigations. Claude’s hosted surfaces have constrained markdown image rendering and added link warnings. MCP servers operating outside Anthropic’s control remain permission-broad by default. Enterprise deployments that wire Claude into Slack, Jira, GitHub, Google Drive, and a custom data lake have built a tool graph where any of those sources can inject instructions and any of them can receive exfil. The model’s safety training reduces compliance rate on overt malicious instructions. It does not reduce it to zero on instructions framed as user clarifications, debug notes, or formatting directives. Adversarial robustness research from Anthropic’s own teams documents the gap. Constitutional AI raises the bar. It does not close the door.

The patch boundary on this class is not a version number. It is an architectural decision about whether tool-using agents enforce a hard separation between instruction-trust content and data-trust content at the routing layer, not at the model layer. Until that separation is mechanical - segmented context windows, capability tokens scoped per data source, explicit user confirmation gates on cross-source data flow - the lethal trifecta condition will continue to produce silent exfiltration against any agent that holds all three capabilities concurrently. The fix lives in the agent framework, not in the prompt. Hardened system prompts reduce the rate. They do not eliminate the primitive. Treat any LLM agent with file access, untrusted content ingestion, and outbound network reach as a data exfiltration capability already deployed inside the perimeter, and architect downstream controls accordingly.


Contains a referral link.

See also: NordVPN for tunneled traffic when operating outside controlled networks.


#ad Contains an affiliate link.

Share

Keep Reading

Stay in the loop

New writing delivered when it's ready. No schedule, no spam.