6,000 Emails, Zero Leaks: A Public Prompt-Injection Test the AI Survived

Developer Fernando Iglesias put his OpenClaw-based email assistant, Fiu, on the public chopping block at hackmyclaw.com, daring anyone to coax it into leaking a secrets.env file. After the project hit the Hacker News front page, more than 2,000 people fired off over 6,000 attempts—authority impersonation (fake OpenClaw admins, compliance audits, incident-response emergencies), multi-language social engineering, and rapid-fire variations. None succeeded. The defensive setup was deliberately minimal: a handful of plain-English rules telling the agent never to reveal credentials, modify its own files, run commands from emails, or exfiltrate data.

The experiment surfaced practical lessons beyond the headline result. Google suspended the assistant’s Gmail for three days after mistaking the flood of traffic for fraud, and the token costs topped $500. Batch-processing emails turned out to taint results—once the agent saw a few obvious injections, it grew suspicious of everything after, so each message had to be handled in a fresh context. The model itself even reasoned about the situation, noting in memory that the volume looked like a coordinated security exercise, and at one point flagged a congratulatory message as a possible rapport-building ploy.

Iglesias credits the outcome largely to model choice: the test ran on Claude Opus 4.6, which Anthropic has specifically hardened against prompt injection, and he expects weaker models would fold more easily. His honest caveats matter—single-shot emails are far less dangerous than sustained multi-turn conversations, which budget constraints prevented him from testing. The takeaway is measured optimism rather than an all-clear: prompt injection remains a real risk, and he still wouldn’t hand an AI agent arbitrary permissions, but a capable model with a few clear instructions proved far more resilient than he’d anticipated.