
Engineering teams keep granting agents production database writes

AI agent vulnerabilities are systems engineering failures, not security failures. The fix is architectural containment, not better prompts or guardrails.


Straight Answer

AI agent vulnerabilities are not primarily a security problem. They are a systems design problem wearing a security costume. The exploits everyone writes about - prompt injection, tool abuse, jailbreaks, indirect injection through retrieved documents - are downstream symptoms. The upstream cause is that agents are routinely deployed with capabilities that exceed their constraints, autonomy that exceeds their validation, and trust boundaries that exist only on architecture diagrams and not in the code.

If you wired a junior contractor into your production database with full write access, no code review, no audit log, and a job description written in natural language that anyone on the internet could amend mid-shift, you would not call the resulting incident a security failure. You would call it a hiring failure. Agent vulnerabilities work the same way. The model did not betray you. The system around the model gave it permission to.

The practical consequence is that patching prompts, adding guardrail classifiers, and bolting on output filters will not close the gap. They reduce surface area at the margin. The structural gap - that the agent is the policy enforcement point for actions it has no reliable way to evaluate - stays open. Treating this as a security workstream produces detection. Treating it as a systems engineering workstream produces containment.

What’s Actually Going On

Most production agents are built on the same loose pattern: a model receives a goal, has access to a set of tools, runs in a loop, and decides which tools to call until it believes it is done. The model’s decisions about which tool to call, with which arguments, in which order, are made entirely from text - its prompt, its tool descriptions, its memory, and whatever data has been pulled into context during the run. There is no separation between instructions and data. Every token in the context window has equal authority to influence the next action. That is the actual vulnerability surface, and it exists by design, not by accident.
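
A minimal sketch of that loop makes the point concrete. The names here are illustrative rather than any particular framework; what matters is that tool output re-enters the same message list that carries the user's goal, with nothing to mark it as data rather than instruction.

```python
# A minimal sketch of the common agent loop (names are illustrative, not a real framework).
# Everything the model sees lives in one message list: the user's goal, prior tool output,
# retrieved documents. The next action is chosen from that undifferentiated text.

def run_agent(goal, call_model, tools, max_steps=10):
    context = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        decision = call_model(context)                 # model reads the entire context
        if decision["type"] == "final":
            return decision["content"]
        result = tools[decision["tool"]](**decision["args"])  # side effects happen here
        # Untrusted output re-enters the same context with the same authority as the
        # original instruction. Nothing marks it as data rather than command.
        context.append({"role": "tool", "content": str(result)})
    return "step budget exhausted"
```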

When an attacker exploits an agent, they are not breaking through a wall. They are writing into a field that the system already reads as authoritative. A poisoned webpage, a malicious calendar invite, an email signature, a row in a CRM, a comment in a Jira ticket - anything the agent ingests becomes part of the operating instructions for the next step. The model cannot reliably distinguish between “the user asked me to do this” and “a document I retrieved contains text that asked me to do this.” Both arrive as tokens. The decision-making layer treats them the same way because, structurally, they are the same thing.

This is why the conversation needs to move out of the security team and into the architecture review. The interesting question is not “how do we stop prompt injection?” The interesting question is “why does a single component hold the authority to read untrusted data, decide what action to take, and execute that action against a privileged system, all in one uninterrupted loop?” In any other engineering discipline that combination would be flagged on the first review. We accept it in agent systems because the model’s fluency makes the loop look like reasoning. It is not reasoning. It is token prediction with side effects.

Where People Get It Wrong

The first mistake is treating the model as the trust boundary. Teams write detailed system prompts, add “never do X” instructions, layer on guardrail models, and assume the resulting stack will hold. It will not. A system prompt is a suggestion the model usually follows. It is not a permission system. If the action is dangerous when performed by an attacker, it has to be impossible to perform - not improbable, not discouraged, not filtered after the fact. The trust boundary belongs in the tool layer, the API layer, and the data layer, where it can be enforced by code that does not negotiate.
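
What "enforced by code that does not negotiate" can look like, as a sketch with invented scope and table names: the check lives inside the tool and keys off a credential issued out of band, so the model's phrasing has no bearing on what the call is allowed to do.

```python
# Sketch of a tool that is its own trust boundary (scope and table names are illustrative).
# The model can argue for anything it likes; the code only honours what the credential allows.

class PermissionDenied(Exception):
    pass

WRITABLE_TABLES = {"support_tickets"}            # allowlist lives in code, not in a prompt

def write_row(caller_scopes: set, table: str, row: dict) -> None:
    if "db:write" not in caller_scopes:          # enforced regardless of what the prompt says
        raise PermissionDenied("caller lacks db:write")
    if table not in WRITABLE_TABLES:             # dangerous targets are unreachable, not discouraged
        raise PermissionDenied(f"table {table!r} is not writable by this agent")
    # ... perform the insert/update here ...
```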

The second mistake is conflating agent autonomy with agent capability. A team gives an agent the ability to send email, query the database, call internal services, and write to shared storage, then assumes that giving it a careful goal and good prompts will keep it inside the lines. The agent’s blast radius is defined by the union of every tool it can call, not by the narrower task it was supposedly given. If you would not be comfortable with a stranger holding that exact set of credentials with no supervision, you should not be comfortable with the agent holding them either, because under prompt injection that is functionally what you have built.

The third mistake is mistaking detection for containment. Adding a classifier that flags suspicious outputs, a logging layer that records tool calls, or a human-in-the-loop step that reviews actions after the fact does not change what the agent can do - it only changes how quickly you find out it did something wrong. Detection is necessary, but it is not the control. The control is the set of things the agent is structurally incapable of doing: actions it cannot take because the tool does not exist in its registry, scopes it cannot reach because its credentials do not grant them, data it cannot exfiltrate because the egress path is not wired up. Everything else is a hope, and hope is not an engineering primitive.
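
A sketch of that structural control, with illustrative names: the agent reaches tools only through an explicit registry, and the registry for a given workflow contains only what that workflow needs. Anything absent is not filtered or flagged - it simply cannot be called.

```python
# Sketch of containment as an explicit, auditable registry (all names illustrative).
# The control is structural: a tool that was never registered cannot be called,
# however the context tries to steer the loop.

from typing import Callable, Dict

class ToolRegistry:
    def __init__(self):
        self._tools: Dict[str, Callable] = {}

    def register(self, name: str, fn: Callable) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs):
        if name not in self._tools:
            raise KeyError(f"tool {name!r} does not exist for this agent")
        return self._tools[name](**kwargs)

# One registry per workflow, scoped to what that workflow actually does.
ticket_agent_tools = ToolRegistry()
ticket_agent_tools.register("lookup_ticket", lambda ticket_id: ...)   # read-only helper
# Deliberately absent: send_email, delete_row, outbound HTTP. Those failures are
# impossible, not merely detectable.
```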

Mechanism of Failure or Drift

The failure mode in agent systems is not a single broken component. It is a slow collapse of the boundary between instruction and data, repeated thousands of times per run, until the system loses any coherent definition of who is in control. Every iteration of the agent loop concatenates new material into the context window - tool outputs, retrieved documents, intermediate reasoning, sub-agent responses - and feeds the result back into the same model that is also responsible for deciding the next action. There is no checkpoint where the system stops and asks whether the new context is trustworthy. The loop just runs. Drift is not an edge case in this design. Drift is the default state, and stability is the exception that requires deliberate engineering to maintain.
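
One way to see what the missing checkpoint would look like, as a sketch with invented field names: each item appended to the context carries its provenance, decided at ingestion, and nothing that arrived from a tool or a retrieved document is ever promoted to instruction status.

```python
# Sketch of the checkpoint the default loop lacks (field names are invented).
# Context entries carry provenance; only entries from the operator channel are ever
# allowed to act as instructions. Tool output stays data.

from dataclasses import dataclass

@dataclass
class ContextEntry:
    source: str      # "operator", "tool", "retrieval", ...
    content: str
    trusted: bool    # decided at ingestion, never upgraded later

def append_tool_output(context: list, output: str) -> None:
    context.append(ContextEntry(source="tool", content=output, trusted=False))

def instruction_view(context: list) -> list:
    # What the planner may treat as instructions; everything else is data.
    return [entry for entry in context if entry.trusted]
```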

The second mechanism is capability accumulation across the lifetime of a deployment. Agents start narrow. A team builds one with three tools to handle a single workflow, ships it, and it works. Then the next ticket adds a fourth tool, because the workflow expanded. Then a fifth, because a related team wanted to reuse the same agent. Then a sixth, because someone discovered that with database write access the agent could close the loop on a slightly different task. Six months later the agent has a tool registry that no one has audited as a whole, with a combined blast radius nobody has modelled, exposed to inputs from sources nobody has enumerated. Each individual decision was reasonable. The aggregate is a privileged generalist that can be steered by a paragraph of attacker-controlled text in a retrieved document.

The third mechanism is validation atrophy. In the first weeks of a project, engineers are paranoid. They review tool calls, sample outputs, run red-team prompts, write evals. As the system stabilises and the team moves on, that scrutiny is replaced by dashboards and alerts. The dashboards show success rates, latency, cost. They do not show the decisions the agent did not make safely - the close calls, the actions that were technically permitted but operationally wrong, the slow shift in tool-call distribution as the underlying data sources changed. By the time something visibly breaks, the agent has often been operating in a degraded trust state for weeks. The exploit is the moment the degradation becomes legible. The vulnerability has been live the entire time.

Expansion into Parallel Pattern

This pattern is not unique to AI agents. It is the same shape that produced the worst incidents in microservice architecture, in CI/CD pipeline design, and in the early days of cloud IAM. In each of those domains, a flexible primitive - a service mesh, a build runner, an IAM role - was given enough capability to be useful, then enough trust to be convenient, then enough surface area to be catastrophic. The breaches that followed were never described as failures of the primitive itself. They were described as failures of how the primitive was wired into the rest of the system. The same vocabulary applies to agents. The model is the primitive. The wiring is the failure.

The analogy worth holding onto is the early CI/CD breach pattern, where build runners were given production credentials so they could deploy. The runners were not malicious. The pipelines were not unusual. But because a build runner could be triggered by a pull request, and a pull request could come from anyone with commit access, and a commit could modify the build script, the chain produced a path from external attacker to production secrets that nobody had drawn on a diagram. Agent systems reproduce this exact topology. The agent is the runner. The retrieved document is the pull request. The tool registry is the production credential set. The injection is the modified build script. The industry took roughly a decade to arrive at scoped tokens, ephemeral credentials, signed artifacts, and isolated runners. Agent infrastructure is currently somewhere around year two of that same curve.

The useful move is to stop inventing new mental models for agent security and start porting the ones that already work. Treat the agent as an untrusted execution context, not a trusted decision-maker. Treat its tool calls as outbound requests from a low-privilege service, subject to the same authorisation, rate limiting, and scope checks any other internal client would face. Treat its inputs as user-controlled data, even when they come from internal systems, because internal systems contain user-controlled data. Treat each tool as an API endpoint that has to defend itself, not as a function that can assume its caller is acting in good faith. None of this is novel engineering. It is the application of standard practice to a component that has been temporarily exempted from it because it speaks fluently.
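
Ported to code, that stance is ordinary middleware. The sketch below uses invented names and a crude in-process rate limiter as a stand-in for whatever authorisation and throttling machinery the stack already has; the shape is the point, not the implementation.

```python
# Sketch of treating agent tool calls as requests from a low-privilege internal client
# (names invented; authorisation and rate limiting would come from the existing stack).

import time

class ToolCallGateway:
    def __init__(self, scopes: set, max_calls_per_minute: int = 30):
        self.scopes = scopes
        self.max_calls = max_calls_per_minute
        self.window_start = time.monotonic()
        self.calls_in_window = 0

    def authorise(self, required_scope: str) -> None:
        if required_scope not in self.scopes:
            raise PermissionError(f"agent credential lacks {required_scope!r}")

    def rate_limit(self) -> None:
        now = time.monotonic()
        if now - self.window_start > 60:
            self.window_start, self.calls_in_window = now, 0
        self.calls_in_window += 1
        if self.calls_in_window > self.max_calls:
            raise RuntimeError("agent exceeded its tool-call budget")

    def call(self, tool, required_scope: str, **kwargs):
        self.authorise(required_scope)   # the same checks any internal client would face
        self.rate_limit()
        return tool(**kwargs)
```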

Hard Closing Truth

The agent is not the problem and the attacker is not the problem. The problem is that the architecture grants a probabilistic component the authority of a deterministic one, and then expresses surprise when the outcomes are probabilistic. A model cannot be hardened into a security boundary. It can be made more compliant, more aligned, more resistant to specific classes of input, but at the level where it actually matters - the question of whether a given token sequence will or will not produce a given action - it remains a statistical system. Statistical systems do not enforce policy. They approximate it. Approximation is fine when the cost of being wrong is small. It is not fine when the cost of being wrong is a wire transfer, a deleted table, or a leaked customer record.

The shift that closes the gap is structural, and it is unglamorous. Tools must be the policy layer. Each tool defines, in code, what it will accept, from whom, with what scope, under what rate limits, against what data. The agent calls into this layer the same way any other client would, and the layer enforces what is permitted regardless of how persuasively the agent argues otherwise. Credentials are scoped to the smallest set of operations the workflow actually requires. Untrusted content is tagged at ingestion and never silently promoted to instruction status. Sensitive actions require a second factor that the model cannot satisfy on its own - a human, a separate service, a signed approval. The agent loses some flexibility. The system gains the ability to survive an attacker who has already won the prompt-level battle.
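
The second-factor idea in sketch form, with a hypothetical approval service and key name: the sensitive tool refuses to act without a signed approval minted outside the model's reach, so winning the prompt-level battle still does not produce the wire transfer.

```python
# Sketch of a sensitive action gated on an approval the model cannot mint itself
# (the approval service, key name, and HMAC scheme are illustrative placeholders).

import hashlib
import hmac
import os

# The key lives in the tool service's environment; it never enters the model's context,
# so the model cannot fabricate a valid approval however the prompt is steered.
APPROVAL_KEY = os.environ["APPROVAL_HMAC_KEY"].encode()

def approval_is_valid(action_id: str, signature: str) -> bool:
    expected = hmac.new(APPROVAL_KEY, action_id.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

def transfer_funds(action_id: str, approval_signature: str, amount_cents: int, dest: str) -> None:
    if not approval_is_valid(action_id, approval_signature):
        raise PermissionError("transfer requires a signed approval from outside the agent")
    # ... execute the transfer ...
```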

The organisations that will operate agents safely at scale are not the ones with the best prompt engineering or the most sophisticated guardrail stack. They are the ones who decided, early, that the model is a component, not a colleague. They built the surrounding system to assume the model would, eventually, be wrong about something important. They invested in containment instead of trust. The exploits in the news are loud, specific, and easy to write about, and they will keep coming. The quieter story is that the teams losing to these exploits made the architectural decision to lose months or years before the incident, when they let a generative system hold capabilities that no probabilistic component should ever hold without a deterministic layer underneath it. That decision is the vulnerability. Everything else is just the day it gets exercised.
