Every model behind an API is already leaking

Opening Claim

Anthropic has publicly accused Alibaba of illicitly extracting Claude’s model capabilities. Take the headline at face value for a second, then set it aside, because the part that matters for anyone building real systems has almost nothing to do with Claude. It has to do with the fact that capability extraction works at all - and it works because the boundary between “the model” and “your system” is an API call, not a wall. People keep treating that boundary as a security guarantee. It was never one.

Strip the story down to its actual mechanics and you get something uncomfortable. A frontier model exposed through an interface will leak its behaviour to anyone willing to query it systematically. That is not a flaw in Claude specifically. It is a property of every LLM served behind an endpoint. The model produces outputs, outputs are observable, and observable behaviour can be captured, structured, and used to train something that mimics it. The vendor can write terms of service, add rate limits, and watch for anomalies. None of that changes the underlying physics: if behaviour goes out the door, behaviour can be harvested.

So the real claim of this post is simple. The breach being described - whatever its final legal shape - is an architecture problem before it is a model problem. The exposure exists because of how the system around the model was designed: what it allowed in, what it let out, and who, if anyone, was watching the flow. When you assume the model is inherently secure, you stop designing the only layer you actually control. That assumption is the failure. Everything downstream is just the bill arriving.

The Original Assumption

Walk into most teams shipping AI features and you find the same unspoken belief: the model is a black box that the vendor has made safe, and your job is to call it. Security, alignment, containment - all of it is treated as someone else’s responsibility, baked into the weights, handled upstream. Under that assumption, integration is just plumbing. You wire the model into a pipeline, hand it some tools and some data, and trust that nothing important escapes because the provider is a serious company with serious controls. The model is presumed secure, therefore the system is presumed secure. The two get treated as the same statement. They are not.

The mechanism that breaks this assumption is old and well understood: distillation. Query a capable model enough times, capture the inputs and the outputs, and train a smaller model to reproduce the mapping. You do not need the weights. You do not need the training data. You need access to the behaviour, which the API hands you by design every time you make a call. The capability leaks through the exact door you left open - the output channel - because that channel is the product. A “secure model” served through an endpoint with no egress accounting is a vault with the door propped open and a sign that says please don’t.
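To make the mechanism concrete, here is a deliberately tiny sketch. Everything in it is hypothetical: `teacher` stands in for a remote model whose rule the caller never sees, `harvest` plays the API client capturing input-output pairs, and the "student" is just a nearest-neighbour lookup rather than a trained network. It is a toy under those assumptions, not a description of how any real extraction was carried out.

```python
def teacher(x: int) -> str:
    """Stand-in for a remote model: callers observe outputs, never this rule."""
    return "high" if x >= 60 or 20 <= x < 30 else "low"


def harvest(queries):
    """Capture (input, output) pairs exactly as an API client would see them."""
    return {q: teacher(q) for q in queries}


def make_student(pairs):
    """Build a trivial imitator: answer with the output of the nearest seen input."""
    known = sorted(pairs)

    def student(x: int) -> str:
        nearest = min(known, key=lambda k: abs(k - x))
        return pairs[nearest]

    return student


def agreement(student, inputs) -> float:
    """Fraction of inputs on which the student matches the teacher."""
    inputs = list(inputs)
    matches = sum(student(x) == teacher(x) for x in inputs)
    return matches / len(inputs)


# Sparse querying already recovers most of the behaviour; denser querying recovers more.
sparse_student = make_student(harvest(range(0, 100, 5)))
dense_student = make_student(harvest(range(100)))
```

The student never touches the teacher's rule. It only needs the pairs, and its agreement with the teacher rises with the number of queries, which is the whole point of the paragraph above.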

Undocumented, complex integrations make this worse, and this is where the original assumption does the most damage. When nobody has written down what the model is allowed to receive and what it is allowed to emit, there is no boundary to enforce. There is no output logging, no rate-of-extraction monitoring, no schema constraining what leaves the system, no record of which client pulled how much behaviour over what window. The integration grew organically, one feature at a time, and security was never a layer in it - it was assumed to live inside the model. So the system has no idea it is being drained, and no idea what is being fed into it, because it was never built to know. The assumption that the model is inherently secure is precisely what stops anyone from building the controls that would catch the problem.

What Changed

What changed is the economics, and economics is what makes a risk permanent instead of theoretical. Training a frontier model is widely reported to cost on the order of hundreds of millions of dollars in compute, data, and talent. Distilling a useful slice of its capability through an API costs a tiny fraction of that - for a narrow task, sometimes as little as a few thousand dollars of inference. When the cost of copying behaviour sits that far below the cost of creating it, extraction stops being an edge case and becomes a rational strategy. The incentive does not go away because one company got named. It is structural. As long as capability is expensive to build and cheap to observe, someone will try to observe their way to it.

The second thing that changed is who has to care. For most teams, this story reads as a fight between large labs over intellectual property, something happening above their heads. That framing misses the transferable lesson. The same boundary that failed to protect a vendor’s capability from extraction is the boundary you are relying on to protect your system from the model’s failures - its hallucinations, its inconsistent outputs, its willingness to do whatever a cleverly shaped input asks. If the API boundary cannot keep capability in, it cannot keep risk out either. Both directions are governed by the same layer, and that layer is yours to design, not the vendor’s.

So the question every builder should be asking changes shape entirely. “Is the model secure?” is the wrong question, because it offloads control to a place you have no authority over and treats a probabilistic system as if it came with deterministic guarantees. The question that actually maps to something you can control is: what does my system allow to flow in and out, at what rate, validated by whom, logged where, and shut off under what conditions? The Alibaba accusation is useful exactly because it makes that distinction concrete. It is not a warning about a model. It is a warning about every system that wrapped a model and called the job done. The work that remains - egress control, output validation, rate and anomaly monitoring, defined interfaces around a non-deterministic core - is architecture. It always was.

How the Boundary Erodes

The failure here is gradual, not a single dramatic breach. An integration starts as one model call and grows one feature at a time, and every feature opens another path for behaviour to leave. Because the model was assumed secure, nobody keeps a ledger of what actually exits. Think of the system as having two edges: an input edge and an output edge. The output edge is the product, so by default it is maximally permissive - it returns the model’s full behaviour on every call, with no accounting of how much capability a given client has pulled over time. Extraction is just legitimate usage repeated with intent. At the level of a single request it is indistinguishable from a real query, because it is one. The only signal lives in the aggregate - rate of distinct prompts, coverage across the capability space, volume over a window - and nobody computes the aggregate, because the architecture was never told the aggregate mattered.
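Computing the aggregate does not have to be elaborate. The sketch below is one possible shape, assuming a hypothetical `EgressMonitor` class, an in-memory store, and thresholds that are placeholders rather than recommendations. It tracks two of the signals named above per client over a sliding window: the number of distinct prompts and the volume of output. Coverage across the capability space is harder to measure and is not attempted here.

```python
import hashlib
import time
from collections import defaultdict, deque


class EgressMonitor:
    """Per-client sliding-window accounting. All thresholds are illustrative."""

    def __init__(self, window_seconds=3600.0, max_distinct_prompts=500,
                 max_output_chars=2_000_000):
        self.window_seconds = window_seconds
        self.max_distinct_prompts = max_distinct_prompts
        self.max_output_chars = max_output_chars
        # client_id -> deque of (timestamp, prompt_digest, output_chars)
        self._events = defaultdict(deque)

    def _prune(self, client_id, now):
        """Drop events that have aged out of the window."""
        events = self._events[client_id]
        cutoff = now - self.window_seconds
        while events and events[0][0] < cutoff:
            events.popleft()
        return events

    def record(self, client_id, prompt, output, now=None):
        """Log one request/response pair for a client."""
        now = time.time() if now is None else now
        # Light normalisation so trivial case/whitespace variants count once.
        digest = hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()
        self._events[client_id].append((now, digest, len(output)))
        self._prune(client_id, now)

    def summary(self, client_id, now=None):
        """Aggregate view of what this client pulled inside the window."""
        now = time.time() if now is None else now
        events = self._prune(client_id, now)
        return {
            "requests": len(events),
            "distinct_prompts": len({digest for _, digest, _ in events}),
            "output_chars": sum(chars for _, _, chars in events),
        }

    def flags(self, client_id, now=None):
        """Names of the thresholds this client currently exceeds."""
        stats = self.summary(client_id, now)
        exceeded = []
        if stats["distinct_prompts"] > self.max_distinct_prompts:
            exceeded.append("distinct_prompts")
        if stats["output_chars"] > self.max_output_chars:
            exceeded.append("output_volume")
        return exceeded
```

No single `record` call looks suspicious, which matches the argument: the flag only appears once the per-client totals are computed. A real deployment would need persistent storage and thresholds derived from an observed baseline.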

The shape of the output makes this better or worse, and most systems pick worse without deciding to. A model returning unconstrained natural language hands a distiller the richest possible training signal: full reasoning, full phrasing, the complete input-to-output mapping. Constrain that output to a schema - a label, a score, a bounded set of fields - and you shrink the behavioural surface you expose per call. This will not stop a determined, well-funded adversary, and it is not meant to. It stops you from gift-wrapping the capability. The same logic governs your own data flowing the other way: an output channel with no validation and no redaction emits whatever the model produces, which sooner or later includes something that should never have crossed the edge.
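As a minimal sketch of what constraining the output can look like, assume a hypothetical support-triage feature whose only legitimate output is a label and a confidence score. The `Triage` type, the label set, and `parse_model_output` are all invented for illustration. Anything the model returns outside that shape is rejected at the edge rather than passed along.

```python
import json
from dataclasses import dataclass

# Hypothetical label set for an illustrative triage feature.
ALLOWED_LABELS = {"billing", "technical", "account", "other"}


@dataclass(frozen=True)
class Triage:
    label: str
    confidence: float


def parse_model_output(raw: str) -> Triage:
    """Accept only a JSON object with exactly the expected fields; reject the rest."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("output is not valid JSON") from exc

    # Extra fields (for example free-text reasoning) are refused, not ignored.
    if not isinstance(data, dict) or set(data) != {"label", "confidence"}:
        raise ValueError("output does not match the expected fields")

    label = data["label"]
    confidence = data["confidence"]
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError("label is not in the allowed set")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence is not a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence is out of range")

    return Triage(label=label, confidence=float(confidence))
```

Refusing extra fields is a deliberate choice in this sketch: silently dropping them would still work, but failing loudly gives the boundary a countable signal when the model starts emitting more than the interface permits.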

The drift itself is the core mechanism, and it is almost boring. Security was assumed to live in the weights, so no layer in your system owns the boundary. There is no single place where input is checked and output is counted, because the interface was never defined - it accreted, call by call, feature by feature. With no defined interface, there is nothing to instrument: no egress counter, no per-client extraction budget, no baseline for what normal flow looks like. A system with no concept of normal cannot detect being drained. The failure is not that someone picked a lock. It is that the building was constructed with openings instead of doors, and no one was counting what passed through them.

The Same Hole in Your Own Stack

The useful move is to flip the roles. In the headline, Anthropic is the party whose capability allegedly leaked. In your system, you are Anthropic. Picture a company that builds a support agent over its internal knowledge base - pricing logic, troubleshooting trees, years of hard-won operational answers - and exposes it through a chat widget. The architecture is identical to the one under discussion: a capable model behind an endpoint, outputs fully observable, no egress accounting, no defined interface. A competitor queries it methodically and reconstructs much of the knowledge base from the answers. The proprietary asset was never the model the company rented; it was the company’s own documents, and those documents leaked through the exact mechanism named in the accusation. Smaller scale, same flaw, and now it is your problem rather than a frontier lab’s.

The same undesigned boundary fails in the other direction too, and that is where it gets dangerous fast. Prompt injection is not a separate category of problem - it is this problem viewed from the input edge. A retrieval system that ingests untrusted content - a web page, an inbound email, an uploaded document - hands the model attacker-controlled text dressed up as trusted context. If the model holds tools - send mail, query a database, hit an internal API - an instruction smuggled inside that content can drive those tools. The blindness is the same blindness: the API boundary was treated as a guarantee, so nothing validated what crossed it in either direction. Egress control and input validation are not two projects. They are one discipline seen from two sides of the same edge.
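One way to express that at the input edge is to carry provenance with every piece of context and let it decide which tools are available. The sketch below is an assumption-laden illustration: the `Source` categories, the choice to treat direct user input as trusted, and the `allowed_tools` helper are all hypothetical. It does not detect injected instructions. It only removes side-effecting tools whenever untrusted content is present.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    SYSTEM = "system"
    USER = "user"
    RETRIEVED = "retrieved"  # web pages, inbound email, uploaded documents


# Assumption for this sketch: only system and direct user input count as trusted.
TRUSTED_SOURCES = {Source.SYSTEM, Source.USER}


@dataclass(frozen=True)
class Segment:
    source: Source
    text: str


def allowed_tools(segments, tools):
    """Return the tool names the model may use for this context.

    `tools` maps a tool name to True if the tool has side effects.
    If any segment is untrusted, side-effecting tools are withheld.
    """
    tainted = any(seg.source not in TRUSTED_SOURCES for seg in segments)
    return {
        name
        for name, has_side_effects in tools.items()
        if not (tainted and has_side_effects)
    }
```

For example, with `tools = {"search_docs": False, "send_mail": True}`, a context made only of system and user segments keeps both tools, while adding a single retrieved web page leaves only `search_docs`. The policy is blunt on purpose: it trades capability for a rule that does not depend on the model noticing the attack.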

Tools are where the pattern stops being about leakage and starts being about consequence. Add tools and the output channel is no longer text - it is action. The unaccounted boundary does not merely emit behaviour; it executes. A model wired to a tool that writes to an external service, with no validation layer between the model’s intent and the actual call, is an exfiltration path with a motor attached. The pattern generalises cleanly: anywhere a probabilistic model sits behind an interface you did not deliberately design - defined inputs, constrained outputs, rate accounting, real logging - you are exposed to the same opening Alibaba is accused of walking through. The only variables are the scale of what leaks and who notices first. The architecture is constant.
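A validation layer between the model's intent and the actual call can be small. The sketch below assumes a hypothetical registry in which every tool is paired with a validator, and a stubbed `send_mail` that only returns a string. The allowed domain, the argument names, and the `execute_tool_call` function are illustrative, not any framework's API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class ToolCallRejected(Exception):
    """Raised when a model-proposed tool call fails validation."""


@dataclass(frozen=True)
class ToolPolicy:
    handler: Callable[..., Any]
    validate: Callable[[Dict[str, Any]], bool]


def send_mail(to: str, subject: str, body: str) -> str:
    """Stub handler: a real one would hand off to a mail service."""
    return f"queued mail to {to}"


def validate_send_mail(args: Dict[str, Any]) -> bool:
    """Allow only the expected string arguments and an internal recipient."""
    if set(args) != {"to", "subject", "body"}:
        return False
    if not all(isinstance(value, str) for value in args.values()):
        return False
    # Hypothetical rule: mail may only go to the company's own domain.
    return args["to"].endswith("@example.com")


REGISTRY = {"send_mail": ToolPolicy(handler=send_mail, validate=validate_send_mail)}


def execute_tool_call(registry: Dict[str, ToolPolicy], name: str, args: Any) -> Any:
    """Run a model-proposed call only if the tool is registered and args validate."""
    policy = registry.get(name)
    if policy is None:
        raise ToolCallRejected(f"unknown tool: {name}")
    if not isinstance(args, dict) or not policy.validate(args):
        raise ToolCallRejected(f"arguments rejected for tool: {name}")
    return policy.handler(**args)
```

The model proposes, the gate decides. A call to an unregistered tool, or a `send_mail` addressed outside the allowed domain, raises `ToolCallRejected` before anything executes.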

The Boundary Is Yours or It Doesn’t Exist

Here is the part that does not soften. The model was never your security boundary, and it cannot become one no matter how good it gets. The vendor secures the weights, the training process, and their own infrastructure. They do not secure - they cannot even see - the system you built around their endpoint. The boundary that decides what your system admits and what it emits is yours and only yours. If you did not design it on purpose, you do not have one. You have a default, and the default on an output channel is emit everything, because emitting everything is the product you are paying for.

What that demands is unglamorous and entirely inside your authority. Define the interface so there is a single place to enforce. Constrain the outputs so each call surrenders less. Account for egress so the system has a concept of how much behaviour and data have left, per client, over time. Log the flow so you can reconstruct what happened. Decide, in advance, the conditions under which you cut the connection. None of this needs a better model or a vendor’s permission. It needs you to treat the system wrapped around the model as the thing you are actually accountable for. The accusation in the headline is a large-scale demonstration of what happens when even the most capable organisations in this field still file that layer under someone else’s responsibility.
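Pulled together, those demands fit in one small wrapper. The sketch below is a hypothetical `ModelBoundary` class, not a reference design: `call_model` and `validate_output` are supplied by the caller, the per-client budget is a simple lifetime character count rather than a windowed one, and the shut-off condition (a run of consecutive validation failures) is just one example of a rule decided in advance.

```python
import json
import logging
import time

logger = logging.getLogger("model_boundary")


class BoundaryClosed(Exception):
    """The boundary has been shut off and admits no further calls."""


class BudgetExceeded(Exception):
    """This client has already pulled its allotted volume of output."""


class ModelBoundary:
    """Single enforcement point around a model call. Limits are illustrative."""

    def __init__(self, call_model, validate_output,
                 max_chars_per_client=100_000, max_consecutive_failures=3):
        self.call_model = call_model            # prompt -> raw model output
        self.validate_output = validate_output  # raw -> result, raises ValueError
        self.max_chars_per_client = max_chars_per_client
        self.max_consecutive_failures = max_consecutive_failures
        self.is_open = True
        self._egress = {}   # client_id -> characters emitted so far
        self._failures = 0  # consecutive validation failures

    def _log(self, event, client_id, chars):
        """Emit one structured log line per decision."""
        logger.info(json.dumps({"ts": time.time(), "event": event,
                                "client": client_id, "chars": chars}))

    def ask(self, client_id, prompt):
        """Admit a prompt, validate the output, account for what leaves."""
        if not self.is_open:
            raise BoundaryClosed("boundary has been shut off")
        if self._egress.get(client_id, 0) >= self.max_chars_per_client:
            self._log("budget_exceeded", client_id, 0)
            raise BudgetExceeded(client_id)

        raw = self.call_model(prompt)
        try:
            result = self.validate_output(raw)
        except ValueError:
            self._failures += 1
            self._log("rejected", client_id, len(raw))
            if self._failures >= self.max_consecutive_failures:
                self.is_open = False  # pre-agreed shut-off condition
                self._log("shut_off", client_id, 0)
            raise

        self._failures = 0
        self._egress[client_id] = self._egress.get(client_id, 0) + len(raw)
        self._log("emitted", client_id, len(raw))
        return result
```

Nothing here depends on the model being better or on the vendor doing anything differently. Every decision happens in code the builder owns, and each one leaves a log line behind.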

So retire the question. “Is the model secure?” points at something you do not control, cannot inspect, and cannot change, and it quietly assumes a probabilistic system ships with deterministic guarantees it never had. The question that maps to something you own completely is this: what does my system allow to flow, at what rate, validated by whom, logged where, and shut off under what conditions? The breach in the headline, whatever a court eventually decides, collapses to one sentence - someone designed a boundary, or they did not. The model is not the system. The system is the part that was always yours to build.

