RC RANDOM CHAOS

Copilot shipped CWEs in 40% of NYU's 2021 scenarios

Why working AI-generated code still gets rejected in security review: functional correctness is not security correctness, and CWEs ride through clean output.

GitHub Copilot generated vulnerable code roughly 40 percent of the time when the task was security-relevant. NYU’s 2021 study, “Asleep at the Keyboard,” ran 89 completion scenarios mapped to MITRE’s CWE Top 25. Of the 1,689 programs those scenarios produced, about 40 percent contained a weakness. The code compiled. The unit tests passed. The function returned the expected value. It was still wrong in the way that ends up in an incident report.

That gap is the entire reason for the “AI works but I reject it” stance. Functional correctness and security correctness are different properties. The first asks whether the code does what the prompt described. The second asks what else the code can be made to do. A language model optimizes for the first. It carries no representation of the second unless that representation was in the training data and someone put it in the prompt. Working is the floor. It is not evidence of anything above the floor.

An LLM is next-token prediction over public code. It reproduces the median of its corpus, and the median of public code is insecure. Stack Overflow answers concatenate strings into SQL. Tutorials hardcode keys. Sample code disables certificate verification to make the demo run. The model learned all of it as the normal shape of a solution. When it completes a function, it regresses toward that shape. The output is idiomatic, plausible, and frequently carrying a CWE the author never sees. Plausible is the dangerous part - it reads like competence.

The failure modes are specific and repeatable. A query builder that concatenates a parameter into a SQL string works against every benign test value and becomes CWE-89 the moment that value is attacker-controlled. Token generation built on Math.random, or any non-CSPRNG, returns a token of the right length and format that is predictable to anyone who can sample the sequence, CWE-330. AES in ECB mode, or CBC with a static IV, encrypts and decrypts cleanly and leaks plaintext structure, CWE-327. A deserializer pointed at attacker-controlled input - pickle.loads, Java ObjectInputStream, unsafe YAML - round-trips the happy-path object and hands over a CWE-502 remote code execution primitive. A token check that trusts the header's alg field accepts alg:none, and one that verifies the signature but not the claims accepts an unexpired, wrong-audience JWT. Both pass every positive test in the suite. Each one works. Each one ships a vulnerability.
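
The token case is the easiest to show in a few lines. This is a minimal Python sketch, not code from the study: `weak_token`, `strong_token`, and `TOKEN_BYTES` are hypothetical names, and the attacker's recovery of the generator state is modelled simply as sharing a seed. Python's `random` module stands in for any non-CSPRNG.

```python
import random
import secrets

TOKEN_BYTES = 16  # hypothetical token size for this sketch


def weak_token(rng: random.Random) -> str:
    """Right length, right format, backed by a Mersenne Twister (not a CSPRNG)."""
    return rng.getrandbits(TOKEN_BYTES * 8).to_bytes(TOKEN_BYTES, "big").hex()


def strong_token() -> str:
    """Same format, drawn from the operating system's CSPRNG."""
    return secrets.token_hex(TOKEN_BYTES)


# Assumption: the attacker has recovered the generator state. Sharing a seed
# models that here; the server's "secret" tokens are then fully predictable.
server_rng = random.Random(1234)
attacker_rng = random.Random(1234)
issued = [weak_token(server_rng) for _ in range(3)]
predicted = [weak_token(attacker_rng) for _ in range(3)]

print(issued == predicted)                  # True
print(len(issued[0]), len(strong_token()))  # 32 32
```

Both functions return 32 hex characters. A format check or a length assertion cannot tell them apart, which is the point.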

Human-written vulnerable code usually looks off. Odd variable names, inconsistent style, a function that does too much. The reviewer’s eye catches the smell before the bug. AI-written vulnerable code looks clean. It matches the surrounding style, names things sensibly, and reads like code a competent engineer wrote on a good day. The visual signal that triggers scrutiny is gone. The reviewer’s pattern-matcher labels it a normal query function and scrolls. The vulnerability rides through on good formatting.

Stanford measured the second-order effect. Perry, Srivastava, Kumar, and Boneh, 2023, “Do Users Write More Insecure Code with AI Assistants?” Participants with an AI assistant produced less secure code across most tasks and were more likely to rate their insecure code as secure. The confidence moved the wrong direction. Less secure, more sure. That inversion is the hazard, because certainty is what shortens review. Nobody re-reads the code they already trust.

When the same model writes the code and the tests, the blind spot is doubled, not covered. The tests encode the author’s mental model, and here the author is the model. It generates a function that concatenates SQL and a test that sends a benign string and asserts a row comes back. Both are internally consistent. Both are blind to the quote-semicolon-comment the test was never going to send. Green CI in that setup means the code does what the model thought it should, verified by the same model. That is not independent verification. It is a closed loop validating its own assumptions.
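
A small Python sketch of that closed loop, using an in-memory SQLite database. The table, the function names, and the payload are all assumptions made up for illustration. `benign_test` is the kind of test that tends to arrive with the generated function; `hostile_test` is the one that has to come from somewhere else.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # Hypothetical schema, just enough to have something worth stealing.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def find_user_concat(conn, name):
    # Vulnerable: the parameter becomes part of the SQL text (CWE-89).
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_param(conn, name):
    # Parameterized: the driver binds the value; it is never parsed as SQL.
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()


def benign_test(find_user) -> bool:
    # One friendly value in, one row back.
    return find_user(make_db(), "alice") == [("alice", "a-secret")]


def hostile_test(find_user) -> bool:
    # A quote-and-comment payload must match nothing.
    return find_user(make_db(), "x' OR 1=1 --") == []


for impl in (find_user_concat, find_user_param):
    print(impl.__name__, benign_test(impl), hostile_test(impl))
# find_user_concat True False
# find_user_param True True
```

The benign test is green for both implementations. Only the hostile input separates them, and nothing in the happy-path suite would ever have sent it.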

Handing the review back to a model does not break the loop either. An LLM reviewing LLM output samples from the same distribution that produced the bug. It rates the idiomatic-looking query function as fine for the same reason it wrote one. Catching the defect requires an out-of-distribution check - an adversary’s mental model applied to the code, asking not whether this returns the right answer but what is the worst input that can be routed through this line. That model is not in the weights. It is in the reviewer who has run the incident.

This is where rejecting working code earns its cost. Hand-writing the function forces the author to hold the invariants - which value is attacker-controlled, where the trust boundary sits, which path must fail closed. The invariants live in the author’s head, not in the syntax. Accepting generated code inherits the output and discards the invariants. The code arrives with no model of what it must never do. Rejecting it and re-deriving the logic reconstructs the negative space, the set of states the function is required to refuse. Tests assert presence of intended behavior. The negative space is unbounded and untested by definition.

The supply chain version is worse, because there the broken state was the safe one. LLMs hallucinate dependency names. Research across 2024 and 2025 found models recommending packages that do not exist at a meaningful and repeatable rate, and attackers registering the hallucinated names to serve their own code. The phenomenon got a name, slopsquatting. Here the it-works heuristic actively misleads. The generated import fails because the package is missing, and the obvious fix, install the missing package, walks straight into attacker-controlled code. T1195, supply chain compromise. The state where the code did not run was the state that had not yet been compromised.
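
One way to keep the obvious fix from running unattended is to gate installs on names a human has already vetted. The following is a rough Python sketch under stated assumptions: `APPROVED` is a hypothetical allowlist, `fastjsonx` is a made-up package name standing in for a hallucinated one, and the parser handles only simple requirement lines. Anything it cannot parse is flagged rather than skipped.

```python
import re
from typing import List, Optional, Set

# Hypothetical allowlist: names a human has already vetted and pinned.
APPROVED = {"requests", "cryptography", "pyyaml"}


def normalize(name: str) -> str:
    # PEP 503 normalization: lowercase, runs of "-", "_", "." become one "-".
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(line: str) -> Optional[str]:
    # Drop comments and blank lines, then keep the name that precedes any
    # version specifier, extras, or marker.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    # Fail closed: a line we cannot parse is returned raw so it gets flagged.
    return normalize(match.group(0)) if match else line


def unapproved(requirements: str, approved: Set[str] = APPROVED) -> List[str]:
    """Return requirement names nobody has vetted. An empty list means go."""
    allowed = {normalize(n) for n in approved}
    names = (requirement_name(l) for l in requirements.splitlines())
    return sorted({n for n in names if n and n not in allowed})


generated = "requests>=2.31\nfastjsonx==1.0  # suggested by the assistant\nPyYAML\n"
print(unapproved(generated))  # ['fastjsonx']
```

The check does not decide whether a package is malicious. It only forces the unknown name in front of a person before anything is installed.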

None of this surfaces at commit time. It surfaces in production as the alert nobody wanted. The injectable query becomes SQLi in the WAF logs, triaged after exfiltration. The deserializer becomes a T1059 interpreter spawned under the application server, visible in Sysmon Event ID 1 once someone goes looking. The predictable token becomes a hijacked session with zero failed-authentication events, because nothing failed - the attacker computed a valid one. Review is the cheapest point in the lifecycle to kill any of these. Every later stage costs more, through CI, staging, production, and incident response, in that order of increasing pain.

Log4Shell is the canonical case. CVE-2021-44228, CVSS 10.0. JNDI lookups embedded in logged strings functioned exactly as designed. The feature worked. The feature was the vulnerability. Functional correctness was complete and entirely beside the point. That is the principle in one CVE - working describes behavior, not safety, and the two are independent until a human proves otherwise.

The reject decision is also the cheapest one available. A defect killed at review costs the minutes spent reading it. The same defect found in production costs the incident - detection lag, IR hours, disclosure obligations, and whatever the attacker did inside the dwell window. The multiplier between those two points is not marginal. Accepting working-but-unreviewed code trades minutes now for an unknown, attacker-defined cost later. That trade only looks good while the build is green.

The stance is not anti-AI. It is a gate. Generated code enters as a proposal, never as a commit. It earns merge by passing the scrutiny applied to a pull request from an untrusted external contributor, because functionally that is what it is. Provenance unknown. Intent absent. Threat model not included. So I read every line. I model the inputs an adversary controls. I confirm the thing fails closed. I reject anything I cannot fully account for, even when every test is green, and especially when every test is green, because green is what tempts the skip.
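
Failing closed has a recognizable shape in code. Here is a minimal sketch of an HS256 token check written with only the Python standard library. It is an illustration under assumptions, not a replacement for a maintained JWT library: `sign`, `verify`, `KEY`, and the claim values are hypothetical, and `sign` exists only to produce test input.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(part: str) -> bytes:
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def sign(claims: dict, key: bytes, alg: str = "HS256") -> str:
    # Test helper only: builds a token so the checks below have input.
    header = _b64url_encode(json.dumps({"alg": alg, "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(claims).encode())
    mac = hmac.new(key, f"{header}.{body}".encode(), hashlib.sha256).digest()
    return f"{header}.{body}.{_b64url_encode(mac)}"


def verify(token: str, key: bytes, audience: str, now=None):
    """Return the claims only if every check passes; otherwise return None."""
    try:
        header_b64, body_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if header.get("alg") != "HS256":  # pin the algorithm, ignore the header's wish
            return None
        expected = hmac.new(
            key, f"{header_b64}.{body_b64}".encode(), hashlib.sha256
        ).digest()
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        claims = json.loads(_b64url_decode(body_b64))
        now = time.time() if now is None else now
        if claims.get("aud") != audience:  # a valid signature is not enough
            return None
        exp = claims.get("exp")
        if not isinstance(exp, (int, float)) or exp <= now:
            return None
        return claims
    except Exception:
        return None  # anything malformed is a rejection, never a pass


KEY = b"demo-key"  # hypothetical key for the sketch
good = sign({"aud": "api", "exp": 2000}, KEY)
print(verify(good, KEY, "api", now=1000))    # {'aud': 'api', 'exp': 2000}
print(verify(good, KEY, "other", now=1000))  # None
print(verify(good, KEY, "api", now=3000))    # None
print(verify(sign({"aud": "api", "exp": 2000}, KEY, alg="none"), KEY, "api", now=1000))  # None
```

There is exactly one path that returns claims, and every other path, including the exception handler, returns None. That is the property I look for when I read a check like this, generated or not.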

The test suite proves the presence of intended behavior. It does not prove the absence of unintended capability. No checkmark covers the negative space, and the negative space is where exploitation lives. Working code that the reviewer does not understand is not an asset. It is unreviewed attack surface with a passing build, and accepting it is a decision to find out what it does in production.
