The refund letter addressed to Dear [Name]

ChatGPT’s first attempt is a draft, not a deliverable. Treating it as production-ready is the single most common reason AI projects stall, break, or quietly bleed money in the background. The model is good. It is not reliable. Those are different properties, and confusing them is what separates a working system from an expensive demo.

The phrase “how hard can it be” usually shows up right before someone ships a workflow held together by a single prompt and a lot of optimism. A few weeks later, the same person is debugging why the model returned a markdown table on Tuesday and a JSON object on Wednesday, why a customer received a refund letter addressed to “Dear [Name]”, or why the cost of a simple summarisation task suddenly tripled because nobody capped the input size. None of those failures are model failures. They are systems failures dressed up as AI problems.
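Two of the failures above are cheap to guard against in code. What follows is a minimal sketch, not a definitive implementation: the names `MAX_INPUT_CHARS`, `cap_input`, and `find_unfilled_placeholders` are hypothetical, the placeholder pattern only covers `[Name]`-style and `{{name}}`-style templates, and a character cap is only a rough proxy for a token budget.

```python
import re

# Hypothetical limit; a character count is a rough stand-in for a token budget.
MAX_INPUT_CHARS = 8_000

# Matches bracketed words like "[Name]" and moustache-style "{{name}}" templates.
PLACEHOLDER = re.compile(r"\[[A-Za-z_ ]+\]|\{\{.*?\}\}")


def cap_input(text: str, limit: int = MAX_INPUT_CHARS) -> str:
    """Truncate the input so one oversized document cannot multiply the bill."""
    return text[:limit]


def find_unfilled_placeholders(letter: str) -> list[str]:
    """Return any template placeholders still present in a generated letter."""
    return PLACEHOLDER.findall(letter)


if __name__ == "__main__":
    draft = "Dear [Name], your refund has been approved."
    leftovers = find_unfilled_placeholders(draft)
    if leftovers:
        # Block the send and route to review instead of mailing the customer.
        print(f"Blocked: unfilled placeholders {leftovers}")
```

A check like this belongs between the model call and the send step, so that a letter with a leftover placeholder is held for review rather than delivered.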

The gap between “it worked once in the playground” and “it runs ten thousand times a day without supervision” is enormous. That gap is where validation, schema enforcement, retries, fallbacks, observability, and cost controls live. Skipping it does not save time. It moves the cost from build to operations, where it compounds. The first attempt being usable is not a feature of the model. It is a coincidence you are about to depend on.

When ChatGPT first hit mainstream use, the assumption was reasonable on the surface: if the model can write production-quality code in a chat window, it can write production-quality code anywhere. If it can draft a flawless email once, it can draft ten thousand flawless emails on demand. The output looked clean, the interface was forgiving, and the failure modes were invisible to anyone not stress-testing the system. That created a generation of teams who confused fluency with reliability.

The deeper assumption underneath was that prompting is engineering. It is not. Prompting is instruction. Engineering is what happens around the instruction: defining inputs, constraining outputs, validating results, handling exceptions, monitoring drift, and isolating non-deterministic components from the parts of the system that must behave predictably. A prompt is a single line of business logic exposed to a probabilistic execution layer. Treating it as the whole system is the same mistake as writing one SQL query and calling it a database platform.

There was also a quieter assumption baked into early adoption: that models would keep improving fast enough to make sloppy integration irrelevant. The thinking went that if GPT-4 was good, GPT-5 would be better, and the cracks in your pipeline would simply close on their own. That is not how this works. Better models change the failure distribution, not the need for structure. A more capable model with no validation layer still hallucinates, still drifts, still produces malformed outputs, and still costs you when nobody is watching. Capability does not replace control. It raises the ceiling on what control is worth.

What changed is that real workloads exposed the difference between a model that can do a task and a system that can do a task ten thousand times in a row without supervision. Once teams moved past prototypes, the same pattern showed up everywhere: outputs that were correct 95 percent of the time, which sounds impressive until you multiply by volume and realise that five percent of your customer-facing responses are wrong, malformed, or unsafe. At ten thousand calls a day, that is five hundred bad responses every day. The cost of cleanup quietly outgrew the cost of the model itself.

The second shift was the rise of structured outputs, function calling, and schema enforcement as first-class features rather than afterthoughts. These exist not because vendors got generous, but because the industry collectively learned that free-form text is not an interface. Production systems need contracts. JSON schemas, typed responses, and validated tool calls turn a probabilistic component into something the rest of your stack can actually rely on. The teams that adopted this early stopped having mysterious downstream failures. The teams that did not are still writing regex to parse model output and pretending that is a strategy.
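To make the idea of a contract concrete, here is a minimal sketch using only the Python standard library. The field names in `REQUIRED_FIELDS`, the `ContractViolation` exception, and `parse_refund` are hypothetical and chosen for illustration; a real deployment would more likely use a vendor's structured-output feature or a schema library.

```python
import json

# Hypothetical contract for a refund decision; field names are illustrative.
REQUIRED_FIELDS = {"customer_name": str, "amount": float, "approved": bool}


class ContractViolation(ValueError):
    """Raised when model output does not match the expected shape."""


def parse_refund(raw: str) -> dict:
    """Parse model output and reject anything that breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractViolation(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractViolation("expected a JSON object")

    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise ContractViolation(f"missing field: {field}")
        value = data[field]
        if expected is float:
            # Accept ints as numbers, but not bools (bool subclasses int).
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        else:
            ok = isinstance(value, expected)
        if not ok:
            raise ContractViolation(f"wrong type for field: {field}")
    return data
```

The point of the design is that a markdown table on Tuesday now raises an exception at the boundary, instead of flowing silently into whatever consumes the output.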

The third shift is operational. Latency budgets, token costs, retry logic, fallback models, and evaluation pipelines are now part of any serious AI deployment. “How hard can it be” gets answered the first time a model update silently changes output formatting and breaks a workflow that has been running fine for three months. The answer is: harder than it looks, and the difficulty lives in the parts nobody demos. The model is the easy part. Everything around it (the validation, the monitoring, the cost ceilings, the human review gates, the structured retries) is where the actual engineering happens. That is the work the first attempt skips, and that is the work that determines whether the system survives contact with reality.
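Retry logic and fallback models can be sketched in a few lines. This is an illustration under stated assumptions: `call_with_fallback` is a hypothetical helper, each "model" is simply a callable that takes a prompt and returns a string (standing in for a real API client), and `validate` is whatever contract check the system uses.

```python
from typing import Callable

Model = Callable[[str], str]


def call_with_fallback(
    prompt: str,
    models: list[Model],
    validate: Callable[[str], bool],
    max_attempts: int = 2,
) -> str:
    """Try each model in order, retrying up to max_attempts times per model.

    Returns the first output that passes validation. Raises RuntimeError,
    with a log of what went wrong, if every attempt fails.
    """
    errors: list[str] = []
    for index, model in enumerate(models):
        for attempt in range(1, max_attempts + 1):
            try:
                output = model(prompt)
            except Exception as exc:  # network errors, timeouts, rate limits
                errors.append(f"model {index} attempt {attempt}: {exc!r}")
                continue
            if validate(output):
                return output
            errors.append(f"model {index} attempt {attempt}: invalid output")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

The bounded attempt count matters as much as the fallback: an unbounded retry loop against a failing model is one of the ways a bill grows when nobody is watching.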

The failure pattern is consistent enough to be predictable. A team builds a workflow around a prompt that worked. It runs fine for a week, maybe a month. Then something shifts. The model provider pushes a silent update, an input arrives slightly outside the shape the prompt was tuned for, or a downstream system starts receiving outputs it cannot parse. Nothing logs an error because there was no contract to violate. The pipeline keeps running, producing garbage that looks plausible, and the only signal is a customer complaint or a finance team noticing the bill. By the time anyone investigates, the drift has been compounding for weeks.

The mechanism underneath is straightforward. A prompt is not a function. A function takes typed inputs, runs deterministic logic, and returns typed outputs. A prompt takes a string, runs probabilistic inference, and returns another string that approximates the desired shape most of the time. When you wire that into a system without a validation layer, you have just inserted a non-deterministic component into a deterministic pipeline and trusted it to behave. It will, until it does not. The failure is not loud. It is statistical. One in fifty calls returns a slightly wrong field name. One in two hundred returns a hallucinated value that passes basic checks but is factually wrong. One in a thousand produces output that crashes the parser downstream. At low volume this looks like noise. At scale it is a business problem.
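The rates above are easy to turn into daily counts. The sketch below assumes a volume of ten thousand calls a day and treats each call as independent, which is a simplification; `ONE_IN`, `expected_failures`, and `chance_of_at_least_one` are hypothetical names used only for this illustration.

```python
CALLS_PER_DAY = 10_000  # assumed volume for illustration

# Failure modes expressed as "one in N calls", using the rates in the text.
ONE_IN = {
    "wrong field name": 50,
    "hallucinated value": 200,
    "parser crash": 1_000,
}


def expected_failures(calls: int, one_in: dict[str, int]) -> dict[str, float]:
    """Expected number of failures per failure mode for a given call volume."""
    return {name: calls / n for name, n in one_in.items()}


def chance_of_at_least_one(calls: int, n: int) -> float:
    """Probability of at least one failure, assuming independent calls."""
    return 1 - (1 - 1 / n) ** calls


if __name__ == "__main__":
    for name, count in expected_failures(CALLS_PER_DAY, ONE_IN).items():
        print(f"{name}: about {count:.0f} per day")
```

At that volume the expected daily counts are 200 wrong field names, 50 hallucinated values, and 10 parser crashes, which is why a rate that looks like noise in a prototype becomes a standing workload in production.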

Drift also happens at the prompt level itself. Teams iterate on the instruction over time, adding edge cases, patching observed failures, layering in caveats. Each addition seems harmless. The prompt grows from forty words to four hundred, and somewhere in that growth two instructions begin contradicting each other. The model resolves the contradiction differently depending on input, and now you have a system whose behaviour nobody fully understands, maintained by whoever last touched the prompt. This is technical debt, but it does not show up in any static analysis tool because the logic lives in natural language. The only way to catch it is evaluation: a held-out set of inputs with expected outputs, run on every change, with pass-rate thresholds that block deployment. Most teams do not have this. They have a Slack thread and a hope.
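An evaluation gate of the kind described can be very small. This is a minimal sketch, assuming exact-match scoring (real evaluations often need fuzzier comparison) and a hypothetical `run` callable that wraps the prompt under test; `pass_rate`, `deployment_allowed`, and the 0.95 threshold are illustrative choices, not a standard.

```python
from typing import Callable

# Each case pairs an input with the output the system is expected to produce.
Case = tuple[str, str]


def pass_rate(cases: list[Case], run: Callable[[str], str]) -> float:
    """Fraction of held-out cases where the output matches exactly."""
    if not cases:
        raise ValueError("evaluation set is empty")
    passed = sum(1 for given, expected in cases if run(given) == expected)
    return passed / len(cases)


def deployment_allowed(
    cases: list[Case],
    run: Callable[[str], str],
    threshold: float = 0.95,
) -> bool:
    """Return True only if the pass rate meets the threshold."""
    return pass_rate(cases, run) >= threshold
```

In a CI job, a prompt change would be merged only when `deployment_allowed` returns True; otherwise the job exits non-zero and the change is blocked, which replaces the Slack thread with a measurable check.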

This is the same pattern that played out with web scraping ten years ago, with no-code automation five years ago, and with low-code internal tools more recently. Each cycle starts with a capability that lowers the floor for getting something working. A non-engineer ships a scraper in an afternoon. A team automates a workflow with Zapier in a week. Someone builds an internal dashboard in Retool over a weekend. The first version works, the demo lands, and the system is declared done. Then the source website changes its HTML, the API rate-limits, the data model shifts, and nobody knows how to fix it because nobody designed it to be maintained. The cost of the build was low. The cost of ownership was never accounted for.

AI pipelines are the current expression of this pattern, with one important difference: the failure surface is larger and quieter. A broken scraper throws an exception. A broken Zapier flow stops firing. A broken AI pipeline keeps running and produces output that looks correct. The detection lag is longer, and the blast radius is wider because the output is often customer-facing or feeding decisions. The same teams that learned to write tests for their code, monitor their APIs, and version their schemas have not yet internalised that LLM calls need the same treatment. The instinct is to treat the model as an oracle rather than a component. Oracles do not need tests. Components do.

The parallel extends to organisational behaviour. In every previous cycle, the teams that survived the transition from prototype to production were the ones who treated the new capability as infrastructure, not magic. They invested in the unglamorous work: error handling, monitoring, versioning, rollback, documentation, evaluation. The teams that did not invest got outpaced not by competitors with better tools, but by competitors with the same tools and a discipline around using them. AI is following the same curve. The differentiator over the next two years will not be access to better models. Everyone will have access. The differentiator will be the engineering rigour applied around the model, and that rigour is exactly what the “how hard can it be” mindset skips.

The first attempt is a hypothesis, not a system. Shipping it as a system is a choice to absorb the cost of every failure mode you did not design around. That cost is real, it is measurable, and it almost always exceeds the cost of building the validation layer in the first place. Teams that learn this early stop confusing a successful prompt with a successful product. Teams that learn it late spend a quarter rebuilding what they should have built once.

The practical implication is uncomfortable for anyone hoping AI would make engineering easier. It does not. It changes what engineering looks like. The work shifts from writing logic to constraining inference, from handling known cases to bounding unknown ones, from testing functions to evaluating distributions. The skills overlap with traditional engineering but the failure modes are different enough that experienced engineers underestimate the gap. The discipline required is higher, not lower, because the component you are integrating is fundamentally less predictable than anything in your existing stack. Treating it with less rigour than you would treat a third-party API is a category error.

If you are building anything that depends on an LLM and runs more than a few times a day, the question is not whether the model can do the task. The question is what happens the hundredth time it does the task slightly wrong, and whether your system notices, recovers, and tells you. If you cannot answer that, you do not have a production system. You have a prototype that has not failed yet. The model will not save you from that. Nothing will, except the work you skipped when you decided how hard it could be.
