A meditation app shipped a switch statement as AI
Whether a product 'really uses AI' is unanswerable and beside the point. What predicts reliability is system design: validated inputs, constrained outputs, fallbacks.
A meditation app shipped an ‘AI-powered’ breathing coach last year. Strip the marketing away and the feature is a lookup table: it reads the time of day and your last session length, then picks one of nine pre-written scripts. There is no model. There is no inference. There is a switch statement wearing a costume. Down the street, another team ships something that looks identical in the store listing but runs a fine-tuned model behind a retrieval layer, schema validation, and a fallback path for when the model returns garbage. Same two words on the label. Two completely different things underneath. And as a buyer, a user, or an engineer evaluating a vendor, you have almost no way to tell them apart from the outside.
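For a sense of scale, a feature like that fits in a few lines. The sketch below is an illustration, not the app's actual code: the bucket boundaries, the script identifiers, and the function names are all assumptions. Only the two inputs and the count of nine scripts come from the description above.

```python
from datetime import time

# Hypothetical buckets: three times of day x three session lengths = nine scripts.
TIMES_OF_DAY = ("morning", "afternoon", "evening")
SESSION_LENGTHS = ("short", "medium", "long")

# Placeholder identifiers; the real feature would hold pre-written script text.
SCRIPTS = {
    (tod, length): f"script_{tod}_{length}"
    for tod in TIMES_OF_DAY
    for length in SESSION_LENGTHS
}


def time_bucket(now: time) -> str:
    """Map a clock time to a coarse bucket (boundaries are assumed)."""
    if now.hour < 12:
        return "morning"
    if now.hour < 18:
        return "afternoon"
    return "evening"


def length_bucket(last_session_minutes: float) -> str:
    """Map the previous session length to a coarse bucket (boundaries are assumed)."""
    if last_session_minutes < 5:
        return "short"
    if last_session_minutes < 15:
        return "medium"
    return "long"


def pick_script(now: time, last_session_minutes: float) -> str:
    """Deterministic lookup: the same inputs always select the same script."""
    return SCRIPTS[(time_bucket(now), length_bucket(last_session_minutes))]
```

There is nothing to monitor, nothing to retry, and nothing that can drift, which is exactly why it behaves the same way every time.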
The instinct is to be angry that the first product ‘lied’ about using AI. That is the wrong fight. Whether a feature is nine canned scripts or a 70-billion-parameter model matters far less than whether anyone designed it as a system. The meditation app’s lookup table is honestly fine - it is deterministic, cheap, and it does exactly what it claims every single time. The real failure is not products using the word ‘AI’ loosely. The real failure is that the overwhelming majority of these products, regardless of what is actually under the hood, were never built with defined inputs, constrained outputs, or a validation layer. They were built to demo and to list a feature, not to operate.
That is what the label noise is actually costing you. The argument over what counts as ‘real AI’ is a distraction that obscures the problems that matter: pipelines that shatter the moment an input looks unusual, output that gets trusted blindly because it sounded confident, and entire workflows duct-taped to a single fragile prompt string. None of those problems are about whether there is a model in the box. They are about whether anyone engineered the thing around the model. The label tells you nothing about that, and right now almost everyone is optimizing for the label.
What is actually happening is that ‘AI’ has collapsed into a single marketing token stretched across a wide spectrum of implementations. At one end you have hardcoded heuristics and rules - if-statements and lookup tables. A step over you have classical machine learning: a logistic regression or a gradient-boosted model trained on tabular data, often genuinely the right tool. Further along you have a single LLM API call driven by a prompt. Then an LLM call wrapped in retrieval so it has relevant context. At the far end you have a full orchestrated pipeline: defined inputs, structured outputs, retrieval, multiple model steps, validation between them, and deterministic control around the probabilistic core. From the outside, every one of these renders as the same badge in the UI. From the inside, they have nothing in common in terms of reliability, cost, or failure behavior.
The thing that separates these implementations is not the model. It is the architecture around the model. A regex with a clear contract, a defined input shape, and a validation step can behave more like real infrastructure than a raw LLM call with no guardrails. The LLM is more capable, but capability is not the same as reliability. What makes an AI feature hold up in production is boring and unglamorous: inputs that are validated before they reach the model, outputs constrained to a schema so downstream code can actually depend on them, a fallback path for when the model fails or stalls, observability so you can see what it did, and cost and latency budgets that hold under load. The model is maybe twenty percent of a working system. The other eighty percent is the orchestration nobody puts in the press release.
The incentive structure explains why this keeps happening. Market pressure rewards the appearance of intelligence, not the engineering of reliability. A team that adds ‘AI’ to a feature gets attention, a bump in conversion, sometimes a bump in valuation. A team that spends three weeks building a validation layer and a fallback path gets nothing visible to show for it. So effort flows toward the demo - the thing that looks magical in a thirty-second pitch - and away from the operational substrate that determines whether the feature survives contact with real users. The label is cheap to apply and the system is expensive to build, so the label wins, and the gap between what is claimed and what is engineered keeps widening.
The first failure is the belief that ‘real AI’ is the fix - that if the meditation app would just swap its lookup table for a proper language model, the product would get better. It usually gets worse. Dropping a probabilistic model into an architecture that was designed around a deterministic switch statement does not add intelligence; it adds unpredictability, cost, and latency, while removing the one property the old version had: it always did the same thing. Swapping the engine without redesigning the system around it trades predictable mediocrity for expensive, occasional, hard-to-reproduce failure. The model was never the constraint. The absence of a system was.
The second failure is blind trust: treating probabilistic output as if it were deterministic fact. Teams wire a model’s response straight into a workflow with no verification, because in the demo it answered correctly every time they tried. Then it ships, it meets inputs nobody anticipated, and it produces output that is fluent, confident, and wrong - and there is no layer to catch it, because catching it was never part of the design. Closely related is the brittle prompt pipeline, where an entire workflow hangs on one carefully tuned prompt string with no schema enforcement and no fallback. It works until someone changes the model version, or an input arrives in an unexpected shape, and the whole chain quietly degrades with no error and no alarm.
Underneath all of these is the same root pattern: building for the demo instead of for operations. A system designed to impress in a controlled pitch behaves nothing like a system designed to absorb real latency, real cost ceilings, real concurrency, and real edge cases. The demo version looks finished and is actually a prototype with good lighting. So the question worth asking is never ‘does this product use AI.’ That question is unanswerable from the outside and would not tell you much even if you could answer it. The question is whether what you are looking at was designed as a system - whether someone defined the inputs, constrained the outputs, and put a verification layer between the model and anything that depends on it. That is the line that actually predicts whether the thing will hold up, and it is exactly the line the label is built to hide.
The fix is not a sharper detector for what counts as ‘real AI.’ There isn’t one you can run from the outside, and hunting for it burns time you should spend elsewhere. What works is changing the question - of a vendor, of your own team, of a feature on a roadmap - from ‘is there a model in here’ to ‘is there a contract around it.’ A contract means someone wrote down what goes in, what comes out, and what happens when the model misbehaves. That is the thing you can inspect, test, and hold a team accountable to. The presence of a model is invisible and uninformative. The presence of a contract shows up the moment you ask one hard question.
For the people building these features, the working pattern is identical whether the core is a regex, a gradient-boosted classifier, or a 70-billion-parameter model. Validate the input before it reaches the model - normalize or reject anything outside the shape you expect, because that is where most production failures actually enter. Constrain the output to a schema so the code downstream depends on structure, not on fluent prose that happened to parse today. Put a verification step between the model and anything that acts on its answer: a confidence threshold, a rule check, a second pass. Define an explicit fallback for when the model times out, errors, or returns something low-confidence - the path the demo never needed and production always does. Instrument every step so that after the fact you can see exactly what the system did and why. Then set hard cost and latency budgets and enforce them under load, not in a quiet test. That sequence is the system. The model slots into the middle of it and is the smallest replaceable part.
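One way to read that sequence is as a thin wrapper whose shape stays the same no matter what sits in the middle. The sketch below is a minimal illustration under stated assumptions: `run_with_contract`, `StepResult`, and the four callables are hypothetical names, and `call_model` is assumed to enforce its own timeout by raising `TimeoutError`. Cost accounting is left out to keep it short.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Optional

log = logging.getLogger("model_step")


@dataclass(frozen=True)
class StepResult:
    value: str
    source: str            # "model" or "fallback"
    reason: Optional[str]  # why the fallback was used, if it was


def run_with_contract(
    raw_input: str,
    *,
    validate: Callable[[str], Optional[str]],
    call_model: Callable[[str, float], str],
    verify: Callable[[str], bool],
    fallback: Callable[[str], str],
    timeout_s: float = 2.0,
) -> StepResult:
    """Validate, call the model, verify, fall back on failure, log the outcome."""
    started = time.monotonic()
    result: Optional[StepResult] = None
    reason: Optional[str] = None

    cleaned = validate(raw_input)  # 1. normalize, or return None to reject
    if cleaned is None:
        reason = "invalid_input"
    else:
        try:
            # 2. call_model is assumed to raise TimeoutError past timeout_s
            candidate = call_model(cleaned, timeout_s)
            if verify(candidate):  # 3. check before anything acts on the answer
                result = StepResult(candidate, "model", None)
            else:
                reason = "failed_verification"
        except TimeoutError:
            reason = "timeout"
        except Exception:
            reason = "model_error"

    if result is None:  # 4. explicit fallback path
        result = StepResult(fallback(raw_input), "fallback", reason)

    # 5. instrument every call so the outcome can be reconstructed later
    latency_ms = (time.monotonic() - started) * 1000
    log.info("source=%s reason=%s latency_ms=%.1f",
             result.source, result.reason, latency_ms)
    return result
```

Nothing in the wrapper knows whether `call_model` is a regex, a classifier, or an LLM client. Swapping the core means passing in a different callable, while the validation, verification, fallback, and logging stay put.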
For anyone evaluating a product instead of building one, the same architecture is exposed by the questions you ask. What happens when the input is malformed or in a language you didn’t plan for? What is the fallback when the model is down or simply wrong? How do you find out it was wrong - what gets logged, and who looks at it? What are the cost and the p95 latency when traffic triples? A team that built a system answers these in concrete, specific terms because they had to solve each one to ship. A team that built a demo and bolted a label on top goes vague, abstract, or quiet. The vagueness is not a communication problem. It is the absence of the system, surfacing through the conversation.
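The p95 question has a concrete shape, because an average can look healthy while the tail is already over budget. The sketch below is illustrative only: the latency samples are made up, the 2000 ms budget is an assumption, and `percentile_nearest_rank` is a hypothetical helper using the nearest-rank definition of a percentile.

```python
import math
from statistics import mean
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


LATENCY_BUDGET_MS = 2000.0  # assumed budget, for illustration

# Made-up samples: 90 fast calls and 10 slow ones.
latencies_ms = [100.0] * 90 + [3000.0] * 10

average_ms = mean(latencies_ms)                     # 390.0: looks comfortable
p95_ms = percentile_nearest_rank(latencies_ms, 95)  # 3000.0: over budget
over_budget = p95_ms > LATENCY_BUDGET_MS
```

A team that tracks only the average would report 390 ms here. A team that tracks p95 would see that one call in ten blows the budget.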
Take a B2B support team adding ‘AI’ to auto-categorize and route inbound tickets. The demo version is one prompt: ‘categorize this ticket and assign a priority,’ with the model’s text answer wired straight into the routing logic. In the pitch it is flawless, because every ticket they tested was clean English with one clear issue. It ships. Then real tickets arrive - three problems crammed into one message, a forwarded thread with four signatures, sarcasm, an attachment with no body text. The model returns a priority of ‘fairly urgent’ or a category that does not exist in the routing table, and because nothing validates the output, tickets either crash the router or get silently misrouted into a queue nobody watches. Three weeks later the discovery arrives as an escalation from a customer who waited nine days for a reply.
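Both failure modes in that story come from the same line of code: free text from the model used directly as a lookup key. The sketch below is a hypothetical reconstruction, not anyone's real router. The routing table, the queue names, and the comma-separated reply format are assumptions.

```python
# Hypothetical routing table; queue names are placeholders.
ROUTING_TABLE = {
    ("billing", "high"): "billing-urgent",
    ("billing", "low"): "billing-standard",
    ("bug", "high"): "engineering-urgent",
    ("bug", "low"): "engineering-backlog",
}


def route_demo(model_reply: str) -> str:
    """Demo-style routing: trusts the model's text, crashes on anything unexpected."""
    category, priority = [part.strip().lower() for part in model_reply.split(",")]
    return ROUTING_TABLE[(category, priority)]  # KeyError for an unknown pair


def route_demo_with_default(model_reply: str) -> str:
    """The common 'fix': a default queue, which turns the crash into a silent misroute."""
    category, priority = [part.strip().lower() for part in model_reply.split(",")]
    return ROUTING_TABLE.get((category, priority), "misc")
```

With a reply of `"billing, high"` both functions return `"billing-urgent"`. With `"billing, fairly urgent"` the first raises `KeyError` and the second quietly returns `"misc"`, the queue nobody watches.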
The engineered version of the exact same feature looks nothing alike under the hood. The input stage strips signatures and quoted history, checks that there is actual body text, detects language, and truncates to fit the context budget. The model step is a classification call constrained to a schema - an enum of the real categories you route to, plus a numeric confidence - so the output is structurally incapable of naming a category that doesn’t exist. The validation layer enforces a threshold: anything under 0.7 confidence skips auto-assignment and drops into a human triage queue instead of guessing. The fallback covers the model stalling past two seconds or the API failing - those tickets route to a general queue, flagged, never dropped. Every call logs an input hash, the model version, the chosen category, the confidence, the latency, and the cost, onto a dashboard someone actually reads. From the storefront, both products say ‘AI-powered ticket routing.’ Only one of them survives a Monday with four thousand tickets, and only one of them tells you the day a model update quietly drops accuracy instead of letting you find out from an angry customer a month later.
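The core of that design can be sketched compactly. Treat the following as an illustration under assumptions rather than a reference implementation: the category names, queue names, character budget, and the `classify` callable (assumed to return a dict and to raise on timeout or API failure) are all hypothetical. Only the 0.7 threshold and the two-second limit come from the description above. Language detection is omitted, the log record is trimmed to a few fields, and sending empty or off-schema cases to the queues shown is a choice made here, not something the description specifies.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class Category(Enum):  # hypothetical: the real enum mirrors the routing table
    BILLING = "billing"
    BUG = "bug"
    ACCOUNT = "account"


CONFIDENCE_THRESHOLD = 0.7         # below this, a human decides
HUMAN_TRIAGE = "human-triage"      # placeholder queue name
GENERAL_QUEUE = "general-flagged"  # placeholder queue name
MAX_CHARS = 4000                   # assumed stand-in for the context budget


@dataclass(frozen=True)
class Classification:
    category: Category
    confidence: float


def clean_ticket(body: str) -> Optional[str]:
    """Drop quoted history and a trailing signature; None if no body text is left."""
    kept = []
    for line in body.splitlines():
        if line.strip() == "--":           # conventional signature delimiter
            break
        if line.lstrip().startswith(">"):  # quoted reply history
            continue
        kept.append(line)
    text = "\n".join(kept).strip()
    return text[:MAX_CHARS] or None


def parse_classification(raw: Dict) -> Classification:
    """Raise unless the reply names a real category and a confidence in [0, 1]."""
    category = Category(raw["category"])  # ValueError for an unknown category
    confidence = float(raw["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence out of range")
    return Classification(category, confidence)


def route(
    body: str,
    classify: Callable[[str, float], Dict],
    timeout_s: float = 2.0,
) -> Tuple[str, Dict]:
    """Return (queue, log_record) for one ticket."""
    text = clean_ticket(body)
    if text is None:
        return HUMAN_TRIAGE, {"reason": "no_body_text"}
    record: Dict = {"input_hash": hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]}
    try:
        result = parse_classification(classify(text, timeout_s))
    except Exception as exc:  # timeout, API failure, or off-schema reply
        record["reason"] = type(exc).__name__
        return GENERAL_QUEUE, record
    record.update(category=result.category.value, confidence=result.confidence)
    if result.confidence < CONFIDENCE_THRESHOLD:
        return HUMAN_TRIAGE, record
    return result.category.value, record
```

A reply of ‘fairly urgent’ can no longer reach the router: `Category("fairly urgent")` raises, and the ticket lands in the flagged general queue with the reason recorded. A hosted model with schema-constrained output would enforce the enum at generation time instead, and the check here would remain as a second line of defense.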
Notice what carried the second system. It was not the model: the same model could sit inside both versions. What carried it was the eighty percent around the model that never makes the press release: the input validation, the schema, the threshold, the fallback, the logging, the budget. That ratio is the whole point. The model is the cheap, swappable, attention-grabbing twenty percent. The orchestration is the expensive, invisible, decisive remainder, and it is exactly the part the label is designed to let you skip.
So stop trying to decode the badge. Whether a product ‘really uses AI’ is unanswerable from where you stand and tells you almost nothing even when you somehow learn the answer. The lookup-table meditation app and the orchestrated retrieval pipeline wear the same two words, and the words are not where the difference lives. The signal that actually predicts whether a feature holds up is architectural: defined inputs, outputs constrained to a schema, a verification layer between the model and anything that depends on it, a fallback for failure, observability you can read, and cost and latency budgets that survive real load. That list predicts reliability. The label predicts marketing spend.
If you build, put your effort where the leverage is and accept that none of it will look impressive in a thirty-second pitch. The model is the easy part - choosing it, calling it, watching it dazzle in a controlled demo. The system is the hard part, and the hard part is the only part that operates. If you lead, stop rewarding the team that shows you magic while pressing the team that shows you a validation layer to move faster. Ask the operational questions in the first meeting, and treat the quality of the answers as your real diligence.
The uncomfortable truth underneath all of this is that AI changed nothing about how reliable systems get built. It only handed everyone a far more convincing way to skip the work and still look finished. Defined contracts, constrained outputs, fallbacks, and observability were the requirements before any of this, and they are the requirements now. The teams that win the next few years will not be the ones with the largest model or the loudest label. They will be the ones who treated the model as a single constrained component inside a system they actually engineered - and who were calm enough to ship the switch statement when the switch statement was the right answer.