your logs are lying to you
Five production failure modes in Claude Code platforms, the exact code that causes each, and the five-step debugging loop that isolates them.
now what is wrong here
At 04:17 UTC the scheduler stopped emitting heartbeats. The watchdog didn’t fire. The audit ledger showed 1,847 successful jobs in the last hour. None of them had actually run.
That’s the bug I want to talk about. Not because it’s clever. Because debugging Claude Code in production means accepting that your logs will lie to you, your exit codes will lie to you, and the only thing that doesn’t lie is the database row that should exist and doesn’t.
the symptom stack
Production failures on Claude Code platforms cluster into five shapes. Memorize these before you memorize anything else:
claude -preturns exit 0 with empty stdout- The subprocess hangs past your timeout and the timeout doesn’t trigger
- The skill loaded but the model ignored its instructions
- The cost on the dashboard tripled overnight with no traffic change
- The job ran, the artifact wrote, but the content is from yesterday’s prompt
Each of these has a different root cause. Each of them will look identical in your monitoring dashboard, which will show green.
empty stdout, exit 0
The single most common Claude Code production bug. You run:
result=$(claude -p "$prompt" --output-format json)
if [ $? -eq 0 ]; then
echo "$result" | jq -r '.content'
fi
Exit is zero. $result is the empty string. jq returns null. Your downstream code writes null to the database. The job is marked complete.
The CLI exited cleanly because the API call succeeded - and returned a content block with zero text. This happens when the model refuses, when the prompt hits a safety filter the SDK doesn’t surface, or when an MCP tool error gets swallowed mid-stream.
The fix is two lines. You don’t trust exit codes. You assert on the payload:
result=$(claude -p "$prompt" --output-format json)
if [ -z "$result" ] || [ "$(echo "$result" | jq -r '.content // empty')" = "" ]; then
echo "empty payload from claude -p" >&2
exit 42
fi
I ship exit code 42 specifically for this. Grep for exit 42 across the platform and you can count how many times the model went silent today. On a normal day: 3 to 8. On a day the API is degraded: 400+.
the timeout that doesn’t time out
subprocess.run(['claude', '-p', prompt], timeout=120) will not always kill the process at 120 seconds. If claude has spawned MCP server children, the timeout fires, Python raises TimeoutExpired, but the children outlive the parent and keep holding the API connection. You’ll see them in ps:
PID PPID CMD
31204 1 node /home/user/.npm/_npx/.../mcp-server-postgres
31205 1 python -m mcp_server_browser
PPID 1. Reparented to init. Still spending tokens.
The fix is to put claude in its own process group and kill the group:
import os, signal, subprocess
proc = subprocess.Popen(
['claude', '-p', prompt],
preexec_fn=os.setsid,
stdout=subprocess.PIPE,
)
try:
out, _ = proc.communicate(timeout=120)
except subprocess.TimeoutExpired:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
proc.wait(timeout=5)
raise
Before this fix the platform leaked roughly $11 a day in orphaned MCP children. After the fix: under $0.20. The commit is one file, 14 lines added.
the skill loaded but didn’t fire
You wrote a skill. It shows up in the system prompt. You test it interactively and it works. In production it silently doesn’t run.
First thing to check: the trigger conditions in the skill’s frontmatter description are competing with another skill that’s lexically closer to the user’s prompt. The model picked the wrong one. There’s no log line for this. The skill loaded, the model considered it, the model chose differently.
The diagnostic is to dump the full transcript and grep for the skill name. If the model never called Skill with your skill’s exact name, the trigger lost. The fix is in the description string, not the skill body. Make the trigger conditions exclusive - list what the skill is FOR and what it is NOT for in the same paragraph. The model reads both.
Second thing to check: the skill is invokable but the prompt that loaded it doesn’t fit in the context after compaction. The skill’s instructions got truncated. You’ll see the skill name in the transcript but its body missing. Move the load-bearing instructions to the top of the skill file. Anthropic’s compaction truncates from the bottom of skill bodies before it truncates from the conversation.
the cost tripled overnight
Your daily spend goes from $47 to $134. Traffic is flat. No new properties launched. The first place to look is not the API console. It’s the prompt cache hit rate.
A cache hit on Anthropic costs roughly 10% of a cache miss. If your cache breakpoints shifted - because a skill description changed, because the system prompt got a new line appended, because the tool list reordered - every call goes back to full price.
from anthropic import Anthropic
client = Anthropic()
resp = client.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
system=[
{"type": "text", "text": LARGE_STATIC_CONTEXT, "cache_control": {"type": "ephemeral"}},
],
messages=[{"role": "user", "content": prompt}],
)
print(resp.usage.cache_read_input_tokens, resp.usage.cache_creation_input_tokens)
Log those two numbers on every call. Alert when the ratio of cache_read to cache_creation drops below 8:1 on a workload that should be hot. The platform has a sentinel that pages me when that ratio breaks. The last time it fired was a one-character whitespace change in a skill description that invalidated 14 downstream caches.
the artifact is from yesterday
The job ran. The file wrote. The content is yesterday’s prompt output. You’re looking at idempotency-key collision.
If your job orchestrator uses f"{property}_{date}" as the dedup key and a job rescheduled across midnight UTC, two different prompts can land on the same key. The second one short-circuits because the system thinks it already ran. APScheduler does this by default if you reuse job IDs and have coalesce=True.
Fix:
scheduler.add_job(
run_property,
'cron',
hour=4,
id=f"{property}_{run_uuid}", # not date-based
coalesce=False,
misfire_grace_time=300,
max_instances=1,
)
Use a UUID per scheduled run, not a derived key. Store the dedup state in your own table with a real timestamp, not the scheduler’s in-memory map.
the debugging loop that actually works
For any Claude Code production bug, in order:
- Did the API call happen at all. Check the Anthropic console request log for the timestamp. If the call isn’t there, the bug is in your shell or your subprocess wrapper, not in the model.
- Did the call return content. Look at
usage.output_tokens. If it’s under 10, the model said almost nothing. That’s a prompt or trigger problem. - Did the content reach your code. Check what you logged immediately after the subprocess. If the API returned content and your variable is empty, the bug is in your parsing.
- Did your code write what it parsed. The artifact on disk versus the variable in memory. If they diverge, you have a write-path bug, usually a race condition with another worker.
- Did anything downstream consume the right artifact. If the job ran but the wrong file got picked up, you have a dedup or ordering bug.
Five checks. Each one rules out a layer. Do them in order. Do not skip ahead because you have a hunch.
what nobody tells you
The hardest production bugs on Claude Code aren’t in the model. They’re in the seams. The shell script that calls claude -p. The Python wrapper that parses JSON. The scheduler that decides when to run. The audit ledger that records success.
The model is the most reliable component in the stack. Your code around it is not. When something breaks, assume your code first. Spend an hour proving the model is wrong before you blame it. Most of the time you will fail to prove it, because it isn’t.
The audit ledger that lied about 1,847 successful jobs at 04:17 UTC? The bug was in the ledger’s commit logic - it logged success before the subprocess returned. Two lines moved. The model was fine.
Try Claude Code yourself: https://claude.com/claude-code
Contains a referral link.
Keep Reading
claude-codeEleven hours lost to one settings file
The undocumented Claude Code config flags, hooks, env vars, and permission patterns I rely on to run six properties in production.
ai agentsAI coding agent bypassed operator's sudo restriction
An AI agent routed around a sudo restriction under the operator's UID. The control was never the boundary. Operator behaviour was.
supply chain securityDetection is not prevention.
Malicious npm packages reached Red Hat cloud services. The boundary admitted code, then classified it. That sequence defines the failure.
Stay in the loop
New writing delivered when it's ready. No schedule, no spam.