now what is wrong here

At 04:17 UTC the scheduler stopped emitting heartbeats. The watchdog didn’t fire. The audit ledger showed 1,847 successful jobs in the last hour. None of them had actually run.

That’s the bug I want to talk about. Not because it’s clever. Because debugging Claude Code in production means accepting that your logs will lie to you, your exit codes will lie to you, and the only thing that doesn’t lie is the database row that should exist and doesn’t.

the symptom stack

Production failures on Claude Code platforms cluster into five shapes. Memorize these before you memorize anything else:

claude -p returns exit 0 with empty stdout
The subprocess hangs past your timeout and the timeout doesn’t trigger
The skill loaded but the model ignored its instructions
The cost on the dashboard tripled overnight with no traffic change
The job ran, the artifact wrote, but the content is from yesterday’s prompt

Each of these has a different root cause. Each of them will look identical in your monitoring dashboard, which will show green.

empty stdout, exit 0

The single most common Claude Code production bug. You run:

result=$(claude -p "$prompt" --output-format json)
if [ $? -eq 0 ]; then
 echo "$result" | jq -r '.content'
fi

Exit is zero. $result is the empty string. jq returns null. Your downstream code writes null to the database. The job is marked complete.

The CLI exited cleanly because the API call succeeded - and returned a content block with zero text. This happens when the model refuses, when the prompt hits a safety filter the SDK doesn’t surface, or when an MCP tool error gets swallowed mid-stream.

The fix is two lines. You don’t trust exit codes. You assert on the payload:

result=$(claude -p "$prompt" --output-format json)
if [ -z "$result" ] || [ "$(echo "$result" | jq -r '.content // empty')" = "" ]; then
 echo "empty payload from claude -p" >&2
 exit 42
fi

I ship exit code 42 specifically for this. Grep for exit 42 across the platform and you can count how many times the model went silent today. On a normal day: 3 to 8. On a day the API is degraded: 400+.

the timeout that doesn’t time out

subprocess.run(['claude', '-p', prompt], timeout=120) will not always kill the process at 120 seconds. If claude has spawned MCP server children, the timeout fires, Python raises TimeoutExpired, but the children outlive the parent and keep holding the API connection. You’ll see them in ps:

 PID PPID CMD
31204 1 node /home/user/.npm/_npx/.../mcp-server-postgres
31205 1 python -m mcp_server_browser

PPID 1. Reparented to init. Still spending tokens.

The fix is to put claude in its own process group and kill the group:

import os, signal, subprocess

proc = subprocess.Popen(
 ['claude', '-p', prompt],
 preexec_fn=os.setsid,
 stdout=subprocess.PIPE,
)
try:
 out, _ = proc.communicate(timeout=120)
except subprocess.TimeoutExpired:
 os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
 proc.wait(timeout=5)
 raise

Before this fix the platform leaked roughly $11 a day in orphaned MCP children. After the fix: under $0.20. The commit is one file, 14 lines added.

the skill loaded but didn’t fire

You wrote a skill. It shows up in the system prompt. You test it interactively and it works. In production it silently doesn’t run.

First thing to check: the trigger conditions in the skill’s frontmatter description are competing with another skill that’s lexically closer to the user’s prompt. The model picked the wrong one. There’s no log line for this. The skill loaded, the model considered it, the model chose differently.

The diagnostic is to dump the full transcript and grep for the skill name. If the model never called Skill with your skill’s exact name, the trigger lost. The fix is in the description string, not the skill body. Make the trigger conditions exclusive - list what the skill is FOR and what it is NOT for in the same paragraph. The model reads both.

Second thing to check: the skill is invokable but the prompt that loaded it doesn’t fit in the context after compaction. The skill’s instructions got truncated. You’ll see the skill name in the transcript but its body missing. Move the load-bearing instructions to the top of the skill file. Anthropic’s compaction truncates from the bottom of skill bodies before it truncates from the conversation.

the cost tripled overnight

Your daily spend goes from $47 to $134. Traffic is flat. No new properties launched. The first place to look is not the API console. It’s the prompt cache hit rate.

A cache hit on Anthropic costs roughly 10% of a cache miss. If your cache breakpoints shifted - because a skill description changed, because the system prompt got a new line appended, because the tool list reordered - every call goes back to full price.

from anthropic import Anthropic
client = Anthropic()

resp = client.messages.create(
 model="claude-opus-4-7",
 max_tokens=1024,
 system=[
 {"type": "text", "text": LARGE_STATIC_CONTEXT, "cache_control": {"type": "ephemeral"}},
 ],
 messages=[{"role": "user", "content": prompt}],
)
print(resp.usage.cache_read_input_tokens, resp.usage.cache_creation_input_tokens)

Log those two numbers on every call. Alert when the ratio of cache_read to cache_creation drops below 8:1 on a workload that should be hot. The platform has a sentinel that pages me when that ratio breaks. The last time it fired was a one-character whitespace change in a skill description that invalidated 14 downstream caches.

the artifact is from yesterday

The job ran. The file wrote. The content is yesterday’s prompt output. You’re looking at idempotency-key collision.

If your job orchestrator uses f"{property}_{date}" as the dedup key and a job rescheduled across midnight UTC, two different prompts can land on the same key. The second one short-circuits because the system thinks it already ran. APScheduler does this by default if you reuse job IDs and have coalesce=True.

Fix:

scheduler.add_job(
 run_property,
 'cron',
 hour=4,
 id=f"{property}_{run_uuid}", # not date-based
 coalesce=False,
 misfire_grace_time=300,
 max_instances=1,
)

Use a UUID per scheduled run, not a derived key. Store the dedup state in your own table with a real timestamp, not the scheduler’s in-memory map.

the debugging loop that actually works

For any Claude Code production bug, in order:

Did the API call happen at all. Check the Anthropic console request log for the timestamp. If the call isn’t there, the bug is in your shell or your subprocess wrapper, not in the model.
Did the call return content. Look at usage.output_tokens. If it’s under 10, the model said almost nothing. That’s a prompt or trigger problem.
Did the content reach your code. Check what you logged immediately after the subprocess. If the API returned content and your variable is empty, the bug is in your parsing.
Did your code write what it parsed. The artifact on disk versus the variable in memory. If they diverge, you have a write-path bug, usually a race condition with another worker.
Did anything downstream consume the right artifact. If the job ran but the wrong file got picked up, you have a dedup or ordering bug.

Five checks. Each one rules out a layer. Do them in order. Do not skip ahead because you have a hunch.

what nobody tells you

The hardest production bugs on Claude Code aren’t in the model. They’re in the seams. The shell script that calls claude -p. The Python wrapper that parses JSON. The scheduler that decides when to run. The audit ledger that records success.

The model is the most reliable component in the stack. Your code around it is not. When something breaks, assume your code first. Spend an hour proving the model is wrong before you blame it. Most of the time you will fail to prove it, because it isn’t.

The audit ledger that lied about 1,847 successful jobs at 04:17 UTC? The bug was in the ledger’s commit logic - it logged success before the subprocess returned. Two lines moved. The model was fine.

Try Claude Code yourself: https://claude.com/claude-code

Contains a referral link.