RC RANDOM CHAOS

The cache we enabled was a 25% surcharge.

A broken prompt-cache config charged Foundry a 25% input premium for 11 days. The fix, the batch ratio, and why RAM prices set your token bill.


Caching Around the RAM Tax

For 11 days, cache_read_input_tokens came back 0 on every scheduled generation job across all six Foundry properties. We were re-sending an identical 6,800-token system prompt 1,440 times a day and paying full Opus 4.8 input price for every byte. The cache was switched on, so we also paid the 1.25× write premium on each call and read it back zero times. That is not a missed discount. That is a 25% surcharge on input for a cache nobody touched.

The bill that exposed it: spend on foundry-generate was running $149/day, about $63 of it input, when the math said the total should be a third of that.

The take this is responding to

The post that prompted this argues the LLM industry needs RAM prices to stay high, because cheap memory makes local models good enough to undercut hosted inference. Treat it as a market thesis, not a conspiracy. The mechanism is real either way. DDR5 and HBM pricing feeds the cost of every accelerator an inference provider racks, and that cost lands in the per-token rate I pay. Opus 4.8 is $5 per million input tokens and $25 per million output. I do not set that number. I do not run the silicon.

What I set is how many tokens hit that rate at full price. RAM is somebody else’s supply curve. Cache hit rate and batch ratio are mine. Foundry is a price-taker on tokens and a price-maker on token volume, and the second one is the only one I can ship a commit against.

So when the bill came in at three times the estimate, I did not open a ticket about RAM. I went looking for the tokens I was paying for twice.

What I was looking for vs what I found

I expected output to dominate. A 1,200-word post is roughly 2,400 output tokens once adaptive thinking is counted, and output is $25/M. I assumed the jobs were output-bound and the only move was shorter posts.

The usage object said otherwise.

resp = client.messages.create(model="claude-opus-4-8", ...)
print(resp.usage.input_tokens) # 200 (uncached tokens after the breakpoint only)
print(resp.usage.cache_creation_input_tokens) # 6800
print(resp.usage.cache_read_input_tokens) # 0 <-- every call

cache_creation_input_tokens was 6,800 on every request. cache_read_input_tokens was 0 on every request. The cache was written and never read. Prompt caching is a prefix match: the key is the exact bytes up to the cache_control breakpoint, and one byte of drift anywhere in that prefix invalidates the entry. Something in the prefix changed on every call.
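A minimal sketch of why one volatile line is enough to zero the hit rate. It models the cache key as a digest of the exact prefix bytes. The real key derivation is internal to the API, so `prefix_key`, `build_prefix`, and the prompt text are illustrative assumptions, not the provider's implementation.

```python
import hashlib
from datetime import datetime, timedelta
from typing import Optional


def prefix_key(prefix: str) -> str:
    """Stand-in for a prefix-match cache key: a digest of the exact bytes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def build_prefix(now: Optional[datetime]) -> str:
    """Hypothetical persona prompt; `now` is the only volatile part."""
    lines = ["You are Example Persona."]
    if now is not None:
        lines.append(f"Current date: {now.isoformat()}")
    lines.append("Style rules: short sentences, no filler.")
    return "\n".join(lines)


t0 = datetime(2025, 1, 1, 2, 0, 0)
t1 = t0 + timedelta(minutes=1)

# Same persona, one minute apart: the timestamp changes the bytes, so the key changes.
drifting_keys_match = prefix_key(build_prefix(t0)) == prefix_key(build_prefix(t1))

# Same persona with no timestamp in the prefix: identical bytes, identical key.
stable_keys_match = prefix_key(build_prefix(None)) == prefix_key(build_prefix(None))

print(drifting_keys_match, stable_keys_match)
```

The point of the sketch is only that the match is byte-exact: there is no partial credit for a prefix that is 99.9% identical.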

Root cause

Here is what silently zeros a hit rate. Grep your prompt-assembly path for all of these before you blame the model:

  • datetime.now() / time.time() / any timestamp in the cached prefix
  • uuid4() or a request ID rendered near the top of the system prompt
  • json.dumps(tools) without sort_keys=True, or iterating a set
  • a tool list that varies per call (tools render at position 0, ahead of system)
  • a per-persona flag string-concatenated into the shared system block

We hit the first one. personas/base_prompt.py:41:

# BUG: datetime.now() changes the prefix bytes on every call.
SYSTEM = f"""You are {persona.name}.
Current date: {datetime.now().isoformat()}
{persona.style_rules}
{persona.banned_phrases}
..."""

The date line existed so each persona knew “today” when it referenced recency. It sat at line 2 of a 6,800-token prompt. Every call after the first changed those bytes, invalidated the entire cached prefix, and wrote a fresh one. The breakpoint at the end of the system block never had a stable prefix to match. We paid 1.25× input on 6,800 tokens, 1,440 times a day, for nothing.
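The grep from the checklist in this section can be sketched as a small scanner. This is an assumption-heavy illustration: the patterns cover only the timestamp, UUID, and unsorted `json.dumps` cases, the names `VOLATILE_PATTERNS` and `find_volatile_lines` are hypothetical, and a line-based regex will miss anything split across lines.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; extend them for your own prompt-assembly code.
VOLATILE_PATTERNS = {
    "timestamp": re.compile(r"\b(datetime\.now|datetime\.utcnow|time\.time)\s*\("),
    "random id": re.compile(r"\buuid4\s*\("),
    # json.dumps( with no sort_keys=True before the closing parenthesis
    "unsorted json": re.compile(r"json\.dumps\((?![^)]*sort_keys\s*=\s*True)"),
}


def find_volatile_lines(source: str) -> List[Tuple[int, str, str]]:
    """Return (line number, label, stripped line) for each suspicious line."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in VOLATILE_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label, line.strip()))
    return hits


sample = '''SYSTEM = f"""You are {persona.name}.
Current date: {datetime.now().isoformat()}
{persona.style_rules}"""
payload = json.dumps(tools)
safe = json.dumps(tools, sort_keys=True)
'''

hits = find_volatile_lines(sample)
for lineno, label, text in hits:
    print(f"line {lineno}: {label}: {text}")
```

A scanner like this only narrows the search. It says nothing about whether a flagged line actually lands before the breakpoint, which still needs a human read.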

The fix

Move the volatile bytes out of the cached prefix and behind the breakpoint. The stable persona prompt goes first and carries the cache_control marker. The date and topic move into the user turn, after the marker, where they are allowed to vary.

personas/base_prompt.py @ a4f9c12:

# Stable across every call for this persona - cached.
system = [{
    "type": "text",
    "text": f"You are {persona.name}.\n{persona.style_rules}\n{persona.banned_phrases}",
    "cache_control": {"type": "ephemeral"},
}]

# Volatile - never cached, lives in the user turn.
messages = [{
    "role": "user",
    "content": f"Current date: {today.isoformat()}\nTopic: {topic}",
}]

Render order is tools → system → messages, so a breakpoint on the last system block caches tools and system together and the user turn stays free to change. Next deploy, the counter moved:

resp.usage.cache_read_input_tokens # 6800
resp.usage.cache_creation_input_tokens # 0 (after the first call in the window)

Cache reads bill at 0.1× input. The cached span went from $6.25/M (the $5/M rate at the 1.25× write premium) to $0.50/M (the same rate at the 0.1× read multiplier). That is the entire game.

The numbers, per job

Assume a 6,800-token system prompt, a 200-token topic, and 2,400 output tokens including adaptive thinking. Opus 4.8 at $5/$25.

| Stage | Input $/job | Output $/job | Total $/job |
| --- | --- | --- | --- |
| Broken cache (write every call) | 0.0435 | 0.060 | 0.1035 |
| Fixed cache (read every call) | 0.0044 | 0.060 | 0.0644 |
| Fixed cache + Batch API | 0.0022 | 0.030 | 0.0322 |

Across 1,440 jobs/day that is $149/day, then $93/day, then $46/day. The cache fix alone was one moved line and about $56/day. Monthly, the full stack took foundry-generate from roughly $4,470 to $1,392.
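The per-job figures can be reproduced with a few lines of arithmetic. This is a sketch using only the rates and token counts stated above; the function name `job_cost` and the constant names are mine, and it assumes the batch discount applies uniformly to the cached span as well.

```python
INPUT_PER_MTOK = 5.00     # Opus 4.8 input, $ per million tokens (from the text)
OUTPUT_PER_MTOK = 25.00   # Opus 4.8 output, $ per million tokens (from the text)
CACHE_WRITE_MULT = 1.25
CACHE_READ_MULT = 0.10
BATCH_DISCOUNT = 0.50
JOBS_PER_DAY = 1440


def job_cost(cached_tokens, uncached_tokens, output_tokens, cache_mult, batch=False):
    """Return (input $, output $) for one job."""
    billable_input = cached_tokens * cache_mult + uncached_tokens
    input_cost = billable_input * INPUT_PER_MTOK / 1_000_000
    output_cost = output_tokens * OUTPUT_PER_MTOK / 1_000_000
    scale = BATCH_DISCOUNT if batch else 1.0
    return input_cost * scale, output_cost * scale


stages = {
    "broken cache": job_cost(6800, 200, 2400, CACHE_WRITE_MULT),
    "fixed cache": job_cost(6800, 200, 2400, CACHE_READ_MULT),
    "fixed cache + batch": job_cost(6800, 200, 2400, CACHE_READ_MULT, batch=True),
}

for name, (input_cost, output_cost) in stages.items():
    total = input_cost + output_cost
    print(f"{name}: ${total:.4f}/job, ${total * JOBS_PER_DAY:.0f}/day")
```

Swapping in a different prompt size or job count is the quickest way to see whether caching or batching is the bigger lever for a given workload.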

The second lever: batch everything that isn’t watching

Caching only attacks the input side. Output is still $25/M and a content job is output-heavy. The Message Batches API takes 50% off both input and output. The cost is latency: results land within an hour in practice, 24 hours worst case, and stay retrievable for 29 days.

A post scheduled to publish tomorrow morning does not need a 3-second response. None of Foundry’s generation is user-facing in real time. That is the entire qualifying condition for batching, and almost all of the spend qualifies.

The failure modes to know before routing a job to batch:

  • results come back unordered - key by custom_id, never by position
  • a request that 400s returns as an errored result, not a raised exception
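  • prompt caching still works inside a batch, so the 0.1× read stacks on the 50% discount

from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

# One request per queued post; custom_id is the only reliable join key for results.
batch = client.messages.batches.create(requests=[
    Request(
        custom_id=f"{prop}-{slug}",
        params=MessageCreateParamsNonStreaming(
            model="claude-opus-4-8", max_tokens=4000,
            system=system,  # same cached block
            messages=messages,
        ),
    )
    for prop, slug, system, messages in tonights_queue
])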

APScheduler now assembles the next 24 hours of scheduled content into one batch submission at 02:00 instead of firing 1,440 synchronous calls across the day. scheduler/batch_runner.py @ e71b3d8.
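The first two failure modes suggest a collection step shaped roughly like this. It is a sketch, not the contents of batch_runner.py: `collect_results` is a hypothetical helper, and the fake entries only mimic the `custom_id` and `result.type` fields that batch result entries carry.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Tuple


def collect_results(entries: Iterable[Any]) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Key results by custom_id and separate successes from everything else."""
    succeeded: Dict[str, Any] = {}
    failed: Dict[str, str] = {}
    for entry in entries:
        if entry.result.type == "succeeded":
            succeeded[entry.custom_id] = entry.result.message
        else:
            # errored, canceled, and expired arrive as data, not as exceptions
            failed[entry.custom_id] = entry.result.type
    return succeeded, failed


# With the real SDK the entries would come from something like
# client.messages.batches.results(batch.id); these stand-ins are deliberately
# out of submission order.
fake_entries = [
    SimpleNamespace(
        custom_id="prop-b-post-2",
        result=SimpleNamespace(type="errored", error="invalid_request"),
    ),
    SimpleNamespace(
        custom_id="prop-a-post-1",
        result=SimpleNamespace(type="succeeded", message="draft text"),
    ),
]

succeeded, failed = collect_results(fake_entries)
print(sorted(succeeded), failed)
```

Anything in `failed` needs an explicit retry or alert path, because nothing upstream will raise on its behalf.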

Two more traps in the same week

Once the counter was monitored, it caught two more before they ran for days.

Opus 4.8 will not cache a prefix shorter than 4,096 tokens. One property’s persona prompt was 3,100 tokens. Marking it with cache_control did nothing: cache_creation_input_tokens stayed 0, no error, no warning, full price forever. Sonnet 4.6 caches down to 2,048, so the same prompt cached on the property we ran on Sonnet and silently did not on the one we ran on Opus. The fix was not padding the prompt. It was knowing the floor is model-specific and reading the counter instead of trusting the config.
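One way to stop trusting the config is a pre-flight guard against the model-specific floor. The sketch below takes the two floors from the paragraph above. The `claude-sonnet-4-6` model ID, the table name, and `cache_will_apply` are assumptions for illustration, and the token count is expected to come from wherever you already measure prompt size.

```python
# Minimum cacheable prefix length per model, as stated in the text above.
# The Sonnet model ID string is an assumption; check it against your own config.
MIN_CACHEABLE_TOKENS = {
    "claude-opus-4-8": 4096,
    "claude-sonnet-4-6": 2048,
}


def cache_will_apply(model: str, prefix_tokens: int) -> bool:
    """True if a prefix of this size meets the model's caching floor."""
    try:
        floor = MIN_CACHEABLE_TOKENS[model]
    except KeyError:
        # Unknown model: fail loudly rather than silently assume caching works.
        raise ValueError(f"no caching floor recorded for model {model!r}")
    return prefix_tokens >= floor


checks = {
    ("claude-opus-4-8", 3100): cache_will_apply("claude-opus-4-8", 3100),
    ("claude-sonnet-4-6", 3100): cache_will_apply("claude-sonnet-4-6", 3100),
    ("claude-opus-4-8", 6800): cache_will_apply("claude-opus-4-8", 6800),
}

for (model, tokens), ok in checks.items():
    print(f"{model} @ {tokens} tokens: {'cacheable' if ok else 'below floor'}")
```

A guard like this is a complement to the counter, not a replacement: the counter is what tells you the table is out of date.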

The editing agent hit the 20-block lookback. Each breakpoint walks back at most 20 content blocks to find a prior entry. Our editing loop appends a tool_use/tool_result pair per edit, and a heavy pass blew past 20 blocks in one turn, so the next request’s breakpoint could not see the previous entry and missed. The fix was an intermediate breakpoint every ~15 blocks in long turns. agents/editor.py @ c0d8a41.
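The intermediate-breakpoint idea can be sketched as a pass over a flat list of content blocks. This is not the code in agents/editor.py. It assumes the blocks are plain dicts, that the API's limit of four breakpoints per request leaves three for the conversation once the system block has taken one, and that a spacing of 15 keeps every gap inside the 20-block lookback.

```python
import copy
from typing import List


def add_intermediate_breakpoints(
    blocks: List[dict], every: int = 15, max_breakpoints: int = 3
) -> List[dict]:
    """Mark the last block, then every `every`-th block walking backwards."""
    marked = copy.deepcopy(blocks)  # leave the caller's list untouched
    index = len(marked) - 1
    placed = 0
    while index >= 0 and placed < max_breakpoints:
        marked[index]["cache_control"] = {"type": "ephemeral"}
        placed += 1
        index -= every
    return marked


# Hypothetical heavy editing pass: 40 blocks accumulated in one turn.
blocks = [{"type": "text", "text": f"block {i}"} for i in range(40)]
marked = add_intermediate_breakpoints(blocks)
positions = [i for i, block in enumerate(marked) if "cache_control" in block]
print(positions)
```

With only three breakpoints to spend, a pass much longer than 45 blocks still leaves an uncovered head, so the spacing and the budget have to be chosen together.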

What it caught and what I changed

The deeper problem was not the date line. It was that nothing told me the date line existed. The read counter sat at zero for 11 days and the only signal was a cost curve I happened to open.

So the counter is a monitored metric now, not a thing I notice:

  • a sentinel asserts cache_read_input_tokens > 0 on every job after the first in a window, and pages if a property’s hit rate drops below 0.9
  • the scheduler pre-warms each persona’s cache at the top of its window with a max_tokens: 0 request, so the first real job reads instead of writes
  • the system prompt is frozen by contract - date, topic, and per-run state are injected after the breakpoint, enforced by a test that hashes the cached block across two synthetic calls and fails if the bytes differ
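A rough shape for the sentinel and the frozen-prefix test, under stated assumptions: usage records are plain dicts with the API's counter names, "hit rate" means the share of jobs after the first in a window that read from cache, and the builder functions are hypothetical stand-ins for the real prompt assembly.

```python
import hashlib
import json
from datetime import date
from typing import Callable, List, Mapping, Sequence


def hit_rate(usages: Sequence[Mapping[str, int]]) -> float:
    """Share of jobs after the first in a window that read from cache."""
    later = list(usages)[1:]  # the first job in a window is allowed to write
    if not later:
        return 1.0
    hits = sum(1 for u in later if u.get("cache_read_input_tokens", 0) > 0)
    return hits / len(later)


def should_page(usages: Sequence[Mapping[str, int]], threshold: float = 0.9) -> bool:
    return hit_rate(usages) < threshold


def cached_block_digest(system_blocks: List[dict]) -> str:
    """Digest of the cached block, serialized deterministically."""
    canonical = json.dumps(system_blocks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def prefix_is_frozen(build_system: Callable[[date], List[dict]]) -> bool:
    """Build the system block for two different days and compare the bytes."""
    first = cached_block_digest(build_system(date(2025, 1, 1)))
    second = cached_block_digest(build_system(date(2025, 1, 2)))
    return first == second


def frozen_builder(today: date) -> List[dict]:
    return [{"type": "text", "text": "You are Example Persona.",
             "cache_control": {"type": "ephemeral"}}]


def drifting_builder(today: date) -> List[dict]:
    return [{"type": "text",
             "text": f"You are Example Persona.\nCurrent date: {today.isoformat()}",
             "cache_control": {"type": "ephemeral"}}]


write = {"cache_creation_input_tokens": 6800, "cache_read_input_tokens": 0}
read = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 6800}

healthy_window = [write, read, read, read]
broken_window = [write, write, write, write]
print(should_page(healthy_window), should_page(broken_window))
print(prefix_is_frozen(frozen_builder), prefix_is_frozen(drifting_builder))
```

The digest check runs in CI with no API call, so it catches a reintroduced date line before it costs anything. The sentinel catches whatever the test did not anticipate.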

Open follow-ups:

  • move Haiku-eligible jobs (tag generation, slug cleanup, social rephrasing) off Opus 4.8 to Haiku 4.5 at $1/$5
  • raise the cache TTL to 1h on the three properties whose windows have gaps longer than 5 minutes, and eat the 2× write to keep the entry alive
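Whether the 1h TTL is worth the 2× write can be estimated before shipping it. The sketch below uses the multipliers from the text and expresses cost in units of one uncached send of the prefix. It assumes evenly spaced jobs, that each read refreshes the entry, and that the longer TTL is requested with `{"type": "ephemeral", "ttl": "1h"}` on the cached block. The function name is hypothetical.

```python
CACHE_WRITE_5M = 1.25  # default TTL write multiplier (from the text)
CACHE_WRITE_1H = 2.00  # 1h TTL write multiplier (from the text)
CACHE_READ = 0.10


def relative_prefix_cost(jobs: int, gap_minutes: float,
                         ttl_minutes: float, write_mult: float) -> float:
    """Cost of the cached prefix across a window, in uncached-send units."""
    if jobs <= 0:
        return 0.0
    if gap_minutes < ttl_minutes:
        # Entry survives between jobs: one write, then reads.
        return write_mult + (jobs - 1) * CACHE_READ
    # Entry expires between jobs: every job pays the write premium again.
    return jobs * write_mult


# Hypothetical window: 6 jobs spaced 10 minutes apart.
cost_5m = relative_prefix_cost(6, gap_minutes=10, ttl_minutes=5, write_mult=CACHE_WRITE_5M)
cost_1h = relative_prefix_cost(6, gap_minutes=10, ttl_minutes=60, write_mult=CACHE_WRITE_1H)
print(cost_5m, cost_1h)
```

Under these assumptions the longer TTL pays for itself from the second job in the window, which is why the gap length, not the job count, is the number to measure first.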

I still do not control the price of RAM, and the post that started this is probably right that cheap memory would change the calculus on running models locally. It does not change today’s invoice. The invoice is a function of how many tokens you send at full rate, and that is a prefix-match problem and a latency-tolerance problem. Both fit in a commit.

