Codex telemetry shows GPT-5.5 reasoning tokens pinning to 516, 1034, 1552
Original source: GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance (Hacker News)
A GitHub issue against OpenAI’s Codex reports a statistical anomaly in reasoning-token telemetry: GPT-5.5 responses land on exactly 516 reasoning output tokens far more often than chance would predict, with secondary pileups near 1034 and 1552. The evenly spaced boundaries look like repeated thresholds rather than the smooth, task-dependent distribution you’d expect if the model simply reasoned until it was done. The filer draws on aggregate token_count metadata spanning February through June 2026, extending an earlier task-level report (#29353) where a run that stopped at exactly 516 tokens produced the wrong answer.
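The "evenly spaced" observation can be checked with simple arithmetic. The short Python sketch below is not taken from the issue; it only confirms that the three reported values are 518 tokens apart and notes that each one equals a multiple of 518 minus 2. What that spacing means, if anything, is not established by the source.

```python
# Boundary values reported in the issue summary.
boundaries = [516, 1034, 1552]

# Differences between consecutive boundaries.
gaps = [b - a for a, b in zip(boundaries, boundaries[1:])]
print(gaps)  # [518, 518]

# Each boundary equals k * 518 - 2 for k = 1, 2, 3.
# This is an arithmetic observation only, not a claim about the cause.
step = gaps[0]
fits = [b == k * step - 2 for k, b in enumerate(boundaries, start=1)]
print(fits)  # [True, True, True]
```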
The evidence is concentrated rather than diffuse. GPT-5.5 makes up only about 19% of sampled responses but accounts for 82% of the exact-516 events. Its ratio of exact-516 responses to all responses that reach 516 or more reasoning tokens is roughly 34x higher than for other models. Crucially, this isn’t a case of the model spending more effort overall: mean and P90 reasoning intensity actually fell from the February–April window into May–June, even as the clustering spiked. That combination is what makes the pattern suspicious: less reasoning on average, but with responses repeatedly snapping to the same fixed cutoffs.
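The three figures in this paragraph (share of responses, share of exact-516 events, and the exact-to-reached ratio) are all simple proportions. The following is a minimal Python sketch of how they could be computed, assuming telemetry is available as (model, reasoning token count) pairs. The records are made-up illustration data, the model labels are hypothetical strings, and `concentration` is a hypothetical helper; none of this reproduces the filer's actual dataset or queries.

```python
# Hypothetical records: (model label, reasoning output tokens).
# Made-up data for illustration; not the telemetry from the issue.
records = [
    ("gpt-5.5", 516), ("gpt-5.5", 516), ("gpt-5.5", 516), ("gpt-5.5", 900),
    ("gpt-5.4", 516), ("gpt-5.4", 700), ("gpt-5.4", 1200), ("gpt-5.4", 300),
    ("gpt-5.2", 640), ("gpt-5.2", 2048), ("gpt-5.2", 120), ("gpt-5.2", 80),
]


def concentration(records, models, value=516):
    """Summarize how strongly responses from `models` pile up on `value`."""
    rows = [t for m, t in records if m in models]
    exact_all = sum(1 for _, t in records if t == value)
    exact = sum(1 for t in rows if t == value)
    reached = sum(1 for t in rows if t >= value)
    return {
        # Fraction of all sampled responses that come from `models`.
        "response_share": len(rows) / len(records),
        # Fraction of all exact-`value` events that come from `models`.
        "exact_share": exact / exact_all if exact_all else 0.0,
        # Exact-`value` responses divided by responses reaching `value` or more.
        "exact_to_reached": exact / reached if reached else 0.0,
    }


target = concentration(records, {"gpt-5.5"})
others = concentration(records, {"gpt-5.4", "gpt-5.2"})

# How many times higher the target's exact-to-reached ratio is than the others'.
baseline = others["exact_to_reached"]
lift = target["exact_to_reached"] / baseline if baseline else float("inf")

print(target)
print(others)
print(f"lift: {lift:.2f}x")
```

Dividing by the number of runs that reached the value, instead of by all runs, keeps the comparison fair to models that rarely reason that long in the first place.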
The author is careful not to overclaim and explicitly declines to assert hidden chain-of-thought truncation. Instead, the issue asks whether some reasoning-budget cap, routing rule, fallback tier, or scheduler behavior is terminating GPT-5.5 runs at these boundaries. The significance for engineers leaning on Codex for complex work is direct: if the model is quietly hitting a budget ceiling on hard tasks, degraded output may be a systematic artifact rather than random variance. The issue closes with concrete validation queries that OpenAI could run to confirm or rule out a thresholding effect: per-model exact-value counts and matched-task replays against GPT-5.2 and GPT-5.4.
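A per-model exact-value count of the kind the issue proposes might look like the Python sketch below. It assumes the same hypothetical (model, reasoning token count) record shape and uses made-up data; the function names are illustrative and are not the queries from the issue. The second helper compares an exact value with its immediate neighbors, since a threshold would show up as a spike at one value, not a broad bump.

```python
from collections import Counter, defaultdict

# Hypothetical records: (model label, reasoning output tokens). Made-up data.
records = [
    ("gpt-5.5", 516), ("gpt-5.5", 516), ("gpt-5.5", 516),
    ("gpt-5.5", 515), ("gpt-5.5", 517),
    ("gpt-5.5", 1034), ("gpt-5.5", 1034), ("gpt-5.5", 1552),
    ("gpt-5.4", 516), ("gpt-5.4", 1033), ("gpt-5.4", 800),
]


def exact_value_counts(records, values=(516, 1034, 1552)):
    """Count, per model, how many responses land exactly on each value."""
    counts = defaultdict(Counter)
    for model, tokens in records:
        if tokens in values:
            counts[model][tokens] += 1
    return {model: dict(c) for model, c in counts.items()}


def neighborhood(records, model, value, radius=1):
    """Counts for value - radius .. value + radius, for one model."""
    c = Counter(t for m, t in records if m == model)
    return {v: c[v] for v in range(value - radius, value + radius + 1)}


per_model = exact_value_counts(records)
around_516 = neighborhood(records, "gpt-5.5", 516)

print(per_model)
print(around_516)
```

Counts like these only show where responses cluster. Whether stopping at a boundary actually hurts answers is the question the matched-task replays are meant to address.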
This is an AI-generated summary. Read the original for the full story.