RANDOM CHAOS

Fast enough to lie

Package managers hang for minutes because they execute on a returned value, never measuring the network latency their design assumed would stay constant.

A package manager resolving a dependency makes a short sequence of network calls and blocks until they return. It queries a resolver for an address, opens a TCP connection, negotiates TLS, and issues an HTTP request to a registry such as npm or PyPI. None of these operations reports progress. The tool renders no bar, no spinner, no elapsed counter, because it was built on the premise that the whole sequence finishes faster than a person can perceive the pause between the command and the result.
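The sequence described above can be sketched with Python's standard library. This is a minimal illustration, not the code of any real package manager: `fetch_metadata` and `build_request` are hypothetical names, and the host and path passed in are whatever the caller supplies. The point is what the sketch lacks: no step records how long it took, and no step reports progress while it waits.

```python
import socket
import ssl


def build_request(host: str, path: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request (illustrative, not a full client)."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Accept: application/json\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def fetch_metadata(host: str, path: str, port: int = 443) -> bytes:
    """Hypothetical helper: resolve, connect, negotiate TLS, request, read.

    Every step blocks until it returns. Nothing here measures elapsed
    time, and the caller receives only the bytes that came back.
    """
    # 1. DNS: blocks until the resolver answers.
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]

    # 2. TCP: blocks through the three-way handshake. No timeout is set,
    #    so the wait is bounded only by the operating system's defaults.
    raw = socket.socket(family, socktype, proto)
    raw.connect(sockaddr)

    # 3. TLS: wrapping a connected socket blocks until the handshake completes.
    context = ssl.create_default_context()
    with context.wrap_socket(raw, server_hostname=host) as tls:
        # 4. HTTP: send the request, then block on each read until EOF.
        tls.sendall(build_request(host, path))
        chunks = []
        while True:
            chunk = tls.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

A caller would write something like `fetch_metadata("registry.example", "/some-package")`, where the host is a placeholder. Whether that call takes forty milliseconds or four minutes, the return value is the same kind of object.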

The absence of a progress indicator is not an oversight. It is a design statement. A progress bar exists to make waiting tolerable, and the tool’s authors concluded there would be nothing to wait for. A DNS lookup returns in milliseconds. A TCP three-way handshake to a nearby endpoint completes in a single round trip. The retrieval was fast enough that instrumenting it would have cost more than it returned. So the tool treats the network call the way it treats a local function: something that either comes back at once or, in the rare failure, times out.

What the tool actually observes is not the network. It observes the return of a value. The distinction stays invisible while the two coincide. As long as the call resolves quickly, the tool’s model of retrieval and the real behavior of the network are indistinguishable. The tool does not know how long the traversal took. It knows only that the value arrived, and it was written to assume that arrival is prompt. The proxy, a returned value, stands in for the reality, which is a passage across contended, shared infrastructure. While the passage is fast, the substitution costs nothing. That is the condition the tool was designed inside.

The assumption was that network latency is a constant. Not literally fixed, but stable enough to treat as a property of the environment rather than a variable of the moment. The round-trip time observed when the tool was written was small, and the tool encoded that observation into its structure: synchronous blocking calls, connection timeouts measured in tens of seconds or left unbounded, retry logic that treats the fast path as the normal path. Latency was never modeled as a distribution. It was modeled as a number that happened to be low.
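The difference between a number and a distribution is easy to show with arithmetic. The sketch below uses invented round-trip samples (they are not measurements of any real registry) and a nearest-rank percentile. The names `samples_ms`, `percentile`, and `summarize` are assumptions made for the example.

```python
import math

# Hypothetical round-trip times in milliseconds, invented for illustration:
# mostly fast, with a long tail.
samples_ms = [20.0] * 90 + [40.0] * 7 + [3_000.0, 45_000.0, 120_000.0]


def percentile(samples, p):
    """Nearest-rank percentile for 0 < p <= 100."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Describe latency as several points on a distribution, not one number."""
    points = (("p50", 50), ("p95", 95), ("p99", 99), ("max", 100))
    return {name: percentile(samples, p) for name, p in points}


summary = summarize(samples_ms)
```

On these invented samples the median is 20 ms and the 95th percentile is 40 ms, while the 99th percentile is 45 seconds and the maximum is two minutes. A tool that encoded only the median would describe ninety of a hundred calls exactly and say nothing about the ones a person would actually notice.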

That assumption was reasonable when it was made, and it was cheap to make. The infrastructure the tool leans on, the resolvers and the registries and the transit paths between them, was provisioned for the load it then carried. TCP encodes the same optimism. Under RFC 9293, a connection attempt that receives no acknowledgment retransmits its SYN and doubles the wait each time, so the first several seconds of a stalled connection produce no signal at all. The protocol assumes silence is rare and brief. The tool inherits that assumption and stacks its own on top: that the registry answers, that the answer is quick, and that quickness is the steady state.
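The doubling can be worked out directly. The sketch below assumes an initial retransmission timeout of 1 second, the value RFC 6298 recommends, and 6 SYN retries, which is the documented Linux default for `net.ipv4.tcp_syn_retries`. Other systems use other values, so treat both parameters as assumptions. The function names are hypothetical.

```python
def syn_send_times(initial_rto: float = 1.0, retries: int = 6) -> list:
    """Seconds after the first SYN at which each SYN (original plus retries) is sent.

    Assumes the wait doubles after every unanswered SYN.
    """
    times = [0.0]
    rto = initial_rto
    for _ in range(retries):
        times.append(times[-1] + rto)
        rto *= 2
    return times


def give_up_after(initial_rto: float = 1.0, retries: int = 6) -> float:
    """Seconds until the connection attempt is abandoned.

    The stack waits one more doubled interval after the final retry.
    """
    last_send = syn_send_times(initial_rto, retries)[-1]
    final_wait = initial_rto * 2 ** retries
    return last_send + final_wait
```

Under these assumptions the SYNs go out at 0, 1, 3, 7, 15, 31, and 63 seconds, and the attempt is abandoned at 127 seconds. For more than two minutes a blocked `connect()` has nothing to report to the tool that called it.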

The trust here is not trust in a party. It is trust in a condition. The tool trusts that the environment it runs in resembles the environment it was tested in. That trust is delegated downward to every layer beneath it, to the socket, to the kernel’s network stack, to the resolver, to the path, and none of those layers is ever asked to confirm the condition still holds. The assumption of constant latency is transferable across time and portable across machines precisely because nothing revalidates it. It is written into defaults, and defaults are the assumptions a system has stopped questioning.

What changed was not the tool and not the people running it. What changed was the validity of the assumption. The scale and density of distributed computation rose, and the network the tool depends on now carries load it was never provisioned against. AI-era workloads move large artifacts, poll continuously, and fan out across the same shared resolvers, registries, and transit paths the package manager quietly assumed were idle enough to answer instantly. The medium is unchanged. The contention on it is not.

The latency distribution shifted, and its tail grew long. A call that returned in milliseconds under the old load now sometimes returns in seconds, and sometimes in minutes, because the endpoint is saturated or the path to it is congested. This is not failure in the sense the tool understands failure. The connection is not refused. The name resolves. The TLS handshake completes. The value eventually arrives. Every layer reports success. The only thing that moved is the time each layer took, and time is the single variable the tool declined to represent.

The system did not re-evaluate the condition it was built on. It could not, because it never held that condition as something to evaluate. It inherited the assumption of constant latency from the state of the network at design time and carried it forward, unaltered, into a state where the assumption is false. The tool still blocks without progress because it still believes there is nothing to wait for. That belief is now wrong, and the tool has no mechanism to notice it is wrong. The assumption no longer holds, and the system goes on behaving as though it does.

The tool blocks on a system call and waits for it to return. What it treats as the signal of success is the return itself. When connect() returns without error, when the resolver hands back an address, when the registry hands back a response body, the tool reads each return as confirmation that the operation completed as expected. It never reads the elapsed time, because elapsed time was never part of what it checks. The return of a value is the reference. The passage across contended infrastructure that produced the value is the reality. The tool executes on the reference and never inspects the reality behind it.

This is not a bypass. Nothing defeated the tool’s logic. The tool did precisely what it was built to do: issue a blocking call and proceed when the call returns. A blocking call, by construction, has one observable outcome that matters. It returns, or it does not. It has no observable field for “returned slowly.” The socket does not hand back a measure of contention. The HTTP 200 does not carry the duration of its own retrieval. Integrity of content, the response body, is intact. What is absent is any validation of the condition under which that content arrived. Identity of the returned value stands in for the state of the environment that delivered it, and the substitution is invisible from inside the tool because the tool holds no representation of that state.
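To make the missing field concrete, here is a minimal sketch of a wrapper that returns the value together with the time it took. `Timed`, `timed`, and `returned_slowly` are hypothetical names, and the budget passed to `returned_slowly` is an arbitrary threshold chosen by the caller, not a recommended figure.

```python
import time
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Timed(Generic[T]):
    """A returned value plus the one quantity a bare return omits."""

    value: T
    elapsed: float  # seconds between issuing the call and its return

    def returned_slowly(self, budget: float) -> bool:
        """True when the call succeeded but took longer than the budget."""
        return self.elapsed > budget


def timed(call: Callable[[], T],
          clock: Callable[[], float] = time.monotonic) -> Timed[T]:
    """Run a blocking call and record how long it blocked.

    time.monotonic is used by default because it is unaffected by
    wall-clock adjustments. The clock is injectable for testing.
    """
    start = clock()
    value = call()
    return Timed(value, clock() - start)
```

Wrapping a blocking call as `timed(lambda: sock.connect(addr))` leaves the call itself unchanged. The only difference is that a slow success and a fast success stop being the same event.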

So the delay is not the tool failing to detect a problem. The delay is the tool detecting nothing, because nothing in its model corresponds to duration. The connection timeout, often left at the operating system default of tens of seconds or more, is the only latency-shaped value the tool holds, and that value is a reference to a past network rather than a measurement of the present one. It fires late, or not at all, because it was calibrated against a condition that no longer describes the medium. The tool hangs for minutes and reports success at the end, because from inside its own logic a slow success and a fast success are the same event. It executed expected behavior. Expected behavior is now the failure.
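For contrast, a timeout can be derived from recent observations rather than fixed at design time. The sketch below borrows the smoothing that RFC 6298 specifies for TCP's own retransmission timer (gains of 1/8 and 1/4, and four times the variation) and applies it at the application level. That transplant is an assumption made for illustration, not something the tools discussed here do. The class name and the `floor` parameter are hypothetical.

```python
from typing import Optional


class LatencyEstimator:
    """Smoothed latency estimate using the RFC 6298 update rules."""

    ALPHA = 1 / 8  # gain for the smoothed round-trip time
    BETA = 1 / 4   # gain for the round-trip time variation

    def __init__(self) -> None:
        self.srtt: Optional[float] = None
        self.rttvar = 0.0

    def observe(self, sample: float) -> None:
        """Fold one measured duration (in seconds) into the estimate."""
        if self.srtt is None:
            # First measurement seeds both values.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # Variation is updated first, using the previous smoothed value.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - sample))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample

    def timeout(self, floor: float = 1.0) -> float:
        """Timeout implied by current observations, never below the floor."""
        if self.srtt is None:
            return floor
        return max(floor, self.srtt + 4 * self.rttvar)
```

Each call to `observe` moves the timeout toward what the network is doing now. A fixed default cannot do this, because it has no input.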

The pattern is execution based on reference, not verification. A system resolves a condition once, records the result, and thereafter acts on the record rather than on the condition. The record is cheaper to consult than the condition is to re-establish, and while the two agree, consulting the record costs nothing and returns everything. The gap opens silently, in the interval between when the reference was written and when it is read, because nothing in the system marks that interval as a place where truth can change.
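The pattern is small enough to write down. In the sketch below, `Reference`, `read_blindly`, and `read_verified` are hypothetical names, and `max_age` is an arbitrary policy value. The first reader acts on the record no matter how old it is. The second treats the interval since the record was written as a place where truth can change.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Reference(Generic[T]):
    """A condition resolved once, with the time it was recorded."""

    value: T
    recorded_at: float  # seconds on whatever clock the caller uses

    def age(self, now: float) -> float:
        return now - self.recorded_at


def read_blindly(ref: Reference[T]) -> T:
    """Execution on reference: the record is consulted, never its age."""
    return ref.value


def read_verified(ref: Reference[T], verify: Callable[[], T],
                  max_age: float, now: float) -> Reference[T]:
    """Re-establish the condition when the record is older than max_age."""
    if ref.age(now) <= max_age:
        return ref
    return Reference(verify(), now)
```

The second reader costs a call to `verify` whenever the record has gone stale. That cost is exactly what the first reader was written to avoid.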

Consider certificate revocation. Under RFC 6960, a TLS client presented with an X.509 certificate is meant to check whether that certificate has been revoked by querying an OCSP responder. The signature on the certificate is a reference. It attests that a certificate authority, at some past moment, vouched for the binding between a key and a name. Revocation status is the present condition. When the OCSP responder is unreachable, most clients proceed anyway, a behavior known as soft-fail. The handshake completes. The connection is established on the strength of the signature, the reference, while the condition that signature was supposed to stand behind, that the certificate remains valid now, went unverified. The client executed on identity of source. It did not verify integrity of the present state.
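The soft-fail decision reduces to a few branches. The sketch below models only the policy, not a real TLS stack: `Revocation` and `should_proceed` are hypothetical names, and `UNREACHABLE` is not an OCSP response status but a stand-in for receiving no response at all.

```python
from enum import Enum


class Revocation(Enum):
    GOOD = "good"                # responder says the certificate is not revoked
    REVOKED = "revoked"          # responder says it is revoked
    UNREACHABLE = "unreachable"  # no answer arrived (assumed label)


def should_proceed(signature_valid: bool, status: Revocation,
                   hard_fail: bool = False) -> bool:
    """Decide whether to complete the handshake.

    With hard_fail=False (soft-fail), an unreachable responder is treated
    the same as a good answer: the signature alone carries the decision.
    """
    if not signature_valid:
        return False
    if status is Revocation.REVOKED:
        return False
    if status is Revocation.GOOD:
        return True
    # No answer: the present condition was never verified.
    return not hard_fail
```

Under soft-fail, `should_proceed(True, Revocation.UNREACHABLE)` returns `True`. The branch that matters is the last one, where a missing verification and a passed verification produce the same outcome.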

The two systems share no code and solve no common problem, yet they fail in the same shape. The package manager executes on the returned value and never measures the latency behind it. The TLS client executes on the signature and never confirms the revocation state behind it. Both hold a reference that was true when it was written. Both read that reference in a moment when it may no longer be true. Neither has a mechanism to notice the difference, because the reference was adopted precisely to avoid paying for verification, and a system does not audit the shortcut it was built to take.

The tool resolved latency once, at the moment it was designed. It does not measure it again. The value returns, the tool proceeds, and the minutes it waited leave no mark on anything it records.

The network did not break. The registry answered. Every layer reported success, and every layer was telling the truth. The only thing that moved was the one quantity the system chose not to represent.

The assumption of a constant environment is still there, written into the defaults. The control exists. The outcome does not.
