RC RANDOM CHAOS

The Open Courts Act exposes what PACER fees hid

PACER's per-page fee was an accidental privacy brake. Making court records free is right - but only if redaction, governed bulk access, and security replace it.


PACER charges ten cents a page, capped at three dollars per document, and waives the bill if you run up thirty dollars or less in a quarter. Those numbers govern the public’s entire access to United States federal court records. The fee was set to recover operating costs. It has quietly done something else: it has priced bulk collection out of reach for almost everyone. A scraper that wants every civil docket in the country hits a meter that turns a weekend project into a six- or seven-figure invoice. The meter is the brake.
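The meter is simple enough to write down. What follows is a minimal sketch, assuming only the figures quoted above; the function names are hypothetical, the waiver is treated as applying at or under the threshold, and the half-million-document collection is an invented size chosen purely to show the scale effect.

```python
from decimal import Decimal
from typing import Iterable

PER_PAGE = Decimal("0.10")       # fee per page, as quoted above
DOC_CAP = Decimal("3.00")        # per-document cap, as quoted above
WAIVER_LIMIT = Decimal("30.00")  # quarterly waiver threshold (assumed inclusive)


def document_fee(pages: int) -> Decimal:
    """Fee for one document: per-page charge, capped per document."""
    return min(PER_PAGE * pages, DOC_CAP)


def quarterly_bill(page_counts: Iterable[int]) -> Decimal:
    """Total owed for a quarter; zero when the total stays within the waiver."""
    total = sum((document_fee(p) for p in page_counts), Decimal("0"))
    return Decimal("0") if total <= WAIVER_LIMIT else total


# A casual reader: three documents of 5, 12, and 40 pages.
casual = quarterly_bill([5, 12, 40])

# A hypothetical bulk pull: 500,000 documents long enough to hit the cap.
bulk = quarterly_bill([40] * 500_000)
```

The casual reader totals $4.70 and owes nothing after the waiver. The invented bulk pull owes $1,500,000. Same schedule, different world.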

That brake is coming off. The Open Courts Act and the bills that followed it have repeatedly proposed making PACER free. In a federal lawsuit, National Veterans Legal Services Program v. United States, the courts already found that the judiciary charged more than the statute allowed. Free access is the right outcome. Records that cost money to read are not really public, and the people most often blocked by the meter are public defenders, pro se litigants, small-town reporters, and academic researchers - not the people you would worry about.

The problem is what the fee was hiding. Ten cents a page was never a privacy safeguard. It was rent. It happened to sit in front of a system with no real privacy architecture behind it, and as long as the rent was there, nobody had to look at what was missing. Remove the rent and you do not create a new risk. You expose one that has been there the whole time. The records were always sensitive. They were just expensive enough that nobody pulled all of them at once.

The useful way to think about this is to stop calling it a records system and start calling it what it becomes the moment access is free and bulk: a dataset. Tens of millions of documents, generated by a known process, in semi-structured formats, describing real people by name, tied to real outcomes. That is not a filing cabinet. That is a corpus. And corpora get trained on, queried, linked, and abused according to a completely different set of rules than the ones courts wrote for paper.

What is actually in the file

Start with the contents, because most of the debate skips this part. A court file is not a tidy summary of a verdict. It is the full paper trail: complaints, motions, exhibits, deposition excerpts, financial affidavits, medical records entered into evidence, bank statements, tax returns, immigration documents, the names of witnesses and minor children, and the home addresses of people who never chose to be in the system at all.

Federal Rule of Civil Procedure 5.2, with its siblings in the criminal and bankruptcy rules, is supposed to manage this. It says filers must redact Social Security numbers down to the last four digits, give only the year of a birth date, use initials for minors, and show only the last four digits of financial accounts. Read that rule carefully and you notice who is responsible for the redaction: the filer. Not the clerk, not the court, not the platform. The lawyer or the self-represented party is expected to scrub the document before it goes in.

That assumption fails constantly. Exhibits get filed as PDFs where the redaction is a black box drawn on top of selectable text that is still sitting underneath. Pro se litigants who have never heard of Rule 5.2 attach their full tax return because the judge asked for proof of income. Bankruptcy schedules - which by design require a near-complete inventory of someone’s financial life - get filed with account numbers intact. A divorce file can contain a custody evaluation describing a child’s psychiatric history by name. A personal injury case can contain an MRI report.

None of this is hypothetical or rare. Researchers who have sampled PACER have repeatedly found unredacted Social Security numbers sitting in documents that have been public for a decade. The data is there. The only thing that has kept it from being harvested in bulk is that harvesting cost money. When a single unredacted SSN costs you ten cents to find and you do not know which of forty million documents contains it, you do not go looking. When the whole corpus is free and a model can read every page in an afternoon, the economics invert completely.

Practical obscurity, and why it dies at scale

The law already has a name for the thing the fee was protecting, and it is worth knowing because it explains the entire problem. In 1989, in Department of Justice v. Reporters Committee for Freedom of the Press, the Supreme Court considered whether the FBI had to release a compiled criminal rap sheet under the Freedom of Information Act. Every individual entry on that sheet was, somewhere, a public record. The arrest was public. The conviction was public. The Court still found a privacy interest in the compilation, and it gave the idea a name: practical obscurity. Information that is technically public but scattered, hard to find, and expensive to assemble enjoys a practical privacy that the same information loses once it is aggregated and indexed.

That doctrine maps onto court records with uncomfortable precision. A single docket sitting behind a meter, retrievable one document at a time by someone who already knows the case number, is practically obscure. The same docket inside a free, fully indexed, machine-readable bulk dataset is not obscure at all. Nothing about the underlying record changed. The friction changed, and the friction was carrying the privacy.

This is the part that trips up the transparency argument. People hear “privacy interest in public records” and assume someone is trying to hide wrongdoing or seal proceedings. That is a different fight. Practical obscurity is not about whether the record is public. It is about the difference between a record being available and a record being aggregated. You can be completely in favor of open courts and still recognize that one person reading one file is a different event than a model ingesting every file.

Bulk access destroys practical obscurity by definition. That is the point of bulk access - it is what makes it useful for research, for journalism, for accountability, and for everything good people want out of free court records. The same property that makes the corpus valuable to a civil rights researcher makes it valuable to anyone running a population-scale linkage attack. You cannot deliver one without delivering the other unless you change the data itself.

The fee was never a safeguard

It is tempting, having said all that, to conclude that the fee should stay. That conclusion is wrong, and seeing why it is wrong is the whole argument.

A paywall is a terrible privacy control because it filters by budget, not by intent. PACER’s ten cents a page stops a journalist on a nonprofit salary and a law student and a man trying to read the docket of his own eviction. It does not stop a data broker, a foreign intelligence service, a fraud operation with a balance sheet, or any organization that views a few hundred thousand dollars as a rounding error against the value of the data. The fee is regressive in the exact direction you would not want a security control to be regressive: it is strongest against the harmless and weakest against the funded.

It also generated real money - on the order of a hundred and forty-five million dollars a year at its peak - which created an institutional incentive to keep it. The NVLSP litigation pried at that, finding the judiciary had spent fee revenue on things the statute did not authorize. Once a court tells you the fee exceeded its lawful basis, defending it as a privacy feature becomes untenable. It was rent the whole time.

So the honest position is not “keep the meter.” It is “the meter was load-bearing for a structure nobody admits exists, and before you remove it you have to build the structure on purpose.” Free access is correct. Free access without a redaction pipeline, an access-logging regime, and a real infrastructure-security posture just means you have removed the last accidental brake from a system that never had real ones.

This is the systems-analysis point underneath the whole topic. The fee was doing a job it was never assigned. When you remove a component, you do not just lose the function it was supposed to perform. You lose every function it was performing by accident. Good engineering finds those hidden dependencies before it cuts the wire, not after.

What the corpus buys an attacker

The phrase “adversarial AI training” gets used loosely, so be concrete about what an open court-records corpus actually provides to someone with bad intent. It is not one capability. It is several, and they stack.

First, structure. Court filings follow rigid templates. A model trained on millions of complaints, motions, summonses, and judgments learns the exact form of a valid filing in a given district - the captions, the boilerplate, the service language, the judge-specific local conventions. That is the raw material for generating documents that look authentic to a clerk and to an automated intake system. Forged or fraudulent filings are an old problem; a model that has read every real filing in a jurisdiction makes producing convincing fakes cheap and scalable.

Second, prediction. Legal analytics is already a commercial industry. Firms like Lex Machina, Premonition, and Pre/Dicta sell outcome prediction built on exactly this data - how a given judge rules, how long a case type takes, which arguments succeed before whom. None of that is inherently malicious; litigants have always wanted to read the room. But the same model that helps a plaintiff choose a venue helps a bad actor identify which judges, which case types, and which procedural moves are most exploitable, and where the system bends under volume.

Third, procedure. A model that has ingested the full lifecycle of millions of cases learns where the seams are: which motions reliably trigger delay, which deadlines cascade, which kinds of filings tie up a docket. Litigation has always had people who weaponize procedure. Automating the discovery of procedural pressure points is a capability uplift, and it is one the open corpus hands over for free.

Fourth, and most underrated, targeting data about people. This is the one that matters most to ordinary readers, and it deserves its own section.

Re-identification is the real exposure

The most durable harm from a free, bulk court-records corpus is not exotic. It is linkage. You take the named, dated, located information in court files and you join it against everything else that has leaked, been sold, or been published. The court record supplies the high-value fields - the legal event, the financial detail, the family relationship - and the other datasets supply the contact information and the confirmation.

The math here is brutal and well established. Latanya Sweeney’s classic work showed that a large majority of the United States population - her famous estimate was around 87 percent - can be uniquely identified by just three fields: five-digit ZIP code, gender, and date of birth. Court records routinely contain all three. Rule 5.2 trims the birth date to the year, but that does not save you once a record carries enough other anchors. De-anonymization is not a question of whether the record names someone. It is a question of how few fields it takes to pin a record to a real human, and the answer is: very few.
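The uniqueness argument can be made concrete with a toy count. This is a sketch, not Sweeney’s method or data: the six records are invented, and `unique_fraction` is a hypothetical helper that measures what share of rows are the only row carrying their particular combination of fields.

```python
from collections import Counter

# Invented toy records: (zip, gender, date_of_birth). Not real people.
RECORDS = [
    ("60614", "F", "1984-03-12"),
    ("60614", "F", "1984-11-02"),
    ("60614", "M", "1984-07-30"),
    ("60614", "M", "1971-01-15"),
    ("73301", "F", "1990-05-09"),
    ("73301", "F", "1990-05-21"),
]


def unique_fraction(rows, key):
    """Share of rows whose key(row) value is held by no other row."""
    counts = Counter(key(r) for r in rows)
    return sum(1 for r in rows if counts[key(r)] == 1) / len(rows)


def full_key(row):
    """ZIP + gender + full date of birth."""
    return row


def year_only_key(row):
    """Same fields, with the birth date coarsened to the year."""
    zip_code, gender, dob = row
    return (zip_code, gender, dob[:4])


full = unique_fraction(RECORDS, full_key)
coarse = unique_fraction(RECORDS, year_only_key)
```

In this toy set every row is unique on the full triple, and coarsening the birth date to the year leaves two of six unique. Add one more field to the key, a street name or a filing date, and the uniqueness climbs back, which is why a year-only birth date is a weak defense on its own.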

Now apply that at corpus scale with a model doing the joining. A bankruptcy filing tells you someone is in acute financial distress, lists their creditors, and often their address. A divorce file tells you a household is splitting, who has the children, and who controls the money. A civil suit tells you someone just received or is about to receive a settlement. A criminal docket tells you someone has a record, a court date, and a likely level of desperation. Each of these is a targeting signal. Together, indexed and linked, they are a map of who is vulnerable, why, and how to reach them.

The court system did not consent to building that map, and no individual filer did either. It emerges from aggregation. This is why practical obscurity was doing real work - not because any single fact was secret, but because assembling the facts into a profile took effort that nobody would spend on an ordinary person. Free bulk access plus a capable model removes the effort. The profile assembles itself.

Social engineering at population scale

The direct consequence of that map is targeted fraud, and the targeting can be tuned with a precision that older scams never had.

Consider what a recently-divorced person looks like in the data: a new filing, a change in marital status, often a change in address, frequently a description of assets in dispute. That person is, predictably, dealing with banks, lawyers, and government forms during a stressful period. A phishing message that references their actual case number, their actual county, and their actual lawyer’s name is not a generic scam. It is a tailored one, and tailored messages convert at rates generic ones never approach. The corpus supplies every element needed to write it.

Run the same logic across the other record types. The newly bankrupt are an ideal audience for fake debt-relief and credit-repair offers, and their filings hand over the creditor list that makes the pitch credible. People with fresh civil settlements are targets for investment fraud timed to the moment they come into money. Litigants in immigration proceedings are vulnerable to notario fraud and threats, and the docket tells a scammer exactly where they are in the process. Witnesses and jurors named in files can be located and pressured. None of this requires breaking into anything. It requires reading public records at a scale that used to be impractical and is about to be free.

The defense that this data was always public misses the operational reality. Scams scale with targeting quality and targeting cost. For decades the cost of building a precise target list from court records was high enough that fraud stayed broad and dumb - the spray-and-pray email everyone learned to ignore. Drop the cost to near zero and let a model do the personalization, and fraud becomes narrow and convincing. The harm is not that the information exists. The harm is that it can now be operationalized cheaply against millions of named individuals at once.

The France experiment, and why it is the wrong lever

France looked at one slice of this problem and reached for a blunt instrument. In 2019, as part of its justice reform law, France made it a crime to reuse the identity data of judges and court clerks to evaluate, analyze, or predict their professional decisions. The penalties referenced existing data-misuse provisions carrying up to five years of imprisonment. The target was judge analytics - the practice of profiling individual judges to predict how they will rule.

It is a revealing move because it shows a government taking the amplification risk seriously, and it shows the most common wrong answer. France did not change the data. It banned an analysis. The judgments remain available; what became illegal is a particular use of them. That approach has two failure modes that any systems person will recognize immediately.

The first is that banning a use does not remove a capability. The data is still there, the model still trains, and the analysis still happens - it just happens somewhere outside French jurisdiction, by anyone who does not care about French criminal law, which includes exactly the adversaries you were worried about. A control that binds the law-abiding and waves through the hostile is the paywall problem in a different costume.

The second is that it suppresses the legitimate uses along with the illegitimate ones. Profiling judges to harass them is bad. Studying judicial patterns to expose sentencing disparities, racial bias, or inconsistent application of the law is accountability journalism and empirical legal research - the precise public good that free court records are supposed to enable. A use-based ban cannot tell the two apart, because they are the same technique pointed at different ends.

The lesson from France is not that they overreached. It is that they pulled the wrong lever. If you want to govern an amplification risk, you govern the data and the access, not the analysis. You make the corpus safer to release, and you instrument how it is released. You do not try to outlaw math.

The courts cannot secure what they already have

There is a quieter problem underneath the access debate, and it should temper anyone’s confidence that the system is ready for bulk exposure: the courts struggle to secure the data they hold right now, behind the meter.

The federal judiciary’s electronic filing system, CM/ECF, sits behind PACER, which serves as its public front end. In early 2021 the judiciary disclosed that the broader SolarWinds compromise had likely reached its systems and may have exposed sealed records - the most sensitive material the courts hold, including cooperating-witness information and confidential business data. The response was telling: the judiciary moved its most sensitive new filings off the networked system and onto paper or standalone machines. That is a sound emergency measure and also an admission that the digital system could not be trusted to keep secrets. Reporting through 2025 has continued to describe intrusions into the federal case-management infrastructure, which means this is not a closed incident from years ago. It is an ongoing condition.

Sit with what that means for the free-access push. Sealed records are the documents the system is most determined to protect, guarded by court orders and access controls. If those can be reached, the unsealed-but-sensitive mass - the unredacted SSNs, the medical exhibits, the financial affidavits that are technically public and merely obscure - has no comparable defense at all. It is sitting in the open, protected by friction. The infrastructure that would have to safely serve a free bulk corpus is the same infrastructure that has already been breached while serving a metered one.

Releasing data is also a security commitment, not just a transparency one. The moment the corpus is free and bulk, the courts become the custodian of a high-value dataset whose every weakness is now worth exploiting, because the payoff scales with completeness. You do not get to make the data trivially collectible and keep the relaxed security posture that came from assuming nobody would bother collecting all of it.

Redaction has to move to the source

The real control - the one the fee was crudely standing in for - is redaction, and it is broken because it lives in the wrong place. Today redaction is the filer’s job, performed once, by hand, by whoever happens to be submitting the document, with no verification step before publication. That is a control placed at the point of least capability and least accountability.

Move it. Redaction belongs at the platform, applied to every document on the way in, before anything becomes public, and verified rather than trusted. The technology to do this at scale exists and is in production use elsewhere: Social Security numbers, account numbers, and dates of birth follow recognizable patterns that automated detection can catch, and modern entity recognition handles the fuzzier cases like minors’ names and medical identifiers. It is not perfect. It does not have to be perfect to be a vast improvement over a hand-drawn black box laid over live text that anyone can copy out from under it.

The specific failures have known fixes. The black-box-over-live-text problem is solved by flattening and re-OCRing documents so the redaction removes the underlying characters instead of hiding them. The image-scan problem is solved by running detection on the text recovered from the rendered page, not on whatever text layer or metadata the file happens to carry. The pro se problem - the self-represented filer who has never heard of Rule 5.2 - is solved by not relying on that person at all, and instead treating every inbound document as untrusted and scanning it server-side.
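The pattern-matching half of that pipeline is small. Below is a minimal sketch, assuming the text has already been extracted from the flattened, re-OCRed page (that step is out of scope here). The two patterns and the `redact` helper are illustrative assumptions, not any court system’s actual rules; a real detector would need many more formats plus entity recognition for names and medical terms.

```python
import re
from typing import Tuple

# Assumed patterns for illustration only.
# SSN written as 123-45-6789 or 123 45 6789; the last four digits are captured.
SSN = re.compile(r"\b\d{3}[- ]\d{2}[- ](\d{4})\b")
# Any run of 8 to 17 digits is treated as an account number (an assumption).
ACCOUNT = re.compile(r"\b\d{4,13}(\d{4})\b")


def redact(text: str) -> Tuple[str, int]:
    """Return (redacted text, number of replacements).

    Keeps only the last four digits, in the spirit of Rule 5.2, and
    replaces the characters themselves rather than covering them.
    """
    text, n_ssn = SSN.subn(r"XXX-XX-\1", text)
    text, n_acct = ACCOUNT.subn(r"XXXX\1", text)
    return text, n_ssn + n_acct


clean, hits = redact("SSN 123-45-6789, acct 000123456789, Case 1:21-cv-00123")
```

The replacement count matters as much as the output. A server-side scan that finds hits in a document the filer certified as redacted is exactly the verification step the current process lacks.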

This costs money, and that is the connection back to the fee. The cleanest version of free court records is one where the legislation that kills the PACER charge redirects part of what the system used to collect into a redaction and security budget. You are not eliminating the cost of running a safe public-records system. You are moving it off the backs of the public who want to read records and onto the institution that publishes them, which is where it belonged. Free to read, funded to redact.

What free-and-governed actually looks like

The choice is not between expensive-and-obscure and free-and-exposed. There is a design in between, and pieces of it already exist.

Keep retail access free and open. A person looking up a single case, a docket, or a document should pay nothing and clear no hurdle. This is the core promise and it should be unconditional. Retail access does not destroy practical obscurity, because reading one file is not aggregation.

Govern bulk access without forbidding it. The danger lives in the difference between one document and forty million, so put the controls there. Serve bulk through an instrumented API with authentication, rate limits, query logging, and terms of use, so that mass collection is observable, attributable, and revocable rather than anonymous and unlimited. This is ordinary data-platform practice, and it lets researchers and journalists keep their legitimate access while turning silent harvesting into a logged event someone can see. The Free Law Project already demonstrates the cooperative side of this with RECAP and CourtListener, which assemble a free, searchable archive of court documents - proof that open bulk access can be built deliberately rather than as an unmanaged side effect.
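What “observable, attributable, and revocable” means in practice can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not a description of any existing court or CourtListener API: `BulkGate`, its fields, and the one-token-per-request rule are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BulkGate:
    """Per-key token bucket plus an append-only access log (in memory here)."""

    rate: float   # tokens (requests) refilled per second
    burst: float  # maximum bucket size
    buckets: dict = field(default_factory=dict)  # api_key -> (tokens, last_ts)
    revoked: set = field(default_factory=set)    # keys whose access was pulled
    log: list = field(default_factory=list)      # (ts, api_key, query, allowed)

    def allow(self, api_key: str, query: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(api_key, (self.burst, now))
        # Refill for the elapsed time, never beyond the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        ok = api_key not in self.revoked and tokens >= 1.0
        if ok:
            tokens -= 1.0
        self.buckets[api_key] = (tokens, now)
        # Every attempt is logged, allowed or not, so harvesting is visible.
        self.log.append((now, api_key, query, ok))
        return ok


gate = BulkGate(rate=1.0, burst=2.0)
results = [gate.allow("key-1", "bankruptcy 2024", now=t) for t in (0.0, 0.0, 0.0, 1.0)]
gate.revoked.add("key-1")
after_revoke = gate.allow("key-1", "bankruptcy 2024", now=5.0)
```

The first two requests spend the burst, the third is refused, the fourth succeeds after a second of refill, and nothing gets through once the key is revoked. All five attempts land in the log, including the refusals.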

Redact at the source and tier the fields. Apply automated redaction to everything on ingestion, and recognize that not every field needs to travel with every access path. A bulk research feed can expose the legal substance - case type, outcome, timing, legal arguments - while suppressing or coarsening the direct identifiers that make linkage attacks cheap. The substance is what accountability work needs. The identifiers are what re-identification needs. They can be separated.
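Field tiering is a projection: the bulk feed gets a narrower view of each record than the retail path does. A minimal sketch follows, with the caveat that every field name in it is hypothetical and no real court schema is implied.

```python
# Hypothetical field names for illustration; no real court schema is implied.
SUBSTANCE_FIELDS = ("case_type", "outcome", "district", "days_to_disposition")


def bulk_view(record: dict) -> dict:
    """Bulk-feed projection: keep legal substance, coarsen or drop identifiers."""
    view = {k: record[k] for k in SUBSTANCE_FIELDS if k in record}
    if "filed_date" in record:   # ISO date -> year only
        view["filed_year"] = record["filed_date"][:4]
    if "party_zip" in record:    # five-digit ZIP -> three-digit prefix
        view["party_zip3"] = record["party_zip"][:3]
    # Names, street addresses, and anything else not listed are never copied.
    return view


SAMPLE = {
    "case_type": "bankruptcy",
    "outcome": "discharged",
    "district": "N.D. Ill.",
    "days_to_disposition": 112,
    "filed_date": "2023-04-18",
    "party_zip": "60614",
    "party_name": "Jane Example",   # invented
    "party_address": "1 Example St",  # invented
}

view = bulk_view(SAMPLE)
```

The design choice is the allowlist. A field reaches the bulk feed only if someone decided it should, so a new identifier added upstream stays out by default instead of leaking by default.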

Treat the release as a security program, not a publishing event. If the data is going to be free and complete, the custodian’s threat model has to match the new value of the target. That means the breach response the judiciary already had to improvise becomes a standing posture: monitoring, segmentation, and the assumption that the corpus is under continuous collection pressure because it now is.

None of this is exotic. Every element - free retail access, governed bulk APIs, automated redaction, field tiering, access logging - exists in production somewhere. What does not exist is the decision to apply them to court records before the meter comes off rather than after.

Where this nets out

Court records should be free. The paywall is regressive, its legal basis is shaky, and it blocks the people the system most needs to serve while barely inconveniencing the people it should worry about. Defending the fee as a privacy control means defending an accident, and a bad one.

The honest version of that position carries an obligation. The ten-cent meter was load-bearing for a privacy property - practical obscurity - that the courts never built on purpose and have never replaced. Remove the meter without replacing the property and you have not opened the courts. You have published a forty-million-document corpus of named, dated, located, financially detailed information about real people, much of it improperly redacted, served from infrastructure that has already been breached, into a world where a capable model can read all of it in an afternoon and join it against everything else that has leaked.

The corpus is coming either way. The only open question is whether it ships with redaction at the source, governed bulk access, and a security posture that matches its new value - or whether it ships the way the fee left it, exposed and merely cheap to assemble. The bill that makes court records free has to also fund the thing that makes them safe to free. If it does only the first half, it will not have opened the courts. It will have removed the last brake from a system that never had real ones.
