RC RANDOM CHAOS

DELETE leaves the body

DROP TABLE and DELETE in Postgres destroy nothing: MVCC dead tuples, unlinked relfilenodes, and WAL keep the data recoverable, and replicas diverge.

· 7 min read
DELETE leaves the body

DELETE in Postgres removes no rows. The statement walks the live tuples that match the predicate, writes the deleting transaction’s xid into each tuple’s t_xmax header field, and flips the corresponding bit in the visibility map. The bytes stay on the heap page. The row is dead to every new snapshot and physically intact on disk. This is MVCC operating exactly as specified. The conclusion most security reviews skip: neither DELETE nor DROP TABLE is a destruction primitive. Both are bookkeeping. A data-handling model that treats either as erasure is wrong at the storage layer.

Correct the common claim first. DROP TABLE does not rebuild the table and does not rewrite the physical structure. It deletes catalog rows and unlinks files. No heap scan runs. No tuples move. The operations that rewrite a relation in full are VACUUM FULL, CLUSTER, and the rewriting forms of ALTER TABLE - those allocate a new relfilenode and copy live tuples into it. DROP does the opposite. It is fast because it touches almost nothing. The distinction decides the recovery surface, so precision is not optional.

Start with why DELETE does not scale. Each tuple carries a 23-byte header holding t_xmin, t_xmax, t_ctid, two infomask fields, and the null bitmap. A DELETE sets t_xmax to the current xid and emits a WAL record for every tuple touched. The space is not reclaimed at commit. It is reclaimed later, when autovacuum prunes the page, and only reused when a subsequent insert lands on that block. Until an overwrite arrives, the dead tuple is fully readable in the heap file. Bulk DELETE is therefore O(n) tuple marks, O(n) WAL, and unbounded bloat that converts directly into vacuum debt. The operation that clears a table without per-tuple cost is TRUNCATE or DROP, because neither one reads a single tuple. That is the entire basis for the claim that the only delete that scales is the one that abandons the rows wholesale.

The mechanism at the storage layer is the security story. TRUNCATE assigns the relation a fresh relfilenode, points pg_class at the new empty file, and schedules the old segment files for unlink at commit, all under an ACCESS EXCLUSIVE lock. DROP TABLE removes the dependent rows from pg_class, pg_attribute, pg_index, pg_constraint, and pg_depend, then issues smgr unlinks for the relation’s segments along with its indexes and TOAST relation. The on-disk blocks are never zeroed. The filesystem marks the extents free and moves on. On a copy-on-write filesystem or an SSD with wear leveling, the prior block contents persist past the unlink until the controller’s garbage collector reclaims them. DROP destroys nothing physical. It edits pointers.

Two abuse paths follow directly. The first is an unauthorized DROP or TRUNCATE delivered through SQL injection - CWE-89 - where an injected statement terminates a query with DDL the application never intended. This maps to MITRE T1485, data destruction, and to T1565 when the goal is manipulation rather than loss. The second is an operator with valid credentials, or a stolen credential, issuing DROP to remove evidence after an intrusion - T1070, indicator removal. Both rely on the same assumption the defender makes, that DROP erases. It does not. The data the actor intended to destroy persists in three locations: the unlinked-but-unzeroed blocks, the write-ahead log stream that recorded the DDL, and every base backup taken before the command ran. Point-in-time recovery to a target timestamp one second before the DROP reconstructs the table, its indexes, and its contents intact. There is no CVE for this. It is documented, intended behavior - which is precisely why it is overlooked. A misread of correct behavior produces the same exposure as a vulnerability.

Recovery of the supposedly-deleted data is mechanical, not speculative. A dead tuple left by DELETE retains its full header and column data on the heap page until vacuum prunes it and a later write reuses the line pointer; block-level inspection reads those tuples straight out of the page image. A relfilenode unlinked by DROP leaves its segment data in free extents until the filesystem or the SSD controller overwrites them, and a raw block scan recovers structured rows from that space. Neither path needs the catalog entry that DROP removed - the page format is self-describing enough to reconstruct column boundaries from the tuple headers alone. The WAL stream is the cleaner source: replaying the archive up to the LSN immediately preceding the DROP rebuilds the relation through the engine itself. Each of these is a standard part of database forensics, and each is available to an investigator and to an attacker who reaches the storage.

Replication turns the misread into divergence. Physical streaming replication ships the catalog deletes and the file unlinks as WAL records; the standby replays them and the table is gone on both nodes. A time-delayed standby configured with recovery_min_apply_delay holds the full pre-DROP state for the length of the delay window, queryable the entire time. Logical replication is the larger gap. Built-in logical replication does not replicate DDL. A DROP TABLE on the publisher removes the relation locally and leaves the subscriber’s copy live, writable, and serving reads. The schemas drift and no error is raised. TRUNCATE behaves differently again. It is replicated under logical replication by default for PG11 and later, but a publication whose publish list has been narrowed to exclude truncate, or a subscriber older than PG11, silently drops the operation. The dataset diverges with no failure signal. A table believed deleted continues to exist on the replica that was meant to be its mirror.

Telemetry is where the gap becomes operational. log_statement controls what the engine records. Set to ddl, it captures DROP and TRUNCATE. Set to mod, it captures DELETE as well. The default is none, which records neither. pgAudit raises the resolution further, emitting structured AUDIT entries tagged with the command and object class - DDL, DROP TABLE, with the fully qualified relation name - at session and object granularity. pg_stat_activity exposes the statement while it runs. pg_stat_user_tables shows n_dead_tup climbing under a bulk DELETE and resetting to zero after a TRUNCATE, a clean before-and-after signature. At the host level, auditd or Sysmon for Linux records the unlink syscalls against $PGDATA/base//. That unlink is the physical fingerprint of a DROP or TRUNCATE, and it is the one signal an attacker cannot suppress from inside the database.

What does not fire matters more. Managed Postgres removes the host layer from view. RDS, Aurora, and Cloud SQL do not expose the filesystem, so there is no unlink telemetry at all. CloudTrail logs the management-plane API call, not the SQL statement, so a DROP issued over an existing connection is invisible to it. SQL-level visibility depends entirely on the engine’s own log export being switched on, and the default RDS parameter group does not log DDL. A DELETE used to corrupt rather than destroy generates only mod-level entries, and on a busy OLTP system that line is buried under legitimate writes. The WAL volume spike from a large DELETE is visible in pg_stat_wal and in archive throughput, but it presents as load, not as an attack. Retention compounds the exposure. WAL kept for point-in-time recovery, and replication slots that pin segments until a lagging subscriber consumes them, both extend the window in which the deleted data stays reconstructable - often far longer than any data-retention policy assumes.

The technical reality at the close. DROP TABLE is not a delete and not an erase. It is a catalog edit plus a set of file unlinks. The data survives in unlinked blocks, in the WAL archive, in PITR-recoverable base backups, and - under logical replication - running live on the subscriber. For recovery, that survival is insurance. For incident response, it is a forensic asset; an unexpected DROP in the ddl log is an escalation trigger, and the correct first action is to freeze WAL recycling and the backup set and hand the timeline to the security team before anything is recovered or rotated. For compliance, that same survival is the exposure. Australian Privacy Act erasure obligations and SOCI data-handling expectations are not satisfied by a command that leaves the bytes resident on disk, in the archive, and on a replica. Genuine erasure of regulated data is crypto-shredding - an encrypted tablespace whose key is destroyed - or media sanitization. Not DDL. After the DROP, the residual exposure is the entire pre-DROP dataset, reachable by anyone holding the backups or the storage. The command that looks like the most destructive verb in the language destroys the least.

Share

Keep Reading

Stay in the loop

New writing delivered when it's ready. No schedule, no spam.