DuckDB trusts persisted blocks attackers control

DuckDB is an in-process OLAP engine written in C++. It ships as a library, not a service. There is no listener, no daemon, no socket to scan. The code runs inside whatever process imports it - python.exe, a Node worker, an R session, a browser tab compiled to WebAssembly. The vulnerability surface is not the network. It is the parser. The parser eats untrusted bytes inside the host’s address space with the host’s privileges.

That distinction is the whole post. DuckDB’s storage and execution model is built on immutability and integrity assumptions. Persistent blocks are written once and treated as immutable. MVCC uses copy-on-write rather than in-place mutation. Block-level checksums guard the file format. Each of those is an integrity assumption, and an integrity assumption is a control. When attacker-controlled data violates the assumption before the control validates it, the result is silent corruption. Corruption inside a columnar engine does not announce itself.

Immutability is the assumption worth dwelling on. DuckDB does not mutate persistent blocks in place. A transaction that modifies data writes new blocks and repoints metadata; the old block stays untouched until it is reclaimed. The design buys crash consistency and cheap snapshots. It also installs an assumption across the engine - a block, once written and checksummed, is what it claims to be. Code paths that read persisted structure are written against that assumption. They bounds-check less aggressively than a parser built for a known-hostile format would, because the format is treated as self-produced. The threat model the code was written under is ‘DuckDB wrote this file.’ The threat model that matters is ‘an attacker wrote this file.’ The gap between those two is the surface. Immutability is what makes the engine confident enough to skip the check.
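The write-new-and-repoint behaviour can be sketched in a few lines. This is a toy model, not DuckDB's storage manager: `ToyBlockStore`, `write_block`, and `update_column` are hypothetical names, and real block allocation, free lists, and checkpointing are far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBlockStore:
    """Hypothetical append-only block store; not DuckDB's real storage manager."""
    blocks: dict = field(default_factory=dict)    # block id -> immutable bytes
    metadata: dict = field(default_factory=dict)  # column name -> current block id
    next_id: int = 0

    def write_block(self, payload: bytes) -> int:
        block_id = self.next_id
        self.blocks[block_id] = bytes(payload)  # written once, never overwritten
        self.next_id += 1
        return block_id

    def update_column(self, column: str, payload: bytes) -> int:
        # Copy-on-write: allocate a fresh block, then repoint the metadata.
        new_id = self.write_block(payload)
        self.metadata[column] = new_id
        return new_id


store = ToyBlockStore()
first = store.update_column("price", b"\x01\x02\x03")
second = store.update_column("price", b"\x09\x09\x09")

# The old block is still intact; only the metadata pointer moved.
print(store.blocks[first], store.metadata["price"] == second)
```

Nothing in this model ever asks who produced the bytes in a block. That is the assumption the rest of the post leans on.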

Start with the storage layout. A DuckDB database is a single file divided into fixed-size blocks, 256KB by default. Blocks hold columnar segments. Metadata blocks describe where column data lives, which compression codec applies, and how to reconstruct vectors at scan time. The format carries checksums. A checksum proves a block matches what was written. It does not prove the block’s contents are semantically safe to parse, and several structures are read to locate and bound the checksummed region before the checksum is verified.
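A minimal sketch of why a checksum is not validation. The layout here is an assumption made for illustration: an 8-byte checksum followed by a payload whose first 4 bytes are a value count. CRC32 from `zlib` stands in for the real checksum function, and `seal`, `checksum_ok`, and `value_count` are hypothetical helpers, not DuckDB APIs.

```python
import struct
import zlib

BLOCK_PAYLOAD = 64  # toy payload size; real blocks are far larger


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its checksum (CRC32 as a stand-in)."""
    return struct.pack("<Q", zlib.crc32(payload)) + payload


def checksum_ok(block: bytes) -> bool:
    stored, = struct.unpack_from("<Q", block, 0)
    return stored == zlib.crc32(block[8:])


def value_count(block: bytes) -> int:
    # Hypothetical metadata field: number of 4-byte values in the payload.
    count, = struct.unpack_from("<I", block, 8)
    return count


# The attacker is the writer, so they seal whatever they like.
hostile_payload = struct.pack("<I", 0xFFFFFFFF).ljust(BLOCK_PAYLOAD, b"\x00")
hostile_block = seal(hostile_payload)

max_values = (BLOCK_PAYLOAD - 4) // 4  # what the payload can actually hold

print(checksum_ok(hostile_block), value_count(hostile_block) > max_values)
```

The block verifies and the count is still impossible. The checksum catches accidental damage; it says nothing about a writer who computed it over hostile contents.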

DuckDB executes on vectors. A DataChunk is a batch of columns, each a Vector of up to 2048 values carrying a data pointer and a validity mask. Strings use string_t - 16 bytes. Up to 12 bytes stored inline. Longer strings store a 4-byte length, a 4-byte prefix, and an 8-byte pointer into a string heap. That pointer is the surface. A length field and a heap offset, both drawn from the file, both trusted to point inside an allocation. A crafted length or offset turns a sequential scan into an out-of-bounds read. CWE-125, out-of-bounds read, at the most basic. Use the same kind of unchecked offset or dictionary index as a write target during decompression and it becomes an out-of-bounds write, CWE-787.
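The 16-byte layout and the check it needs can be sketched as follows. Assumptions: little-endian fields, and the 8-byte slot treated as an offset into a heap buffer rather than a raw pointer. `read_string` is a hypothetical helper, not a DuckDB function. Python slices clamp silently instead of reading out of bounds, so the explicit comparison stands in for the check a C++ reader would have to make.

```python
import struct

INLINE_MAX = 12   # bytes that fit inline after the 4-byte length
RECORD_SIZE = 16  # total size of one string record


def read_string(record: bytes, heap: bytes) -> bytes:
    if len(record) != RECORD_SIZE:
        raise ValueError("record must be exactly 16 bytes")
    length, = struct.unpack_from("<I", record, 0)
    if length <= INLINE_MAX:
        return record[4:4 + length]  # short string stored inline
    prefix = record[4:8]
    offset, = struct.unpack_from("<Q", record, 8)
    # Both fields come from the file, so validate them before use.
    if offset > len(heap) or length > len(heap) - offset:
        raise ValueError("string extends past the heap")
    value = heap[offset:offset + length]
    if value[:4] != prefix:
        raise ValueError("prefix does not match heap contents")
    return value


heap = b"hello, columnar world"
inline = struct.pack("<I12s", 5, b"short")
long_record = struct.pack("<I4sQ", len(heap), heap[:4], 0)
hostile = struct.pack("<I4sQ", 0x7FFFFFFF, b"hell", 8)  # length far past the heap

print(read_string(inline, heap), read_string(long_record, heap))
```

Passing `hostile` to `read_string` raises `ValueError`. Without the comparison against `len(heap)`, the same record in a native reader is a read of roughly 2 GB starting eight bytes into a 21-byte allocation.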

Compression widens it. DuckDB stores columns under RLE, dictionary, bit-packing, FSST, and others. Each codec is a decoder that takes compressed bytes plus parameters - run lengths, dictionary sizes, bit widths - and reconstructs values into a vector. The parameters come from the file. An oversized run length, a bit width that overflows the destination stride, a dictionary count that exceeds the backing buffer - these are integer-overflow-to-overwrite primitives, CWE-190 feeding CWE-787. The decoder trusts the header because in normal operation the header was written by DuckDB itself. The attack assumes it was not.
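A hedged sketch of the check an RLE decoder needs when the header is hostile. The on-disk shape here, pairs of a 16-bit run length and a 32-bit value, is invented for illustration and is not DuckDB's RLE segment format. `decode_rle` is a hypothetical name.

```python
import struct

VECTOR_SIZE = 2048  # values per vector, as described above


def decode_rle(data: bytes, capacity: int = VECTOR_SIZE) -> list:
    """Decode (run_length: u16, value: i32) pairs, refusing to exceed capacity."""
    pair = struct.Struct("<Hi")
    if len(data) % pair.size != 0:
        raise ValueError("truncated run")
    out = []
    for run_length, value in pair.iter_unpack(data):
        # The run length is attacker-controlled: check remaining space first.
        if run_length > capacity - len(out):
            raise ValueError("run overflows destination vector")
        out.extend([value] * run_length)
    return out


good = struct.pack("<HiHi", 3, 7, 2, -1)
hostile_run = struct.pack("<Hi", 65535, 0)  # one run larger than the whole vector

print(decode_rle(good))
```

`decode_rle(hostile_run)` raises instead of writing 65535 values into a 2048-slot destination. The comparison is written as `run_length > capacity - len(out)` rather than `len(out) + run_length > capacity` on purpose: in a fixed-width native integer the addition form is the one that can wrap.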

The exploit path needs no network and no privilege escalation at entry. The attacker controls the bytes of a file. A .duckdb database. A Parquet file handed to read_parquet. A CSV with a crafted type. The file moves through a normal channel - a shared bucket, a pipeline artifact, a sample dataset, an attachment a data engineer opens in a notebook. The victim runs a query. The scan operator reads the malformed metadata, computes a pointer or a length from attacker bytes, and dereferences it inside the host process. That is the primitive. From a controlled out-of-bounds read or write inside a heap shared with the rest of the application, the standard heap-grooming work applies. The mechanics from there are public and not the subject of this post.

The loud path is simpler and already documented. DuckDB extensions are native shared libraries. LOAD maps one with dlopen or LoadLibrary. Signature enforcement is on by default; allow_unsigned_extensions disables it. With the flag set, a LOAD of an attacker-supplied path is arbitrary native code execution by design, no memory corruption required. MITRE T1129, shared modules. The httpfs extension and enable_external_access extend reach outward to HTTP and S3. The corruption path is the quiet one. The quiet one is the point of this series.

DuckDB’s reach is the reason the surface matters. It is embedded in dbt, in analytics notebooks across Python and R, in ETL workers, and in MotherDuck’s managed layer. Compiled to WebAssembly, it runs inside browser tabs parsing Parquet client-side. Cloudflare and others run columnar analytics close to the data at scale, and the embedded model means the engine inherits the trust and the privileges of the process hosting it. Data files are shared, cached, downloaded, and reopened constantly. The unit of exchange - a database file or a Parquet object - is handled as data, not as executable input. That is the trust boundary violation at the centre of the class.

There is no named campaign to attribute here. No APT label, no in-the-wild marker. That absence is consistent with the bug class rather than evidence against it. Integrity degradation does not generate an incident the way ransomware does. MITRE T1203 covers the file-open-to-execution case. T1565.001, stored data manipulation, covers the quieter outcome - a corrupted value that survives the scan, passes the checksum because the corruption was crafted to pass it, and propagates into a result set nobody flags because the query returned successfully.

The scoring reflects the vector, not the severity of the outcome. File-parsing memory corruption in an embedded library carries a local attack vector and a user-interaction requirement - a victim has to open the file - which holds the class in the high-but-not-critical band rather than the 9.8 network-RCE tier. A characteristic CVSS 3.1 vector reads AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H, a base score of 7.8. Triage by base score alone deprioritises it. The score measures reachability. It does not measure how many pipelines reopen the same poisoned file.
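The arithmetic behind the band can be checked directly. The sketch below applies the CVSS v3.1 base-score formula for scope-unchanged vectors, with metric weights taken from the specification. `base_score` and `roundup` are names chosen here, and the function deliberately rejects scope-changed vectors rather than handling them.

```python
import math

# CVSS v3.1 metric weights (scope unchanged).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR = {"N": 0.85, "L": 0.62, "H": 0.27}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(x: float) -> float:
    """Round up to one decimal place, as defined in the v3.1 specification."""
    i = round(x * 100000)
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    m = dict(part.split(":") for part in vector.split("/"))
    if m["S"] != "U":
        raise ValueError("sketch handles scope-unchanged vectors only")
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * PR[m["PR"]] * UI[m["UI"]]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


file_open = base_score("AV:L/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:H")
network = base_score("AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H")
print(file_open, network)  # 7.8 9.8
```

The impact term is identical in both vectors. The two-point gap comes entirely from the exploitability term, which is to say from AV and UI.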

Telemetry is where the embedded model hurts defenders. DuckDB is a library, so there is no database process to watch. EDR sees python.exe, node, the ETL container - never DuckDB. A corruption that crashes the engine surfaces as a segfault in the host application, attributed to the host, triaged as a stability bug, not a security event. On Windows that may produce a WER entry or a Sysmon Event ID 5 process termination for the host. On Linux it is a core dump most EDR never forwards.

Extension loading is the one event with a hook. Mapping a .duckdb_extension is an image load - Sysmon Event ID 7 - but only when image-load logging is enabled, and most deployments filter it for volume. On Linux the dlopen is effectively invisible to host telemetry. Reading an untrusted .duckdb or Parquet file leaves no native event by default; FileCreate, Sysmon Event ID 11, fires on write, not on read, and catching the read requires object-access auditing - Windows Security Event 4663 - configured on the path. The successful corruption is the worst case. No crash, no load, no anomalous file event. A wrong value in a column, a flipped boolean in a validity mask, an off-by-one in an aggregate. The query returns successfully. Nothing fires.

The WebAssembly build narrows the visibility further. Compiled to WASM, DuckDB runs in the browser inside linear memory with the WASM sandbox around it. The sandbox contains native code execution; an out-of-bounds write corrupts the module’s linear memory, not the host. It does not contain integrity loss. A corrupted decode inside linear memory still produces wrong values, and a client-side analytics view built on attacker-supplied Parquet renders attacker-influenced numbers with no native crash and no host-side telemetry at all.

The patch boundary is narrow. Memory-safety fixes land per release, and running a current build closes specific overflow and bounds bugs as they are found. The residual exposure is structural and survives every patch. Opening an untrusted database or Parquet file runs C++ parsers against attacker-controlled bytes, and no version removes that vector - it is the function of the tool. allow_unsigned_extensions left false keeps the loud RCE path closed; flipped true, no patch matters. enable_external_access governs the outward reach. The corruption that does not crash is the residual that outlasts the fix - data that passed the checksum, scanned clean, and degraded silently. CVEs land on the crashes. The integrity loss that never threw is the part the advisory does not cover, and it is the part that decides whether the numbers downstream are real.

Part 2 takes the storage format apart block by block - header, checksum boundary, and the metadata structures read before the checksum validates.
