RC RANDOM CHAOS

Google killed io_uring fleet-wide in 2023

io_uring runs file and network operations off the syscall path, blinding seccomp, auditd, and EDR, while epoll stays observable to defenders.

Google disabled io_uring across ChromeOS, Android, and its production Linux fleet in 2023. The reason was not throughput. In Google’s kCTF vulnerability reward program, roughly 60 percent of the successful kernel exploit submissions over the reviewed period abused io_uring. One subsystem accounted for the majority of working local privilege escalations against a hardened kernel. That ratio is the whole argument.

epoll and io_uring answer the same engineering question: how a process holding thousands of concurrent file descriptors gets its I/O serviced efficiently. epoll reports readiness and leaves the operation itself to an ordinary syscall. io_uring takes the operation and reports its completion. They sit at opposite ends of kernel security visibility. epoll is syscall-bound and observable. io_uring is shared-memory and largely invisible to the controls built to watch syscalls. The migration from one to the other, done for performance, retired the telemetry defenders were relying on without anyone filing a change request for it.

epoll is the older interface and the better-understood one. A process calls epoll_create1 to get an eventpoll instance, registers descriptors with epoll_ctl(EPOLL_CTL_ADD), and blocks on epoll_wait for readiness. Internally the kernel holds a struct eventpoll containing a red-black tree of registered interests and a ready list. Each registered descriptor produces a struct epitem. Each epitem links into the polled file’s wait queue through a struct eppoll_entry, whose callback ep_poll_callback fires when the underlying file becomes ready and moves the epitem onto the ready list. Three operations, three distinct syscalls. Every state transition crosses the syscall boundary where seccomp-bpf, auditd, and syscall-hooking EDR are positioned.
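The three-call sequence is easy to see from user space. The following is a minimal sketch, assuming Linux and CPython, whose `select.epoll` object wraps the same kernel interface. The helper name `epoll_roundtrip` is made up for illustration. Each commented line corresponds to one syscall that a seccomp filter or audit rule would see.

```python
import os
import select


def epoll_roundtrip(timeout=1.0):
    """Drive one pipe through create/register/wait.

    Returns (read_fd, events_before_write, events_after_write).
    """
    read_fd, write_fd = os.pipe()
    ep = select.epoll()                        # epoll_create1(2)
    try:
        ep.register(read_fd, select.EPOLLIN)   # epoll_ctl(2), EPOLL_CTL_ADD
        before = ep.poll(0)                    # epoll_wait(2): pipe empty, nothing ready
        os.write(write_fd, b"x")               # make the read end readable
        after = ep.poll(timeout)               # epoll_wait(2): read end is now reported
        return read_fd, before, after
    finally:
        ep.close()
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    fd, before, after = epoll_roundtrip()
    print("ready before write:", before)
    print("ready after write:", after)
```

Running the script under a syscall tracer should show each of the three epoll calls as its own event, which is the observability property the rest of the argument leans on.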

epoll’s relevance to exploitation is not that it is a rich bug source. It is that it is a reliable kernel heap primitive. struct epitem and struct eppoll_entry are fixed-size allocations from dedicated slab caches. With slab cache merging enabled in most distribution kernels, those caches alias the general kmalloc caches of matching size, which makes epoll object allocation a controllable way to shape the kernel heap. An attacker holding a use-after-free or out-of-bounds write in an unrelated subsystem uses epoll to fill freed slots with attacker-influenced objects, place a target structure at a predictable offset, or reclaim a dangling pointer. The eppoll_entry carries a wait queue entry with a function pointer. That pointer is a known control target when a UAF lets the attacker overlap a freed object with a fresh epoll allocation. epoll is the grooming tool, not the vulnerable component. MITRE maps the outcome to T1068, exploitation for privilege escalation. The bug lives elsewhere. epoll makes the bug reliable.

The reason epoll keeps showing up in kernel exploitation writeups, over alternatives like msg_msg or signalfd, is allocation control. The attacker decides exactly when an epitem is allocated and when it is freed, the size class is stable across kernel versions, and the eppoll_entry gives a function pointer inside a structure the attacker can spray on demand. The ready list also supplies a kernel read side channel in certain UAF configurations, because the contents that get linked and surfaced through epoll_wait can be steered when a confused object overlaps an epitem. None of this is a vulnerability in epoll. It is epoll being a well-behaved, predictable allocator that a kernel memory corruption elsewhere borrows to turn a fragile bug into a stable arbitrary read and write. That is the offensive value, and it is why an exploit can chain a UAF in an obscure driver to root using a subsystem present in every general-purpose Linux build.

io_uring is a different architecture and a different problem. The process calls io_uring_setup, which returns a file descriptor for the new instance. Through that descriptor the process maps two ring buffers that it shares with the kernel, a submission queue and a completion queue. Work is submitted by writing submission queue entries (SQEs) into that shared memory and, in the common case, calling io_uring_enter to tell the kernel to process them. With SQPOLL mode a kernel-side thread polls the submission queue and even io_uring_enter becomes optional. The operations themselves, openat, read, write, connect, sendmsg, recvmsg, and dozens more, are described as opcodes inside the SQEs. They are not individual syscalls. The kernel dispatches them, and work that would block is handed to kernel worker threads in the io-wq pool.

That dispatch model is what bypasses traditional kernel-level security controls, and the bypass is real. seccomp-bpf filters by syscall number at the syscall entry boundary. A seccomp policy can permit io_uring_setup and io_uring_enter, or deny them, but it cannot inspect the opcodes inside an SQE. A process confined by a seccomp filter that forbids openat or connect at the syscall layer could, on an unrestricted io_uring instance, perform the equivalent operation as an SQE opcode that never transits the filtered path. The filter is enforced against syscalls. io_uring executes file and network operations without issuing those syscalls. The same gap applies to auditd, which keys on syscall events, and to any EDR sensor whose Linux coverage is built on syscall tracepoints or kprobes at the syscall entry. The actual openat performed by an io-wq worker on behalf of a malicious process generates no openat syscall record. Mapped to ATT&CK, this is T1562 territory, impairing defenses, achieved not by disabling a sensor but by operating beneath where the sensor watches. Paired with the heap-grooming and LPE work, the same instance supports T1014-style concealment of the operations that follow root.
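The gap can be stated as a toy model. This is not real seccomp: an actual filter is a BPF program evaluated over the syscall number and raw arguments, and the policy set and function names below are hypothetical. The sketch only illustrates the decision structure described above, where the filter is consulted about the syscall being entered and never about the opcodes queued in shared memory.

```python
# Toy model only. Real seccomp-bpf evaluates a BPF program at syscall entry;
# these Python sets and helpers are illustrative assumptions.

def seccomp_allows(syscall_name, denied):
    """The model filter sees only which syscall is being entered."""
    return syscall_name not in denied


def direct_path(operation, denied):
    """Operation issued as its own syscall: the filter gets a vote."""
    return seccomp_allows(operation, denied)


def ring_path(sqe_opcodes, denied):
    """Operations carried as SQE opcodes behind one io_uring_enter.

    The filter is asked about io_uring_enter alone, so every queued
    opcode passes or fails together with that single decision.
    """
    if not seccomp_allows("io_uring_enter", denied):
        return []
    return list(sqe_opcodes)


if __name__ == "__main__":
    policy = {"openat", "connect"}  # hypothetical deny list
    print(direct_path("openat", policy))
    print(ring_path(["IORING_OP_OPENAT", "IORING_OP_CONNECT"], policy))
```

In the model, denying `openat` stops the direct call but not the ring submission. Only adding the io_uring syscalls themselves to the deny list changes the second result, which mirrors the all-or-nothing choice a real seccomp policy faces.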

The bug surface is the other half. io_uring is large, asynchronous, and full of object lifetime complexity, which is precisely the condition that produces use-after-free and type confusion. CVE-2021-41073 was a type confusion in the read path’s loop_rw_iter that yielded a use-after-free and local privilege escalation, high severity, local vector. CVE-2022-29582 was a use-after-free in io_uring timeout handling, again local privilege escalation. CVE-2023-2598 was a flaw in fixed buffer registration that, with specific huge-page conditions, exposed access to physical memory outside the registered region, an arbitrary read and write primitive into physical memory from an unprivileged context. These are not exotic edge cases. They are the recurring shape of the subsystem, and they are why Google’s kCTF data skewed the way it did and why Google pulled io_uring out of its own attack surface rather than wait for the next one.

The telemetry difference is where defenders feel this. An epoll-groomed kernel exploit still leaves syscall residue. epoll_ctl and epoll_wait appear in auditd records, in eBPF programs attached to syscall tracepoints, and in Sysmon for Linux where the relevant syscalls are configured, though Sysmon for Linux covers a narrow set of events and will not characterize the grooming as malicious on its own. The point is that the events exist and can be correlated. An io_uring-driven operation does not leave that residue. The submission shows, at most, as an io_uring_enter, and under SQPOLL not even that. The downstream openat, read, or connect carried out by the io-wq pool produces no corresponding syscall audit event. A file read that a syscall-based file integrity monitor would normally catch goes unrecorded. An outbound connection that a syscall-based egress control would normally see is absent from that view. Data exfiltration over an io_uring socket and credential file access over io_uring openat both run under the radar of sensors positioned at the syscall layer.

The tell, where one exists, is in the kernel worker threads. io_uring spawns named kernel threads for its worker pool, historically io_wqe_worker and in current kernels iou-wrk-, with iou-sqp- for the SQPOLL thread. Their presence in a process tree indicates an active io_uring instance and the volume of asynchronous work behind it. Detection that actually covers the operations has to move off the syscall path and onto io_uring’s own tracepoints, io_uring_submit_sqe and io_uring_complete, which expose the opcode and the target. Most deployments do not collect those. The instrumentation built over a decade for syscall monitoring does not reach into the ring.
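A coarse check for those thread names can be built from procfs alone. The sketch below assumes the usual `/proc/<pid>/task/<tid>/comm` layout. The function name and the `proc_root` parameter are illustrative choices, the latter added so the scan can be pointed at a test directory.

```python
import os

# Thread-name prefixes mentioned above: current workers, the SQPOLL thread,
# and the historical worker name.
RING_THREAD_PREFIXES = ("iou-wrk-", "iou-sqp-", "io_wqe_worker")


def find_io_uring_threads(proc_root="/proc"):
    """Return {pid: [thread names]} for processes with io_uring-named threads."""
    hits = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        task_dir = os.path.join(proc_root, entry, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited or is not readable
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "comm")) as fh:
                    name = fh.read().strip()
            except OSError:
                continue  # thread exited between listing and reading
            if name.startswith(RING_THREAD_PREFIXES):
                hits.setdefault(int(entry), []).append(name)
    return hits


if __name__ == "__main__":
    for pid, names in sorted(find_io_uring_threads().items()):
        print(pid, names)
```

A thread name is only a hint. Worker threads come and go with load, and a process can rename its own threads, so a match here says where to look and an empty result proves nothing.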

For a detection engineer the practical consequence is a coverage hole that no correlation rule over existing syscall data can fill, because the events were never emitted. A SIEM rule looking for a sensitive file read followed by an outbound connection assumes both actions produced openat and connect records. Over io_uring they did not, so the rule sees a process that opened nothing and connected nowhere. The achievable detections are indirect and coarse. Flag processes that create io_uring instances when the workload has no legitimate reason to, which for most line-of-business and interactive software is the normal case. Alert on the iou-wrk and iou-sqp kernel threads parented to processes outside a known allowlist of high-performance services. Where the eBPF budget exists, attach to the io_uring_submit_sqe tracepoint and record the opcode, which is the only place the openat or connect intent is visible at all. Treat the appearance of io_uring under a process that historically only made plain syscalls as an anomaly worth surfacing, because a shift in how a known binary talks to the kernel is itself signal.
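The difference between the two kinds of rule shows up clearly on toy data. The event schema (`pid`, `comm`, `syscall`, `path`), the allowlist contents, and both rule functions below are hypothetical stand-ins for whatever a given SIEM uses. The sketch only illustrates why a correlation over openat and connect records finds nothing when those records were never emitted, and why the coarse io_uring_setup rule still has something to match.

```python
ALLOWLIST = {"nginx", "postgres"}  # hypothetical high-performance services


def read_then_connect(events, sensitive_prefix="/etc/"):
    """Classic rule: a pid opens a sensitive path, then connects out."""
    opened = set()
    alerts = []
    for ev in events:
        if ev["syscall"] == "openat" and ev.get("path", "").startswith(sensitive_prefix):
            opened.add(ev["pid"])
        elif ev["syscall"] == "connect" and ev["pid"] in opened:
            alerts.append(ev["pid"])
    return alerts


def unexpected_ring_setup(events, allowlist=ALLOWLIST):
    """Coarse rule: io_uring_setup from a binary outside the allowlist."""
    return [
        ev["pid"]
        for ev in events
        if ev["syscall"] == "io_uring_setup" and ev["comm"] not in allowlist
    ]


# The same behaviour, as the syscall layer records it on each path.
VIA_SYSCALLS = [
    {"pid": 41, "comm": "updater", "syscall": "openat", "path": "/etc/shadow"},
    {"pid": 41, "comm": "updater", "syscall": "connect"},
]
VIA_RING = [
    {"pid": 42, "comm": "updater", "syscall": "io_uring_setup"},
    {"pid": 42, "comm": "updater", "syscall": "io_uring_enter"},
]

if __name__ == "__main__":
    print(read_then_connect(VIA_SYSCALLS), read_then_connect(VIA_RING))
    print(unexpected_ring_setup(VIA_RING))
```

The second rule cannot say what the process did through the ring, only that it created one. That is the indirect, coarse signal described above.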

The patch boundary is the part that does not behave like a normal CVE. The individual io_uring use-after-frees and type confusions are fixed in the kernel versions that carry their patches, and those should be applied. The architectural exposure is not a version number. As long as a kernel exposes unrestricted io_uring to untrusted local code, it carries both a bug-dense local privilege escalation surface and a syscall-audit blind spot, and that holds after every individual bug is patched, because the exposure is the design, not the defect. The controls that change the posture are coarse. The kernel.io_uring_disabled sysctl, added in 6.6, takes a value of 2 to refuse io_uring creation for every process, or 1 to restrict creation to privileged processes and members of the group named by kernel.io_uring_group. IORING_REGISTER_RESTRICTIONS lets a process that must use io_uring lock its own instance down to a permitted opcode set before handing control to less-trusted code. A seccomp policy on its own does not close the gap, because seccomp filters the two io_uring syscalls and not the opcodes they carry, so it has to be paired with io_uring restriction or disablement to mean anything against this path.
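Checking which posture a host is in takes one file read. This is a minimal sketch, assuming the sysctl is exposed at `/proc/sys/kernel/io_uring_disabled` as on kernels that carry it. The function name and the wording of the returned strings are illustrative.

```python
POSTURES = {
    0: "enabled: any process may create io_uring instances",
    1: "restricted: creation limited to privileged processes and io_uring_group members",
    2: "disabled: io_uring creation refused for all processes",
}


def io_uring_posture(path="/proc/sys/kernel/io_uring_disabled"):
    """Read the sysctl and describe the resulting io_uring exposure."""
    try:
        with open(path) as fh:
            value = int(fh.read().strip())
    except FileNotFoundError:
        # Older kernels lack the sysctl; io_uring may still be fully exposed.
        return "unknown: sysctl not present on this kernel"
    return POSTURES.get(value, f"unknown: unrecognised value {value}")


if __name__ == "__main__":
    print(io_uring_posture())
```

A missing file should be read as "no switch available", not as "io_uring absent", which is why the sketch reports it separately from the disabled state.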

epoll stays observable because it never left the syscall boundary. io_uring delivered the performance the workloads asked for and, in the same move, relocated file and network operations to a path that seccomp does not filter, auditd does not record, and most EDR does not instrument. The exploitation surface and the detection gap are the same architectural fact viewed from the offensive and defensive sides. The version that ships the fix for the last io_uring bug still ships io_uring.
