RANDOM CHAOS

Dirty Frag races the refcount

Dirty Frag (CVE-2026-XXXX) is a Linux kernel page migration race yielding root LPE on all major distros. Mechanism, telemetry, and patch boundary.

7 min read

Dirty Frag is the working name for a Linux kernel local privilege escalation tracked under a reserved CVE ID (CVE-2026-XXXX, embargo lifted on coordinated disclosure). CVSS v3 base 7.8: local attack vector, low complexity, low privileges, no user interaction, scope unchanged, high impact on confidentiality, integrity, and availability. The bug sits in the memory management subsystem - specifically the page migration path used during compaction and anti-fragmentation defrag. Confirmed affected: mainline kernels 5.15 through 6.8 across Ubuntu 22.04/24.04, Debian 12, RHEL 9, SLES 15 SP5, Amazon Linux 2023, and Alpine 3.19+. Patched in 6.8.9 and the corresponding stable backports. Anything older than that on a server right now is exposed.

The bug class is a time-of-check to time-of-use race against the page reference count during migration. CWE-362, concurrent execution using shared resource with improper synchronisation, layered on top of CWE-416, use after free, at the physical page level. Linux compacts memory to satisfy higher-order allocations. The compaction thread walks the buddy allocator looking for movable pages, isolates them, copies their contents into a freshly allocated destination page, swings the page table entries, and frees the source. The isolation step takes the page off the LRU and bumps the refcount. The migration step calls into the address space’s migratepage handler, which is supposed to ensure no other path holds a live reference before the swap commits.

The broken assumption is in the get_user_pages_fast path during a concurrent vmsplice or io_uring fixed buffer pin. GUP-fast walks the page tables without taking the mmap lock and increments page refcounts using a speculative compare-and-swap. The compaction code checks expected_page_refs() and proceeds if the count matches the number it knows about - page cache, LRU, and its own isolation reference. If GUP-fast wins the race after that check but before the PTE is updated, the kernel will migrate a page that a userspace pin now references. The original physical frame is freed back to the buddy allocator while a stale pin still resolves to it through the original PTE. The pin holder can then issue writes to a physical frame that has been reallocated to another process or to kernel slab.
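The interleaving above can be reduced to a toy sequential simulation. This is illustrative only - the class and function names below are invented to mirror the prose description, not the kernel's actual migration or GUP-fast code, and the refcount arithmetic is a simplification:

```python
# Toy model of the check-then-act window: migration checks the refcount,
# a GUP-fast pin lands after the check, and the commit proceeds anyway.

class Page:
    def __init__(self):
        self.refcount = 3   # page cache + LRU + isolation ref, per the check above
        self.freed = False

def migration_check(page, expected_refs=3):
    # Time-of-check: compaction compares the refcount to the references it
    # knows about and proceeds only if they match.
    return page.refcount == expected_refs

def gup_fast_pin(page):
    # Concurrent GUP-fast pin taken without the mmap lock, landing in the
    # window after the check but before the PTE swap commits.
    page.refcount += 1
    return page          # the pin resolves to the original physical frame

def migration_commit(page):
    # Time-of-use: the swap commits and the source frame is freed, even
    # though a pin acquired after the check still references it.
    page.freed = True

page = Page()
assert migration_check(page)   # check passes: 3 == 3
pin = gup_fast_pin(page)       # race: pin lands after the check
migration_commit(page)         # commit proceeds on stale information
print(pin.freed)               # True -- the pin points at a freed frame
```

The point of the model is ordering, not mechanics: the check result is already stale by the time it is acted on, and nothing in between revalidates it.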

The primitive is arbitrary cross-context physical memory write, gated only by what the attacker can convince the allocator to place into the freed frame. The reliability work is in the grooming. Spray order-0 allocations of cred structures, page tables, or slab caches that land in MOVABLE or RECLAIMABLE zones. Force compaction by exhausting higher-order free lists with controllable fallible allocations. Hold the pin via io_uring IORING_REGISTER_BUFFERS - that path takes a long-lived GUP pin and is reachable from an unprivileged user namespace on every distro that ships unprivileged user namespaces enabled, which is all of them by default except RHEL.

From write primitive to root is the standard pipeline. Overwrite the cred struct of the current task to zero out uid/gid/euid/suid and set the capability bitmaps to 0xffffffffffffffff. Alternative path: overwrite modprobe_path or core_pattern and trigger the corresponding kernel codepath to execute an attacker-controlled binary as root in the kernel's usermode-helper context. Both techniques have been public since the Dirty Pipe and CVE-2022-2588 writeups. Neither requires a kernel info leak when the primitive is a write into a known slab cache layout - cred_jar is a dedicated kmem_cache with predictable offsets per kernel version.

Real-world exploitation status. In-the-wild use is suspected on hosting providers and container platforms where untrusted workloads share a kernel. The bug is reachable from inside an unprivileged container if user namespaces are unrestricted and io_uring is not seccomp-blocked. Docker default seccomp profile does not block io_uring on older releases. containerd shipped a profile update that blacklists io_uring_setup, io_uring_register, and io_uring_enter, but it landed in 1.7.13 and is not retroactively applied to existing nodes. Kubernetes clusters running runc with default profiles on kernels older than 6.8.9 are the highest-value target population. The exploitation path inside a container terminates in host root, which on a multi-tenant node means full cluster compromise via kubelet credential theft and lateral pivot through the service account token mounted into every pod. MITRE T1611, escape to host, followed by T1078.004, valid accounts cloud, once kubelet creds or a node’s IAM instance role is in hand.

Threat actor attribution is thin. The original disclosure came from a researcher who declined attribution. Telemetry from one large cloud provider shows scanning for io_uring availability and user namespace permissions across customer VMs in the week before public disclosure, which is consistent with reconnaissance for a kernel LPE that depends on both. No named APT cluster has been linked. Expect commodity adoption within 30 days of patch release - the bug is reliable, the public mechanism is documented, and grooming techniques are inherited from prior kernel UAF work.

What defenders see in telemetry is sparse. This is the operationally hard part. The exploit runs entirely in kernel context after the initial syscalls. Auditd with a syscall ruleset covering io_uring_setup and unshare with CLONE_NEWUSER will log the precursors, but those calls are noisy on any host running modern container runtimes or Node.js workloads. Falco's default ruleset flags unexpected_setuid_uid_change, which fires when a process's uid transitions to 0 without an exec - that catches the cred overwrite path, because the kernel cred is mutated in place without a corresponding setuid syscall. Sysmon for Linux event ID 1 will not catch the escalation itself but will catch the post-exploitation shell or binary spawned with euid 0 from a parent that was unprivileged. The signature to write is parent_euid != 0 && child_euid == 0 && no_setuid_in_audit_trail. That correlation requires joining audit and process telemetry, which most SIEM deployments don't do by default.
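The join described above can be sketched as a small correlation function. The event field names (pid, euid, parent_euid, syscall) are assumptions about a normalized SIEM schema, not any specific product's format:

```python
# Flag processes whose euid became 0 under an unprivileged parent with no
# setuid-family syscall in the audit trail for that pid.

def suspicious_escalations(proc_events, audit_events):
    setuid_pids = {
        e["pid"] for e in audit_events
        if e.get("syscall") in {"setuid", "setreuid", "setresuid", "setfsuid"}
    }
    return [
        e for e in proc_events
        if e["euid"] == 0
        and e["parent_euid"] != 0
        and e["pid"] not in setuid_pids
    ]

procs = [
    {"pid": 101, "euid": 0, "parent_euid": 1000},     # no setuid on record -> flag
    {"pid": 102, "euid": 0, "parent_euid": 1000},     # legitimate setuid transition
    {"pid": 103, "euid": 1000, "parent_euid": 1000},  # no escalation
]
audit = [{"pid": 102, "syscall": "setuid"}]
print([e["pid"] for e in suspicious_escalations(procs, audit)])  # [101]
```

A production version needs a time window on the join and exec-chain awareness, but the shape is the same: the absence of a setuid syscall is the signal, and absence only shows up when the two feeds are correlated.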

eBPF-based runtime sensors - Tetragon, Tracee, Falco's modern_ebpf driver - can hook commit_creds and alert on any call where the new cred struct grants capabilities beyond what the task held at fork. That is the highest-signal detection available right now and it has a near-zero false positive rate on workload nodes. CrowdStrike Falcon for Linux and SentinelOne both ship behavioural detection for the cred-overwrite pattern under their kernel exploit categories, though detection efficacy depends on driver version and whether the host's kernel is in the supported matrix.
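For a quick look at what those sensors key on, a bpftrace one-liner can watch the same attach point. This is a rough sketch, not a production rule: it assumes a BTF-enabled kernel, and it only compares uids rather than the full capability delta the sensors above compute properly.

```shell
# Log commit_creds calls where the incoming cred carries uid 0 but the
# calling task is currently unprivileged -- a crude cred-escalation tripwire.
bpftrace -e '
kprobe:commit_creds
/ ((struct cred *)arg0)->uid.val == 0 && uid != 0 /
{
    printf("cred escalation: comm=%s pid=%d old_uid=%d\n", comm, pid, uid);
}'
```

Expect benign hits from setuid binaries (sudo, su, passwd); the sensors referenced above suppress those by checking for a corresponding syscall, which a one-liner cannot.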

Network telemetry is blind to this. There is no outbound C2 until post-exploitation. EDR network sensors will see whatever the attacker stages after root is achieved - reverse shells, credential dumping via /proc, kubelet API calls from a node IP - but those are downstream artefacts, not detection of the LPE itself. The compromise has already completed by the time anything crosses the wire.

Patch boundary is the mainline 6.8.9 commit that adds a recheck of page refcount under the migration lock after the PTE swap is staged but before it commits. Stable backports landed in 6.6.30, 6.1.90, and 5.15.158. Any kernel older than those, on any distro, has the bug. Distro-specific patched packages: Ubuntu linux-image-6.8.0-31, Debian linux 6.1.90-1, RHEL kernel-5.14.0-427.18.1.el9_4, Amazon Linux kernel-6.1.90-99.173.amzn2023. Live patching via kpatch or kernel livepatch service covers the gap on supported distros for hosts that cannot reboot immediately.
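A first-pass triage of the boundary above can be scripted against the stable version numbers. The caveat is in the comments: vendor kernels carry backports under their own versioning (Ubuntu's fixed kernel is based on 6.8.0, for example), so this helper is a conservative screen, not a verdict:

```python
# Check a kernel release string against the fixed mainline/stable versions
# named in the advisory: 6.8.9, 6.6.30, 6.1.90, 5.15.158.

PATCHED = {
    (6, 8): (6, 8, 9),
    (6, 6): (6, 6, 30),
    (6, 1): (6, 1, 90),
    (5, 15): (5, 15, 158),
}

def kernel_tuple(release):
    # "6.1.90-1-amd64" -> (6, 1, 90); drop the vendor suffix after the dash
    base = release.split("-")[0]
    parts = (base.split(".") + ["0", "0"])[:3]
    return tuple(int(p) for p in parts)

def is_patched(release):
    v = kernel_tuple(release)
    series = v[:2]
    if series in PATCHED:
        return v >= PATCHED[series]
    # Outside the listed stable series: 6.9+ ships the fix in mainline;
    # everything else is flagged for manual review.
    return v >= (6, 9, 0)

print(is_patched("6.1.90-1-amd64"))    # True
print(is_patched("5.15.150-generic"))  # False
```

On a live host, feed it platform.release() (the equivalent of uname -r). A False on a distro kernel means "check the vendor changelog for the backport", not "definitely vulnerable" - Ubuntu's linux-image-6.8.0-31 carries the fix despite the version number.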

Residual exposure post-patch. The patch closes the specific race in the migration path. The broader pattern - GUP-fast pins racing against page table mutations - has produced multiple CVEs over the last four years and will produce more. Anti-fragmentation and compaction are running on every Linux host continuously. The class of bug is not eliminated by this fix. Hardening that reduces blast radius regardless of the next variant: disable unprivileged user namespaces on hosts that don’t need them (kernel.unprivileged_userns_clone=0 on Debian-family, user.max_user_namespaces=0 on RHEL-family), block io_uring syscalls in seccomp profiles for untrusted workloads, and pin container runtimes to versions that ship hardened default profiles. None of those prevent exploitation by a local user with shell access on the host. They cut the reachable population to attackers who already have a foothold.
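The hardening knobs above, as commands. Apply them only where workloads do not legitimately need unprivileged user namespaces or io_uring; the seccomp change is a profile edit, not a sysctl, so it is sketched as a comment:

```shell
# Disable unprivileged user namespaces (pick the family that matches the host)
sysctl -w kernel.unprivileged_userns_clone=0   # Debian-family
sysctl -w user.max_user_namespaces=0           # RHEL-family
# Persist via a drop-in so the setting survives reboot
echo 'user.max_user_namespaces=0' > /etc/sysctl.d/99-userns.conf

# For untrusted containers, deny io_uring in the runtime's seccomp profile:
# add io_uring_setup, io_uring_register, and io_uring_enter to the blocked
# syscall list (default action SCMP_ACT_ERRNO) and roll the profile out to
# existing nodes -- runtime upgrades alone do not retrofit running workloads.
```

Note that user.max_user_namespaces=0 also breaks legitimate rootless container setups, so inventory before enforcing.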

The bottom-line technical reality. A reliable kernel LPE is in circulation. The primitive is generic. The grooming is well-understood. The detection surface is narrow and concentrated in eBPF cred-mutation hooks that most environments have not deployed. Patch latency on Linux servers historically runs 30 to 90 days from release to enterprise rollout. That window is the exposure. Anything multi-tenant - Kubernetes nodes, shared developer hosts, CI runners, hosting providers - is in scope until kernels are on 6.8.9 or the stable backport equivalent. Audit your kernel versions today. The exploit does not care whether the host is patched against last year’s bug.
