RC RANDOM CHAOS

Copy.fail has been root since 2017

Copy.fail turns an unprivileged Linux user into root via a copy_file_range credential cache flaw. Reachable since 2017. Telemetry gaps explained.


Copy.fail is a local privilege escalation in the Linux kernel’s file-copy fast path. The bug has been reachable since 2017. A standard user account, a writable temp directory, and roughly forty lines of Python are sufficient to land a root shell on a kernel that has not picked up the fix. Distribution coverage is uneven: long-term-support kernels on Debian stable, several Ubuntu LTS point releases, RHEL derivatives still on older 5.x trees, and a long tail of embedded and appliance images remain exposed. The CVSS v3 vector reflects what the primitive delivers - local attack vector, low privileges required, no user interaction, complete compromise of confidentiality, integrity, and availability. The base score sits at 7.8, which understates the operational impact because the exploit is reliable, fast, and produces almost nothing in default audit telemetry.
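The 7.8 figure can be reproduced from the CVSS v3.1 base-score equations. The sketch below assumes the vector AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H. The article does not print the vector string, so low attack complexity and unchanged scope are assumptions inferred from the description and the score. The metric weights are the ones published in the CVSS v3.1 specification.

```python
import math

# CVSS v3.1 metric weights for the ASSUMED vector
# AV:L/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H (not stated verbatim in the article).
AV_LOCAL = 0.55
AC_LOW = 0.77
PR_LOW_SCOPE_UNCHANGED = 0.62
UI_NONE = 0.85
CIA_HIGH = 0.56


def roundup(value: float) -> float:
    """Round up to one decimal place, following the CVSS v3.1 definition."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score_scope_unchanged(av, ac, pr, ui, c, i, a) -> float:
    """Base score for a scope-unchanged vector."""
    iss = 1 - (1 - c) * (1 - i) * (1 - a)      # impact sub-score
    impact = 6.42 * iss
    exploitability = 8.22 * av * ac * pr * ui
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


score = base_score_scope_unchanged(
    AV_LOCAL, AC_LOW, PR_LOW_SCOPE_UNCHANGED, UI_NONE,
    CIA_HIGH, CIA_HIGH, CIA_HIGH,
)
print(score)  # 7.8
```

The raw sum is about 7.71, and the specification's round-up rule takes it to 7.8. Nothing in the formula accounts for exploit reliability or telemetry silence, which is why the number reads lower than the operational risk.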

The bug class is improper privilege handling in the kernel’s optimised copy path. The kernel exposes copy_file_range as a syscall that lets userspace ask the kernel to copy bytes between two file descriptors without round-tripping through user buffers. On the same filesystem, the implementation can short-circuit the byte copy and instead share or clone backing extents at the block layer. That fast path was written assuming the source and destination descriptors carry consistent privilege and security context. That assumption does not hold in every case. When a descriptor is opened against a file the caller has read access to, then passed through a path that re-enters the copy logic against a target whose security checks were resolved earlier under a different credential set, the kernel ends up writing extents using the privilege of one context into a target governed by another. The trust boundary the syscall depends on is the file descriptor’s resolved permissions at the moment of the copy. The defect is that the resolution is cached too early relative to the path the descriptor takes through namespace, mount, or setuid transitions. The result is a TOCTOU on the credential check, not on the file data - a more dangerous shape because it bypasses the standard mitigations that protect against double-fetch on user buffers.
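For orientation, this is what ordinary, benign use of the syscall looks like from userspace. It is a minimal sketch using Python's os.copy_file_range (Linux, Python 3.8+), with a plain read/write fallback if the call is unavailable or the filesystem rejects it. The helper name kernel_copy and the file names are illustrative, and nothing here exercises the flaw.

```python
import os
import tempfile


def kernel_copy(src_path: str, dst_path: str) -> int:
    """Copy src to dst, preferring the in-kernel copy_file_range path."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    copied = 0
    try:
        remaining = os.fstat(src).st_size
        try:
            while remaining > 0:
                # No explicit offsets: the kernel advances both file positions.
                n = os.copy_file_range(src, dst, remaining)
                if n == 0:
                    break
                copied += n
                remaining -= n
        except (AttributeError, OSError):
            # Not on Linux, or the filesystem refused the fast path.
            # Continue from the current positions through user buffers.
            while True:
                chunk = os.read(src, 65536)
                if not chunk:
                    break
                view = memoryview(chunk)
                while view:
                    written = os.write(dst, view)
                    view = view[written:]
                copied += len(chunk)
    finally:
        os.close(src)
        os.close(dst)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "source.bin")
    dst_file = os.path.join(tmp, "copy.bin")
    data = b"example data\n" * 1000
    with open(src_file, "wb") as fh:
        fh.write(data)
    copied_bytes = kernel_copy(src_file, dst_file)
    with open(dst_file, "rb") as fh:
        result = fh.read()

print(copied_bytes, result == data)
```

The caller hands the kernel two descriptors and a length and never sees the bytes. Every permission decision on those descriptors therefore happens inside the kernel, which is where the credential resolution described above matters.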

The exploit primitive that falls out is an arbitrary write into a privileged-owned file with content the unprivileged caller chose. Pick a target the kernel writes during normal operation - anything root-owned and read by a setuid binary or a privileged service on its next invocation. The attacker prepares a source file holding the payload they want written. They obtain a descriptor on the target through whatever path the bug accepts - typically a sequence involving a bind mount, a user namespace, or a file opened then handed across a privilege boundary the kernel re-evaluates incorrectly. They issue copy_file_range from the source descriptor into the target descriptor at the offset they want overwritten. The kernel performs the write under the cached credential context. The target file now carries attacker-controlled bytes at attacker-chosen offsets. From there the path to root is mechanical. Overwrite a binary the attacker can cause to be executed as root. Overwrite a config file consumed by a privileged daemon. Overwrite an SUID helper at the entry point and trigger it. The Python proof-of-concept that has been circulating uses ctypes to call syscall(SYS_copy_file_range, …) directly. It is short because it has to be - the kernel does the work.

Weaponisation requires no kernel ROP, no SMEP or SMAP bypass, and no KASLR leak. The attacker never executes code in kernel context. They convince the kernel to execute its own write path under the wrong identity. That is the entire chain. It maps to MITRE ATT&CK T1068, Exploitation for Privilege Escalation, with no preceding T1203 step required because the attacker already has local code execution as the unprivileged user. In a multi-tenant context - shared developer hosts, CI runners, Kubernetes nodes where pod escape lands an attacker as a constrained user - the bug closes the gap from container-level access to host root in a single syscall sequence. On Kubernetes specifically, the relevance is that a container breakout that drops the attacker into the node’s user namespace at a low UID is no longer a partial win. Copy.fail finishes the chain.

Real-world exploitation is consistent with what red teams report from internal engagements through the first half of 2026. The bug surfaced in public advisories without an attached in-the-wild campaign attribution, which is typical for local LPEs - they appear after initial access, in the post-exploitation phase, and rarely show up in network telemetry that vendors mine for threat-actor write-ups. What is observable in incident response is the artefact: a process running under an unprivileged UID immediately followed by a privileged process spawning from a path that should not have been writable to that UID. Cobalt Strike’s Linux post-exploitation kits and Sliver implants both pick up local LPE primitives quickly once they stabilise. Public PoC code that lands on GitHub in a usable form has, in past LPE precedents - Dirty COW, Dirty Pipe, OverlayFS GameOver(lay) - been integrated into commodity tooling within weeks. There is no reason to expect Copy.fail to follow a different timeline.

What defenders see in telemetry is the part that should drive detection engineering. The syscall itself is unremarkable. copy_file_range is a normal Linux syscall used by cp on modern coreutils, by container runtimes, by backup utilities, by anything that touches large files on a single filesystem. Auditd default rule sets do not record it. Sysmon for Linux records process and file events but does not, by default, distinguish copy_file_range from a write. EDR vendors that hook syscalls via eBPF - CrowdStrike, SentinelOne, Elastic Defend, Sysdig - can observe it, but their default policies do not flag it. The signal is not the syscall. The signal is the combination: an unprivileged process invokes copy_file_range with a target descriptor that resolves to a root-owned file outside the caller’s normal write scope, followed within seconds by execution of that file under a privileged UID. That correlation is detectable. It requires a rule that joins the copy event against a process-execution event keyed on file path and UID transition. Few production detection stacks have that rule today.
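The join described above can be sketched in a few lines. This is an illustration under stated assumptions, not a vendor rule: the CopyEvent and ExecEvent schemas are hypothetical stand-ins for whatever a sensor actually emits, and the 10-second window is an arbitrary reading of "within seconds".

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class CopyEvent:            # hypothetical schema for a copy_file_range event
    ts: float               # event time, seconds
    pid: int
    uid: int                # UID of the calling process
    target_path: str        # path the destination descriptor resolves to
    target_owner_uid: int   # owner of the destination file


@dataclass(frozen=True)
class ExecEvent:            # hypothetical schema for a process-execution event
    ts: float
    pid: int
    euid: int               # effective UID of the new process
    path: str               # executed file


def correlate(
    copies: Iterable[CopyEvent],
    execs: Iterable[ExecEvent],
    window_s: float = 10.0,
) -> List[Tuple[CopyEvent, ExecEvent]]:
    """Pair unprivileged copies into root-owned files with a later root exec."""
    root_execs: Dict[str, List[ExecEvent]] = {}
    for ex in execs:
        if ex.euid == 0:
            root_execs.setdefault(ex.path, []).append(ex)

    hits = []
    for cp in copies:
        # Only unprivileged callers writing into root-owned targets are of interest.
        if cp.uid == 0 or cp.target_owner_uid != 0:
            continue
        for ex in root_execs.get(cp.target_path, []):
            if 0 <= ex.ts - cp.ts <= window_s:
                hits.append((cp, ex))
    return hits


copies = [
    CopyEvent(100.0, 4242, 1000, "/usr/local/bin/helper", 0),
    CopyEvent(200.0, 5000, 1000, "/home/dev/out.bin", 1000),
]
execs = [
    ExecEvent(103.5, 4300, 0, "/usr/local/bin/helper"),
    ExecEvent(400.0, 4400, 0, "/usr/local/bin/helper"),
]
alerts = correlate(copies, execs)
for cp, ex in alerts:
    print(f"uid {cp.uid} wrote {cp.target_path}; root exec {ex.ts - cp.ts:.1f}s later")
```

Only the first copy pairs with the first exec. The second copy targets a file the caller owns, and the second exec falls outside the window. A production rule would also need path normalisation and an allowlist for package managers, which this sketch omits.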

What fires in default configurations is thin. Falco’s standard ruleset will alert on writes to sensitive paths - /etc/passwd, /etc/shadow, /etc/sudoers - if the attacker chooses those targets. Most attackers will not. Writing a tampered binary into /usr/local/bin or replacing an SUID helper in a less-watched path produces no Falco alert under default rules. Auditd watches on /etc/shadow will fire if that is the target, but the kernel records the write as the calling UID, which the analyst then has to correlate against the fact that the file is root-owned and the caller had no write permission on the inode. That correlation is not automatic. The gap is the same gap that allowed Dirty Pipe to spread: the kernel performs the write, the audit subsystem records the write under the calling identity, and the security check that should have prevented it has already been bypassed before the audit hook runs. Detection has to move to the behavioural layer - UID transitions, unexpected privileged-process parents, file integrity monitoring on system binary paths - because the syscall layer is silent on the abuse.
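File integrity monitoring on binary paths is the simplest of those behavioural controls to prototype. The sketch below hashes every regular file under a directory and diffs two snapshots. The function names are illustrative, and a temporary directory stands in for a path such as /usr/local/bin. A real deployment would use an established FIM tool and keep the baseline off-host.

```python
import hashlib
import os
import tempfile
from typing import Dict, List


def snapshot(root: str) -> Dict[str, str]:
    """Map each regular file under root to its SHA-256 digest."""
    digests: Dict[str, str] = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            try:
                with open(path, "rb") as fh:
                    for block in iter(lambda: fh.read(65536), b""):
                        h.update(block)
            except OSError:
                continue  # unreadable file: skipped in this sketch
            digests[path] = h.hexdigest()
    return digests


def diff(baseline: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Report paths added, removed, or modified since the baseline."""
    shared = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "modified": sorted(p for p in shared if baseline[p] != current[p]),
    }


with tempfile.TemporaryDirectory() as root:
    helper = os.path.join(root, "helper")
    with open(helper, "wb") as fh:
        fh.write(b"original contents\n")
    baseline = snapshot(root)

    # Simulate tampering with the monitored file.
    with open(helper, "wb") as fh:
        fh.write(b"tampered contents\n")
    changes = diff(baseline, snapshot(root))

print(changes["modified"] == [helper])
```

Content hashing catches the change regardless of which UID the audit record attributes the write to, which is the property the syscall-level telemetry lacks.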

The fix is a set of mainline kernel commits in the copy_file_range and remap_file_range paths that re-resolve credentials at the point of the write rather than relying on the cached descriptor context. Stable backports landed in the 6.x series first, then 5.15, 5.10, and 5.4 LTS trees. Distribution kernels lag the upstream stable trees by days to weeks under normal release cadence and by months on appliances and embedded images that do not auto-update. Verifying the fix on a running host means checking the running kernel version against the distribution’s CVE tracker entry - Ubuntu’s USN, Debian’s DSA, Red Hat’s CVE database - not against the upstream stable tag, because distributions backport selectively. Residual exposure after patching is limited to the same class of bugs in adjacent syscalls - sendfile, splice, the io_uring copy operations - which share enough of the underlying infrastructure that the same credential-caching pattern could be present elsewhere. The fix for Copy.fail closes the specific path. The bug class remains worth auditing.
