RC RANDOM CHAOS

Thirty years of weaponizing fork-exec

fork+exec inherits file descriptors, environment, and capabilities by default. That inheritance is the bug class behind Shellshock, runc CVE-2019-5736, and Symbiote.

fork() followed by execve() is the canonical way to spawn a process on Unix. It is also a structural weakness that has been weaponised for thirty years and continues to be the entry point for container escapes, LD_PRELOAD hijacks, environment-variable injection, and credential theft from shared memory. The mechanism is not exotic. It is the default. That is the problem.

The model is documented in every operating systems textbook. fork() duplicates the calling process. The child receives a copy-on-write image of the parent’s address space, along with its file descriptor table, signal dispositions, environment, controlling terminal, working directory, real and effective UIDs, supplementary groups, namespaces, capabilities, and resource limits. execve() then replaces the whole address space with a new program image while preserving most of that inherited state. Caught signals are reset to their defaults and descriptors flagged close-on-exec are closed; almost everything else carries over. The new binary inherits the parent’s posture by default. Anything the parent had, the child gets, unless the parent took explicit action to drop it before the exec.
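A minimal sketch of that inheritance, assuming a POSIX host with Python 3. The parent changes its working directory, umask, and one environment variable, then spawns a fresh interpreter that reports what it sees. The names `PROBE`, `DEMO_MARKER`, and `observe_inherited_state` are invented for the illustration.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical child-side probe: report cwd, umask, and one env variable.
PROBE = (
    "import os\n"
    "mask = os.umask(0)\n"
    "print(os.getcwd())\n"
    "print(oct(mask))\n"
    "print(os.environ.get('DEMO_MARKER', ''))\n"
)


def observe_inherited_state() -> dict:
    """Change parent state, spawn a child, and return what the child saw."""
    old_cwd = os.getcwd()
    old_umask = os.umask(0o027)
    os.environ["DEMO_MARKER"] = "set-by-parent"
    try:
        with tempfile.TemporaryDirectory() as workdir:
            expected_cwd = os.path.realpath(workdir)
            os.chdir(workdir)
            try:
                lines = subprocess.run(
                    [sys.executable, "-c", PROBE],
                    capture_output=True, text=True, check=True,
                ).stdout.splitlines()
            finally:
                os.chdir(old_cwd)  # leave the directory before it is removed
    finally:
        os.umask(old_umask)
        del os.environ["DEMO_MARKER"]
    return {
        "expected_cwd": expected_cwd,
        "cwd": lines[0],
        "umask": lines[1],
        "marker": lines[2],
    }


if __name__ == "__main__":
    print(observe_inherited_state())
```

The child was told nothing about any of the three values. It simply woke up with them.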

The inherited state is the attack surface. File descriptors not marked FD_CLOEXEC remain open across exec. That includes sockets bound to privileged ports, file handles to /etc/shadow opened by a setuid parent, pipes connected to a parent’s stdin, and memfd_create() descriptors backing regions that contain key material. Environment variables traverse the boundary intact. LD_PRELOAD, LD_LIBRARY_PATH, LD_AUDIT, PYTHONPATH, NODE_OPTIONS, PERL5OPT, and GIT_SSH_COMMAND are all attacker-controllable inputs in the wrong context. The signal mask and any ignored signal dispositions persist across exec; only caught signals are reset to their defaults. The umask carries forward. The current working directory carries forward. CWE-200, CWE-272, CWE-426, CWE-427, CWE-15, CWE-454 - the bug classes are catalogued and the catalogue is long.
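The close-on-exec flag can be watched doing its job. The sketch below, assuming Linux or another POSIX host, creates a pipe, asks a child interpreter whether it can see the descriptor, then clears FD_CLOEXEC and asks again. `PROBE`, `child_view_of`, and `demo` are hypothetical names. Python creates descriptors close-on-exec by default (PEP 446), which is why the leak has to be switched on by hand here; C code gets the opposite default.

```python
import os
import subprocess
import sys

# Child-side probe: print the inode behind the descriptor, or "closed".
PROBE = (
    "import os, sys\n"
    "try:\n"
    "    print(os.fstat(int(sys.argv[1])).st_ino)\n"
    "except OSError:\n"
    "    print('closed')\n"
)


def child_view_of(fd: int) -> str:
    """Ask a freshly exec'd interpreter what it finds at descriptor `fd`."""
    # close_fds=False turns off subprocess's own descriptor sweep, so the
    # close-on-exec flag alone decides what crosses the exec boundary.
    result = subprocess.run(
        [sys.executable, "-c", PROBE, str(fd)],
        close_fds=False, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def demo() -> tuple:
    """Return (parent inode, child view while sealed, child view once leaked)."""
    read_end, write_end = os.pipe()  # close-on-exec by default in Python
    try:
        inode = str(os.fstat(read_end).st_ino)
        sealed = child_view_of(read_end)
        os.set_inheritable(read_end, True)  # clears FD_CLOEXEC
        leaked = child_view_of(read_end)
    finally:
        os.close(read_end)
        os.close(write_end)
    return inode, sealed, leaked


if __name__ == "__main__":
    print(demo())
```

When the flag is cleared, the child reports the same inode the parent holds: same pipe, different program.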

The canonical failure mode is the setuid binary that calls system(3). system() invokes /bin/sh -c with the supplied string. The shell consults PATH, IFS, and a long tail of environment variables before executing anything. Pre-glibc 2.0 IFS attacks against setuid binaries are the textbook case. Modern variants exist. A setuid wrapper calling popen() to invoke ifconfig with no absolute path is the same bug class: the shell resolves the name through a PATH the caller controls. CVE-2019-14287 in sudo, CVE-2021-3156 Baron Samedit, and CVE-2023-22809 sudoedit all live in the same neighbourhood - privileged process, attacker-controllable execution context, insufficient sanitisation before the exec boundary. NVD base scores for that cluster run from 7.8 to 8.8.
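The defensive counterpart is short: no shell, an absolute program path, and an environment built from an allowlist rather than inherited. The sketch below is one way to express that policy. `SAFE_PATH`, `ALLOWED_VARS`, `minimal_env`, and `run_hardened` are assumptions made for the example, not an established API, and the allowlist would need tuning per application.

```python
import os
import subprocess

# Assumed policy for illustration: a fixed PATH and a short allowlist.
SAFE_PATH = "/usr/sbin:/usr/bin:/sbin:/bin"
ALLOWED_VARS = ("LANG", "LC_ALL", "TZ")


def minimal_env(parent_env=None) -> dict:
    """Build a child environment from the allowlist instead of inheriting."""
    source = os.environ if parent_env is None else parent_env
    env = {name: source[name] for name in ALLOWED_VARS if name in source}
    env["PATH"] = SAFE_PATH  # never pass the caller's PATH through
    return env


def run_hardened(argv, parent_env=None) -> subprocess.CompletedProcess:
    """Run argv[0] directly: no shell, absolute path, cleansed environment."""
    if isinstance(argv, (str, bytes)):
        raise TypeError("pass an argv list, not a shell command string")
    if not os.path.isabs(argv[0]):
        raise ValueError(f"program must be an absolute path: {argv[0]!r}")
    return subprocess.run(
        list(argv),
        env=minimal_env(parent_env),
        shell=False,      # no /bin/sh -c, so no IFS or PATH parsing by a shell
        close_fds=True,   # sweep stray descriptors before the exec
        capture_output=True,
        text=True,
        check=False,
    )
```

A call such as `run_hardened(["ifconfig"])` is rejected before anything is spawned, which is the point: the PATH lookup never happens.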

LD_PRELOAD is the cleanest demonstration. A non-privileged user sets LD_PRELOAD in their environment. A program runs. The dynamic linker maps the attacker’s shared object before libc. The attacker’s functions interpose on libc symbols. getuid() returns whatever the attacker wants. fopen() logs the path. crypt() leaks the input. The mechanism works against any dynamically linked binary that does not run in secure-execution mode, which the dynamic linker enters only when the kernel sets AT_SECURE - typically for setuid, setgid, or file-capability binaries. For the overwhelming majority of processes spawned via fork+exec, LD_PRELOAD is honoured. MITRE T1574.006, Dynamic Linker Hijacking. The Symbiote Linux rootkit documented by BlackBerry and Intezer used LD_PRELOAD interposition delivered via /etc/ld.so.preload to hook libc functions across every spawned process on the host. Detection required correlating ld.so.preload modification with subsequent process spawns. Most SIEMs did not have the rule.
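On Linux the environment a process was exec'd with is readable from /proc/<pid>/environ as a NUL-separated block, which makes a first-pass check cheap. The sketch below parses such a block for loader overrides and lists entries in the system-wide preload file. The function names are hypothetical, and the preload parser assumes the whitespace-separated format described in ld.so(8); treat it as a starting point, not a detector.

```python
# Loader-related variables named in the text above.
LOADER_VARS = (b"LD_PRELOAD", b"LD_LIBRARY_PATH", b"LD_AUDIT")


def loader_overrides(environ_block: bytes) -> dict:
    """Return loader variables found in a NUL-separated environ block."""
    found = {}
    for entry in environ_block.split(b"\0"):
        name, sep, value = entry.partition(b"=")
        if sep and name in LOADER_VARS:
            found[name.decode("ascii")] = value.decode("utf-8", "replace")
    return found


def scan_process(pid="self") -> dict:
    """Inspect the exec-time environment of one process (Linux procfs)."""
    with open(f"/proc/{pid}/environ", "rb") as handle:
        return loader_overrides(handle.read())


def global_preload_entries(path="/etc/ld.so.preload") -> list:
    """List libraries named in the system-wide preload file, if it exists."""
    try:
        with open(path, encoding="utf-8", errors="replace") as handle:
            return handle.read().split()
    except FileNotFoundError:
        return []
```

Reading another user's environ requires matching credentials or elevated privilege, so a fleet-wide sweep has to run as a suitably privileged collector.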

The file descriptor leak path is where container escapes get interesting. A container runtime that calls fork+exec while holding a descriptor to the host filesystem - a directory file descriptor obtained via open(O_PATH) on the host root before namespace entry - and fails to set FD_CLOEXEC hands that descriptor to the child. CVE-2019-5736, the runc breakout, exploited a related primitive - overwriting the runc binary itself via /proc/self/exe from a process the runtime had just spawned into the container. The mechanism worked because the child process, after exec, retained a writeable handle path back to the host runtime binary through procfs. The fix was to have runc re-exec itself from a memfd copy before entering the container. The bug class is descriptor inheritance across a trust boundary that the developer assumed was sealed.

The environment variable path keeps producing CVEs. CVE-2014-6271 Shellshock was bash parsing function definitions out of exported environment variables on startup. Any CGI process spawned via fork+exec from a web server inherited the request’s HTTP headers as environment variables. An attacker-controlled User-Agent reached bash’s function parser. RCE on any Apache mod_cgi host where a CGI handler invoked bash, directly or through system(). CVSS 10.0. The structural cause was not bash. The structural cause was that fork+exec is a transparent conduit for environment state, and the web server populated that state from untrusted network input. Bash was the trigger. The pipeline was the weapon.
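The conduit is easy to reproduce on paper. CGI (RFC 3875) maps each request header to an environment variable by upper-casing the name, replacing hyphens with underscores, and prefixing `HTTP_`. The sketch below performs that mapping and flags values with the `() {` prefix that unpatched bash treated as a function definition. Both function names are hypothetical, and the prefix check is shown to make the mechanism visible, not as a mitigation.

```python
def cgi_environment(headers: dict) -> dict:
    """Map request headers to CGI-style HTTP_* environment variables."""
    env = {}
    for name, value in headers.items():
        env["HTTP_" + name.upper().replace("-", "_")] = value
    return env


def looks_like_function_import(value: str) -> bool:
    """True if a value has the prefix pre-patch bash parsed as a function."""
    return value.startswith("() {")


if __name__ == "__main__":
    request_headers = {
        "User-Agent": "() { :; }; echo pwned",  # classic Shellshock probe shape
        "Accept": "*/*",
    }
    for key, value in cgi_environment(request_headers).items():
        print(key, looks_like_function_import(value))
```

Nothing in the mapping is a bug. Untrusted bytes arrive on a socket and leave as process state, exactly as specified.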

The shared memory and IPC inheritance path is quieter and worse. POSIX shared memory segments, SysV shm attachments, and anonymous mmap regions marked MAP_SHARED all survive fork. If the parent loaded a private key into an mmap region for performance, the child has it, and any attacker-influenced code that runs in the child before exec can read it. exec tears the mappings themselves down, but not the descriptors behind them: a shm_open() or memfd_create() descriptor left without FD_CLOEXEC crosses the boundary, and an attacker-controlled binary on the far side can map it again and read the key. Memory disclosure across fork and exec is not a vulnerability in either call. It is the documented behaviour. Process credential isolation in the Unix model assumes the parent knew what it was doing before the call. Most parents did not.
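The fork half of this is simple to demonstrate. In the sketch below, assuming a POSIX host, the parent places a secret in an anonymous shared mapping, forks, and the child copies the secret into a second shared region that the parent then reads back. `forked_child_copy` and both region names are invented for the example, and the secret must be non-empty because a zero-length mapping is not allowed.

```python
import mmap
import os


def forked_child_copy(secret: bytes) -> bytes:
    """Show that a forked child can read, and write back through, shared maps."""
    # fileno -1 gives an anonymous mapping; MAP_SHARED is the default on Unix.
    key_region = mmap.mmap(-1, len(secret))
    exfil_region = mmap.mmap(-1, len(secret))
    key_region[:] = secret
    try:
        pid = os.fork()
        if pid == 0:
            # Child: no exec has happened, so both mappings are still live.
            status = 1
            try:
                exfil_region[:] = key_region[:]
                status = 0
            finally:
                os._exit(status)  # never return into the parent's call stack
        _, wait_status = os.waitpid(pid, 0)
        if wait_status != 0:
            raise RuntimeError("child failed to copy the region")
        return exfil_region[:]
    finally:
        key_region.close()
        exfil_region.close()


if __name__ == "__main__":
    print(forked_child_copy(b"not-a-real-key"))
```

The child needed no descriptor, no path, and no permission check to reach the key. It inherited the mapping along with everything else.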

Namespace inheritance compounds the problem in containerised environments. A process running in the host network namespace that forks and execs without setns or unshare hands the child full host network reach. A privileged container that forks and execs a user-supplied binary without dropping CAP_SYS_ADMIN, CAP_NET_ADMIN, and the rest of the capability set hands the child the keys to the kernel. The capability inheritance rules - permitted, effective, inheritable, bounding, ambient - are documented in capabilities(7) and are widely regarded as one of the most error-prone APIs in Linux. CVE-2022-0492 was cgroup v1 release_agent abuse from inside a container. The exploit primitive was writing to a cgroup file, and the kernel did not check that the writer held CAP_SYS_ADMIN in the initial user namespace. Whether that file was reachable at all depended on what the container runtime left in place across exec: the capability set, the seccomp profile, and the LSM policy.
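A process can audit its own capability sets before it execs anything, because Linux exposes all five as hex masks in /proc/self/status. The sketch below parses those lines and tests for the two capabilities named above. The bit numbers (CAP_NET_ADMIN = 12, CAP_SYS_ADMIN = 21) come from linux/capability.h; `capability_sets`, `holds`, and `risky_sets` are hypothetical helper names.

```python
CAP_NET_ADMIN = 12  # bit positions from linux/capability.h
CAP_SYS_ADMIN = 21
SET_NAMES = ("CapInh", "CapPrm", "CapEff", "CapBnd", "CapAmb")


def capability_sets(status_text: str) -> dict:
    """Extract the five capability masks from /proc/<pid>/status text."""
    sets = {}
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name in SET_NAMES:
            sets[name] = int(value.strip(), 16)
    return sets


def holds(mask: int, capability: int) -> bool:
    """True if the capability's bit is set in the mask."""
    return bool((mask >> capability) & 1)


def risky_sets(status_text: str) -> dict:
    """Map each capability set to the high-risk capabilities it contains."""
    watched = {"CAP_NET_ADMIN": CAP_NET_ADMIN, "CAP_SYS_ADMIN": CAP_SYS_ADMIN}
    report = {}
    for set_name, mask in capability_sets(status_text).items():
        held = [label for label, bit in watched.items() if holds(mask, bit)]
        if held:
            report[set_name] = held
    return report


if __name__ == "__main__":
    with open("/proc/self/status", encoding="ascii") as handle:
        print(risky_sets(handle.read()))
```

An empty report before exec is the goal. Anything left in the bounding or ambient set is worth a second look, since those sets shape what the next binary can end up holding.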

The replacement primitives exist and have existed for years. posix_spawn() with POSIX_SPAWN_SETSIGMASK, POSIX_SPAWN_SETSIGDEF, and explicit file action lists. Linux clone3() with CLONE_CLEAR_SIGHAND, CLONE_INTO_CGROUP, and explicit namespace flags. fexecve() to exec from a verified file descriptor rather than a path subject to TOCTOU. closefrom() and close_range() to seal descriptor inheritance in one call rather than iterating /proc/self/fd. execveat() with AT_EMPTY_PATH for descriptor-based execution. The kernel and libc have shipped these. Application code mostly has not adopted them. Newer glibc releases implement system() on top of posix_spawn rather than a literal fork, but it still hands /bin/sh -c the caller’s full environment and every descriptor not marked close-on-exec. Language runtimes - Python subprocess, Ruby Process.spawn, Node child_process - use some mix of posix_spawn, vfork, and fork+exec depending on platform and options, and all of them pass the parent’s environment through unless the caller supplies a cleansed one.
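Python exposes posix_spawn directly as `os.posix_spawn`, which makes the explicit style easy to try. The sketch below, assuming Linux and Python 3.9 or later, spawns a child with a caller-supplied environment, an empty signal mask, default dispositions for three signals, stdin from /dev/null, and stdout redirected into a pipe. `spawn_sealed` is a hypothetical wrapper. Descriptors not named in the file actions still follow their own close-on-exec flags, and stderr is left inherited.

```python
import os
import signal
import sys


def spawn_sealed(argv, env) -> tuple:
    """Spawn argv with explicit env, signals, stdin, and stdout; return (code, output)."""
    read_end, write_end = os.pipe()
    file_actions = [
        (os.POSIX_SPAWN_OPEN, 0, os.devnull, os.O_RDONLY, 0),  # stdin from /dev/null
        (os.POSIX_SPAWN_DUP2, write_end, 1),                   # stdout into the pipe
    ]
    try:
        pid = os.posix_spawn(
            argv[0], argv, env,
            file_actions=file_actions,
            setsigmask=[],  # child starts with no signals blocked
            # Reset dispositions the parent may have set to "ignore".
            setsigdef=[signal.SIGINT, signal.SIGTERM, signal.SIGPIPE],
        )
    except BaseException:
        os.close(read_end)
        raise
    finally:
        os.close(write_end)  # the parent keeps only the read side
    with os.fdopen(read_end, "rb") as pipe:
        output = pipe.read()
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status), output


if __name__ == "__main__":
    probe = "import os; print(sorted(os.environ))"
    print(spawn_sealed([sys.executable, "-c", probe], {"PATH": "/usr/bin:/bin"}))
```

The SIGPIPE entry is not decoration. CPython ignores SIGPIPE at startup, and an ignored disposition survives exec unless something resets it.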

In telemetry the abuse is partially visible and largely missed. Sysmon for Linux Event ID 1, process create, captures the parent-child relationship and the command line. It does not capture inherited file descriptors. It does not capture the environment block unless explicitly configured, and most deployments do not configure it because the volume is punishing. auditd’s EXECVE records carry argv but not envp. eBPF-based EDR - Falco, Tetragon, CrowdStrike’s Linux sensor - can capture envp, capability sets at exec time, and the file descriptor table snapshot via tracepoints on sched_process_exec, but the rules to alert on suspicious LD_PRELOAD values, on inherited descriptors to sensitive paths, on capability sets retained across container entry, are not shipped by default. The data is reachable. The detection is not written. T1055 process injection, T1574 hijack execution flow, T1611 escape to host - the technique IDs exist, the coverage gap is in the rule pack.
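The missing rules are not complicated. The sketch below evaluates the three conditions named above against a single exec record. `ExecEvent` and every field on it are a hypothetical normalised schema invented for this example, not the output format of any real sensor, and `SENSITIVE_PATHS` is a placeholder list.

```python
from dataclasses import dataclass, field

LOADER_VARS = ("LD_PRELOAD", "LD_AUDIT", "LD_LIBRARY_PATH")
SENSITIVE_PATHS = ("/etc/shadow",)  # placeholder; extend per environment
CAP_SYS_ADMIN = 21                  # bit position from linux/capability.h


@dataclass
class ExecEvent:
    """Hypothetical exec record; field names are assumptions, not a real schema."""
    pid: int
    path: str
    env: dict = field(default_factory=dict)
    effective_caps: int = 0
    inherited_fd_paths: list = field(default_factory=list)
    in_container: bool = False


def findings(event: ExecEvent) -> list:
    """Return (tag, detail) pairs for each rule the event trips."""
    hits = []
    for name in LOADER_VARS:
        if name in event.env:
            hits.append(("T1574.006", f"{name}={event.env[name]}"))
    for fd_path in event.inherited_fd_paths:
        if fd_path.startswith(SENSITIVE_PATHS):
            hits.append(("inherited-fd", fd_path))
    if event.in_container and (event.effective_caps >> CAP_SYS_ADMIN) & 1:
        hits.append(("T1611", "CAP_SYS_ADMIN retained inside container"))
    return hits
```

Each rule is one comparison once the sensor delivers the field. The engineering cost sits in collecting envp, the descriptor table, and the capability masks at exec time, not in the logic.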

The patch boundary on fork+exec is conceptual, not a CVE. Every privileged binary that calls a subprocess is a candidate. Every container runtime that spawns user code is a candidate. Every CGI-adjacent architecture that pipes network input into environment variables is a candidate. The residual exposure post-any-individual-patch is the next privileged binary, the next runtime, the next CGI handler that inherits state it did not audit. The model is the bug. Until the calling code treats every byte of inherited process state as attacker-influenced and explicitly cleanses it before the exec, the next CVE in this class is already written.
