Restartable Sequences: The Lockless Trick Unlocking 40x Speedups on Many-Core CPUs

Linux’s restartable sequences (rseq), available since kernel 4.18 in 2018, let threads mutate per-CPU data structures without locks or atomics by cooperating with the kernel scheduler. A thread registers a 32-byte TLS region; the kernel keeps the current CPU number up to date there and, if it preempts or migrates the thread (or delivers a signal to it) inside a registered critical section, redirects execution to an abort handler that restarts the operation. This sidesteps the cacheline-contention penalty that makes atomics crawl on high-core-count machines.
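The restart contract described above can be modeled in a few lines of C. This is a sketch for illustration only, not the kernel's code and not a working rseq registration. The two structs are local copies of the layouts in the kernel's linux/rseq.h as originally introduced. The names rseq_area, rseq_cs_desc, and resume_ip_after_preempt are hypothetical, and the alignment attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the original 32-byte per-thread area. Field names follow
 * linux/rseq.h; the struct name is hypothetical (not the uapi header). */
struct rseq_area {
    uint32_t cpu_id_start; /* CPU number, always readable without checks */
    uint32_t cpu_id;       /* CPU number, or a negative code if unregistered */
    uint64_t rseq_cs;      /* pointer to the active critical-section descriptor */
    uint32_t flags;
} __attribute__((aligned(32)));

/* Local mirror of the critical-section descriptor a thread publishes
 * before entering a restartable sequence. */
struct rseq_cs_desc {
    uint32_t version;
    uint32_t flags;
    uint64_t start_ip;           /* first instruction of the sequence */
    uint64_t post_commit_offset; /* length in bytes, up to and including the commit */
    uint64_t abort_ip;           /* where to resume if the sequence is interrupted */
} __attribute__((aligned(32)));

_Static_assert(sizeof(struct rseq_area) == 32, "per-thread area is 32 bytes");

/* Hypothetical model of the decision made when an interrupted thread is about
 * to return to user space: if the saved instruction pointer lies inside
 * [start_ip, start_ip + post_commit_offset), resume at abort_ip instead.
 * The unsigned subtraction wraps for ip < start_ip, so one compare suffices. */
static uint64_t resume_ip_after_preempt(const struct rseq_cs_desc *cs,
                                        uint64_t ip)
{
    if (cs != NULL && ip - cs->start_ip < cs->post_commit_offset)
        return cs->abort_ip;
    return ip;
}
```

With a descriptor covering addresses 0x1000 through 0x101f and an abort address of 0x2000, an interruption at 0x1010 resumes at 0x2000, while one at 0x1020 (just past the commit) resumes where it left off. The key property is that the final store of the sequence is its last instruction, so an interrupted thread has either committed or has changed nothing visible.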

Justine Tunney reports dramatic gains in cosmopolitan’s malloc: 3x on a 4-core Raspberry Pi 5, 34x on a 128-core Ampere Altra, and 43x on a 96-core AMD Threadripper Pro 7995WX, all compared to sharded mspace allocators keyed by sched_getcpu(). Mutex-per-CPU sharding still loses because of cost asymmetry: even an uncontended lock costs around 15 ns, versus roughly 1 ns for a thread-local push or pop. Prior workarounds such as pinning threads with sched_setaffinity or taking RTOS-style control of the scheduler are brittle. Rseq removes the mutex entirely while remaining safe under preemption.
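For comparison, the mutex-sharded baseline might look roughly like the sketch below. It is not cosmopolitan's allocator: the free list of struct node, the fixed NSHARDS count, and the steal-from-neighbors fallback in shard_pop are all assumptions made to keep the example small. It does show why the lock cannot simply be dropped: the thread may migrate to another CPU right after sched_getcpu() returns, so two threads can land on the same shard.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define NSHARDS 64 /* assumption: fixed shard count, CPUs map modulo this */

struct node { struct node *next; };

/* One free list per shard, padded to a cache line to avoid false sharing. */
struct shard {
    pthread_mutex_t lock;
    struct node *head;
} __attribute__((aligned(64)));

static struct shard shards[NSHARDS];
static pthread_once_t shards_once = PTHREAD_ONCE_INIT;

static void shards_init(void)
{
    for (int i = 0; i < NSHARDS; i++) {
        pthread_mutex_init(&shards[i].lock, NULL);
        shards[i].head = NULL;
    }
}

/* Pick the shard for the CPU we are running on right now. The answer can be
 * stale by the time the caller uses it, which is why each shard is locked. */
static unsigned home_shard(void)
{
    pthread_once(&shards_once, shards_init);
    int cpu = sched_getcpu();
    return cpu < 0 ? 0u : (unsigned)cpu % NSHARDS;
}

static void shard_push(struct node *n)
{
    struct shard *s = &shards[home_shard()];
    pthread_mutex_lock(&s->lock); /* uncontended in the common case */
    n->next = s->head;
    s->head = n;
    pthread_mutex_unlock(&s->lock);
}

/* Pop from the home shard first, then fall back to the other shards. */
static struct node *shard_pop(void)
{
    unsigned home = home_shard();
    for (unsigned i = 0; i < NSHARDS; i++) {
        struct shard *s = &shards[(home + i) % NSHARDS];
        pthread_mutex_lock(&s->lock);
        struct node *n = s->head;
        if (n != NULL)
            s->head = n->next;
        pthread_mutex_unlock(&s->lock);
        if (n != NULL)
            return n;
    }
    return NULL;
}
```

Every push and pop pays for a lock and unlock pair even when no other thread touches the shard. An rseq version would replace that pair with a restartable sequence whose last instruction stores the new head pointer.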

Adoption so far is limited to tcmalloc, jemalloc, glibc, and cosmopolitan, and using rseq currently requires handwritten assembly because no mainstream systems language can express the restart contract. Tunney argues this will change as cheap 128- and 192-core chips arrive, and that rseq belongs in future OS APIs, language designs, and standard data-structure libraries as the default way to scale lockless code.