Branchless Quicksort beats std::sort and pdqsort by dodging mispredictions
Original source: Branchless Quicksort faster than std:sort and pdqsort with C and C++ API (Hacker News)
A new sorting library called blqsort claims to outperform std::sort and pdqsort on both Apple M1 and AMD Ryzen hardware when sorting 50 million doubles, with multithreaded variants running a further 3-4x faster on the M1. The core idea is to eliminate conditional branches from the partitioning hot path, since a branch misprediction on a modern CPU costs more than the redundant work of unconditionally performing the operations for both outcomes of a comparison.
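To make the idea concrete, here is a minimal sketch of a partition loop whose only branch is the loop condition. It is a generic branchless Lomuto-style partition, not blqsort's actual code; the function name `partition_branchless` and its signature are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of blqsort: moves every element < pivot
// to the front of a[0..n) and returns how many there are.
std::size_t partition_branchless(double* a, std::size_t n, double pivot) {
    assert(a != nullptr || n == 0);
    std::size_t store = 0;  // a[0..store) holds elements < pivot
    for (std::size_t i = 0; i < n; ++i) {
        const double x = a[i];
        // Swap unconditionally: the work is done for every element,
        // whether or not it belongs on the left side.
        a[i] = a[store];
        a[store] = x;
        // The comparison result (0 or 1) is added as a number, so the
        // outcome never steers a conditional jump.
        store += static_cast<std::size_t>(x < pivot);
    }
    return store;
}
```

Every element is swapped whether or not it needed to move, which is the redundant work the summary refers to; in exchange, the data-dependent comparison feeds an addition rather than a hard-to-predict jump.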
The implementation borrows the auxiliary buffer trick from fluxsort, using a 1024-element stack array to shuffle elements left or right of the pivot without branching, even though it more than doubles the copy operations. To stay robust on adversarial inputs, it falls back to heapsort when partitioning becomes unbalanced, groups duplicates, detects already-sorted runs, and uses median-of-medians for pivot selection. Small subarrays of 2-12 elements are handled by hand-tuned sorting networks built on a branchless sort-2 primitive.
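The buffer trick and the sort-2 primitive described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: the names `partition_with_buffer`, `sort2`, and `sort3` are hypothetical, the sketch handles only a block that fits entirely in the 1024-element buffer, and whether the conditional expressions compile to branch-free instructions depends on the compiler and target.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

constexpr std::size_t kBufSize = 1024;  // buffer size quoted in the summary

// Hypothetical sketch: stable partition of a block of at most kBufSize
// elements. Each element is written twice (once into the array, once into
// the buffer) and only one of the two cursors advances.
std::size_t partition_with_buffer(double* a, std::size_t n, double pivot) {
    assert(n <= kBufSize);
    double buf[kBufSize];
    std::size_t left = 0, right = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const double x = a[i];
        const std::size_t less = static_cast<std::size_t>(x < pivot);
        a[left] = x;     // safe: left <= i, so this slot was already read
        buf[right] = x;
        left += less;
        right += 1 - less;
    }
    // Append the elements >= pivot after the elements < pivot.
    std::copy(buf, buf + right, a + left);
    return left;
}

// Hypothetical branchless sort-2: both results are computed with
// conditional expressions that compilers commonly lower to min/max or
// conditional-move instructions.
inline void sort2(double& a, double& b) {
    const double lo = b < a ? b : a;
    const double hi = b < a ? a : b;
    a = lo;
    b = hi;
}

// A 3-element sorting network built from sort2: (0,1), (1,2), (0,1).
inline void sort3(double& a, double& b, double& c) {
    sort2(a, b);
    sort2(b, c);
    sort2(a, b);
}
```

The double write plus the final copy-back is where the extra copy operations come from. The sequence of compare-exchange steps in `sort3` is fixed in advance and does not depend on the data, which is what makes sorting networks a natural fit for a branchless primitive.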
For types that aren’t trivially copyable, like strings, the buffer approach loses its edge, so blqsort switches to a BlockQuicksort variant that permutes indices branchlessly and then moves the heavier payloads with fewer swaps. The library ships as single-header C and C++ implementations with single-threaded and threaded variants, and is drop-in compatible with std::sort-style usage for custom comparators and structs, a flexibility that SIMD-based sorters like Google Highway struggle to match.
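The index-based idea can be illustrated with a simplified sketch. It is not BlockQuicksort as published, which works on fixed-size blocks from both ends of the array, and it is not blqsort's code; it processes the whole array in one go, and the name `partition_by_index` is hypothetical. What it shares with the description above is that only small integers are shuffled branchlessly, and each heavy string takes part in at most one swap.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: partition strings around a pivot by first
// collecting indices of misplaced elements, then swapping them pairwise.
std::size_t partition_by_index(std::vector<std::string>& v,
                               const std::string& pivot) {
    const std::size_t n = v.size();
    std::vector<unsigned char> is_less(n);
    std::size_t k = 0;  // number of elements < pivot, i.e. the split point
    for (std::size_t i = 0; i < n; ++i) {
        is_less[i] = v[i] < pivot;
        k += is_less[i];
    }
    // Record indices unconditionally; the cursor advances only when the
    // element sits on the wrong side of the split point.
    std::vector<std::size_t> left(k), right(n - k);
    std::size_t nl = 0, nr = 0;
    for (std::size_t i = 0; i < k; ++i) {
        left[nl] = i;
        nl += !is_less[i];
    }
    for (std::size_t i = k; i < n; ++i) {
        right[nr] = i;
        nr += is_less[i];
    }
    assert(nl == nr);  // misplaced elements always pair up
    for (std::size_t j = 0; j < nl; ++j) {
        std::swap(v[left[j]], v[right[j]]);  // one swap per misplaced pair
    }
    return k;
}
```

For example, partitioning `{"pear", "apple", "fig", "banana", "kiwi"}` around `"grape"` returns 3 and performs a single swap, exchanging `"pear"` with `"banana"`.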
This is an AI-generated summary. Read the original on Hacker News for the full story.