
TurboQuant: Compressing AI vectors to 2-4 bits without losing accuracy

via Hacker News

TurboQuant: A first-principles walkthrough


TurboQuant is a vector quantization scheme that compresses high-dimensional model data — KV caches, embeddings, attention keys — down to 2-4 bits per coordinate with provably near-optimal distortion. Unlike production quantizers (GPTQ, AWQ, KIVI, KVQuant) that pay a metadata tax by storing per-block scale and zero-point values in float16 to handle outlier channels, TurboQuant carries no per-block headers, requires no calibration, and needs no training pass.
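To make the metadata tax concrete, here is a back-of-the-envelope sketch (not taken from the article) of effective bits per value under a typical block-scaled layout. The 4-bit payload, block size of 32, and float16 scale plus zero point are illustrative assumptions in the style of common block quantizers, not the exact layout of TurboQuant or any named baseline:

```python
# Effective bits per value = payload bits + amortized per-block header bits.
# Illustrative numbers only; block size and header widths are assumptions.

def effective_bits(payload_bits: int, block_size: int, header_bits: int) -> float:
    """Payload bits per value plus per-block header cost spread over the block."""
    return payload_bits + header_bits / block_size

# Block-scaled quantizer: 4-bit codes + float16 scale + float16 zero point per block.
block_scaled = effective_bits(payload_bits=4, block_size=32, header_bits=16 + 16)

# Header-free quantizer at the same nominal width.
header_free = effective_bits(payload_bits=4, block_size=32, header_bits=0)

print(f"block-scaled: {block_scaled:.2f} bits/value")  # 5.00
print(f"header-free:  {header_free:.2f} bits/value")   # 4.00
```

At these (assumed) settings the per-block headers add a full extra bit per value, a 25% overhead that a header-free scheme avoids.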

The core insight is geometric: in high dimensions, applying a random rotation to any input vector produces coordinates that are approximately Gaussian, regardless of how the input's mass was originally distributed. That predictability lets a single codebook, designed once for the rotated distribution, work uniformly across every input. Outlier channels — the adversarial case that breaks naive fixed grids and forces production systems into per-block scaling — get smeared across coordinates by the rotation, removing the need for adaptive ranges.
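A minimal sketch of the smearing effect, assuming a Haar-random orthogonal matrix as the rotation (the summary does not specify the article's rotation family; randomized Hadamard transforms are a common fast alternative in this literature). A spike vector with all of its mass on one coordinate comes out with roughly unit-variance, Gaussian-looking coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Random orthogonal matrix via QR of a Gaussian matrix; the sign fix
# makes Q exactly Haar (uniformly) distributed over rotations.
Z = rng.standard_normal((d, d))
Q, R = np.linalg.qr(Z)
Q = Q * np.sign(np.diag(R))

# Adversarial spike: all mass on one coordinate, norm sqrt(d).
x = np.zeros(d)
x[0] = np.sqrt(d)

y = Q @ x  # rotated vector: same norm, mass spread across all coordinates

print(f"max |coord| before rotation: {np.abs(x).max():.2f}")  # ~11.31
print(f"max |coord| after rotation:  {np.abs(y).max():.2f}")  # ~3, Gaussian-like tail
print(f"coordinate std after rotation: {y.std():.2f}")        # ~1.0
```

Because the rotation preserves the norm while spreading it over all 128 coordinates, a single fixed grid sized for a standard Gaussian now covers the spike input just as well as any other input.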

The walkthrough builds the construction from primitives (vectors, MSE, bias-vs-variance, the central limit theorem, high-dimensional concentration) before showing how a fixed grid fails on spike inputs in 2D, 3D, and at d=128. The payoff is lower effective bits-per-value than block-scaled approaches at equivalent reconstruction quality, with unbiased inner-product estimation preserved — which matters because attention scores and nearest-neighbor lookups are inner products.
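One standard mechanism for unbiased inner products is randomized (stochastic) rounding: each coordinate rounds up or down with probabilities proportional to its distance from the two neighboring grid points, so the quantized vector equals the original in expectation, and hence so does any inner product taken with it. The sketch below demonstrates that property on a plain uniform grid; whether TurboQuant's codebook uses exactly this dither is not stated in the summary:

```python
import numpy as np

rng = np.random.default_rng(1)

def stochastic_round(x: np.ndarray, step: float) -> np.ndarray:
    """Round each coordinate to a grid of spacing `step`, up or down at
    random with probabilities chosen so that E[q(x)] = x (unbiased)."""
    scaled = x / step
    lo = np.floor(scaled)
    q = lo + (rng.random(x.shape) < (scaled - lo))
    return q * step

d = 128
x = rng.standard_normal(d)
y = rng.standard_normal(d)

exact = x @ y
# Average the quantized inner product over many independent roundings:
# the rounding noise is zero-mean and independent of y, so the mean
# converges to the exact inner product.
est = np.mean([stochastic_round(x, step=0.5) @ y for _ in range(2000)])

print(f"exact:     {exact:+.4f}")
print(f"estimated: {est:+.4f}")  # close to exact; gap shrinks with more trials
```

Unbiasedness is what lets attention scores and nearest-neighbor distances computed on quantized data stay correct on average rather than drifting systematically.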


This is an AI-generated summary. Read the original for the full story.