Data Compression Explained: Why Perfect Compression Is Mathematically Impossible
Matt Mahoney’s free 2012 book is a self-contained primer on how data compression actually works, aimed at programmers with some math background. Every compressor reduces to two parts: a model that estimates how likely each symbol is, and a coder that hands the likeliest symbols the shortest codes — Morse code is the classic example, giving E and T short sequences and Q and Z long ones. Coding is a solved problem with known optimal solutions, but modeling is not: optimal modeling has been proven uncomputable, which is why the author frames prediction as both an art and an artificial-intelligence problem. Lossy compression adds a transform that separates perceptually important data from the rest, and judging what the eye and ear won’t miss is itself an AI problem.
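The model/coder split can be sketched in a few lines of Python. This is an illustration rather than code from the book: the function names and the sample string are hypothetical, and the model is the simplest one possible (plain symbol frequencies). The "coder" here only reports the ideal cost of log2(1/p) bits per symbol; it does not emit actual bits.

```python
import math
from collections import Counter

def order0_model(text):
    """Model: estimate each symbol's probability from its frequency in text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {sym: n / total for sym, n in counts.items()}

def ideal_code_lengths(model):
    """Coder target: a symbol of probability p ideally costs log2(1/p) bits."""
    return {sym: math.log2(1 / p) for sym, p in model.items()}

# Hypothetical sample text, chosen only so that some letters are common.
sample = "seventeen trees meet the steel street"
model = order0_model(sample)
lengths = ideal_code_lengths(model)

# List symbols from cheapest (most likely) to most expensive (least likely).
for sym, bits in sorted(lengths.items(), key=lambda kv: kv[1]):
    print(repr(sym), round(bits, 2))
```

As with Morse code, the most frequent symbol ends up with the shortest ideal code. A better model (one that looks at context, for example) would sharpen the probabilities and lower the total cost without any change to the coding step.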
The book’s central result is a hard limit from information theory: no universal compressor can shrink every input. A simple counting argument proves it: there are 2^n strings of length n but only 2^n − 1 strictly shorter strings (counting the empty string), so no lossless scheme can map every n-bit input to a distinct shorter output. Any scheme that compresses some inputs must expand others, and random or already-compressed data can’t be squeezed further. The practical ceiling is Shannon’s entropy: a symbol of probability p can’t be coded in fewer than log2(1/p) bits on average. Worked examples on the digits of π, treated as ten equally likely symbols, show Huffman coding reaching 3.4 bits per digit and grouped codes approaching the entropy floor of log2(10) ≈ 3.3219 bits, illustrating how close real codes can get to the theoretical limit without beating it.
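Both the counting argument and the digit example can be checked numerically. The sketch below is not taken from the book: it assumes ten equally likely digits, uses a generic textbook Huffman construction, and all function names are hypothetical. Grouping is modeled by building one Huffman code over every possible block of 1, 2, or 3 digits, which may differ from the exact grouped codes the book works through.

```python
import heapq
import math
from itertools import count

def huffman_lengths(weights):
    """Return the Huffman code length of each symbol, given integer weights."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)
        w2, _, group2 = heapq.heappop(heap)
        merged = group1 + group2
        for i in merged:  # every symbol under the new node gets one bit deeper
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return lengths

def bits_per_digit(group_size):
    """Average Huffman cost per digit when coding blocks of group_size digits."""
    n_symbols = 10 ** group_size  # assumption: all blocks equally likely
    lengths = huffman_lengths([1] * n_symbols)
    return sum(lengths) / n_symbols / group_size

def shorter_strings(n):
    """Count the bit strings of length 0 through n - 1."""
    return sum(2 ** k for k in range(n))

entropy = math.log2(10)  # about 3.3219 bits per digit
for g in (1, 2, 3):
    print(g, round(bits_per_digit(g), 4), ">=", round(entropy, 4))

# Pigeonhole check: always one fewer shorter string than n-bit strings.
print(all(shorter_strings(n) == 2 ** n - 1 for n in range(1, 20)))
```

Under these assumptions the single-digit code costs 3.4 bits per digit, pairs cost 3.36, and triples about 3.3253, each step closer to log2(10) but never below it.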
From these foundations the text surveys the working machinery of the field — RLE, the LZ77 family (deflate, LZMA, LZX, snappy, deduplication), LZW dictionary coding, the Burrows-Wheeler transform and bzip2, predictive and color transforms, Huffman pre-coding, and lossy methods. It remains a widely cited reference because it connects the everyday algorithms behind ZIP and JPEG to the information theory that bounds them, making clear why compression is ultimately a prediction problem with no free lunch.
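The "no free lunch" point is easy to observe with deflate, one of the LZ77-family formats listed above, through Python's standard zlib module. This is a rough demonstration rather than a benchmark: the input sizes, the seed, and the helper name deflate_ratio are arbitrary choices made for illustration.

```python
import random
import zlib

def deflate_ratio(data):
    """Compressed size divided by original size, using zlib's deflate."""
    return len(zlib.compress(data, 9)) / len(data)

n = 100_000
rng = random.Random(0)  # fixed seed so the run is repeatable

# Highly predictable input: the same three bytes over and over.
repetitive = b"abc" * (n // 3)

# Pseudo-random bytes: no structure that deflate's model can exploit.
noise = bytes(rng.getrandbits(8) for _ in range(n))

print("repetitive:", round(deflate_ratio(repetitive), 4))
print("noise:     ", round(deflate_ratio(noise), 4))

# Lossless round trip: decompression restores the input exactly.
assert zlib.decompress(zlib.compress(noise)) == noise
```

The repetitive input shrinks to a tiny fraction of its size, while the pseudo-random input comes out slightly larger than it went in, because the format's own framing adds a few bytes that the compressor cannot win back.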