Epicure: Skip-gram embeddings map the latent geometry of 1,790 cooking ingredients

Researchers built Epicure, three sibling ingredient embedding models trained from scratch on 4.14 million recipes pulled from 11 sources across nine languages. A normalization pipeline backed by an LLM collapses messy free-text ingredient strings down to 1,790 canonical entries, producing a compact vocabulary suitable for representation learning.

The three variants share architecture and hyperparameters but differ in the graph their random walks traverse. Cooc walks a 203,508-edge ingredient co-occurrence graph derived from NPMI (normalized pointwise mutual information) statistics. Chem walks an 80,019-edge typed graph linking ingredients to 2,247 flavor compounds across 15 categories from FlavorDB. Core blends the two by injecting ingredient-ingredient walks at a controlled mixing ratio. The three models therefore sit at different points on the chemistry-versus-context spectrum.
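
To make the graph construction concrete, here is a minimal sketch of how NPMI-weighted co-occurrence edges and a mixed walk corpus could be produced. It is not the Epicure code: the function names (`npmi_edges`, `to_adjacency`, `random_walk`, `mixed_corpus`), the zero NPMI threshold, the uniform neighbor sampling, and the per-walk coin flip used as the mixing rule are all assumptions made for illustration, and the toy recipes and `compound_A` node are invented.

```python
import math
import random
from collections import Counter
from itertools import combinations


def npmi_edges(recipes, threshold=0.0):
    """Return {(a, b): npmi} for ingredient pairs scoring above threshold.

    NPMI(a, b) = log(p(a, b) / (p(a) * p(b))) / -log(p(a, b)), in [-1, 1].
    The threshold value is an assumption, not a published setting.
    """
    n = len(recipes)
    single, pair = Counter(), Counter()
    for recipe in recipes:
        items = sorted(set(recipe))  # dedupe, fix pair order
        single.update(items)
        pair.update(combinations(items, 2))
    edges = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        if p_ab >= 1.0:
            continue  # undefined: -log(1) is zero
        pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
        score = pmi / -math.log(p_ab)
        if score > threshold:
            edges[(a, b)] = score
    return edges


def to_adjacency(edges):
    """Turn undirected edge keys into a node -> neighbors mapping."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def random_walk(adj, start, length, rng):
    """Uniform random walk; stops early at a node without neighbors."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj.get(walk[-1])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


def mixed_corpus(chem_adj, cooc_adj, starts, n_walks, length, mix_ratio, seed=0):
    """Draw each walk from the co-occurrence graph with probability
    mix_ratio, otherwise from the ingredient-compound graph.
    This mixing rule is a hypothetical stand-in for the Core recipe."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_walks):
        adj = cooc_adj if rng.random() < mix_ratio else chem_adj
        corpus.append(random_walk(adj, rng.choice(starts), length, rng))
    return corpus


# Invented toy data, for illustration only.
toy_recipes = [["tomato", "basil"], ["tomato", "basil"],
               ["tomato", "garlic"], ["butter", "sugar"]]
toy_cooc = to_adjacency(npmi_edges(toy_recipes))
toy_chem = {"tomato": ["compound_A"], "basil": ["compound_A"],
            "compound_A": ["tomato", "basil"]}
for walk in mixed_corpus(toy_chem, toy_cooc, ["tomato", "basil"], 4, 5, 0.5):
    print(" ".join(walk))
```

Each walk is a token sequence, so the resulting corpus can be passed to any skip-gram trainer as if it were a list of sentences. Setting `mix_ratio` to 0 or 1 recovers a purely chemical or purely contextual corpus under this sketch.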

The release ships the trained vectors alongside evaluation artifacts: linear probes, ICA factor alignments, Procrustes sensory comparisons, and WEAT bias measurements. Together these invite downstream analysis of how culinary structure emerges from co-occurrence statistics versus molecular chemistry.
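
As one example of what such an artifact measures, the sketch below computes a WEAT effect size over toy vectors. It follows the standard WEAT definition (difference in mean association between two target sets, divided by the standard deviation of associations over both sets) and assumes the sample standard deviation; the function names and the 2-D vectors are invented for illustration and do not reproduce the word sets or settings used for Epicure.

```python
import math
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean cosine to attribute set A minus mean cosine to B."""
    mean_a = statistics.mean([cosine(w, a) for a in attrs_a])
    mean_b = statistics.mean([cosine(w, b) for b in attrs_b])
    return mean_a - mean_b


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Effect size d = (mean s(x) - mean s(y)) / stdev of s over X and Y.

    Uses the sample standard deviation (an assumption; some
    implementations use the population form instead).
    """
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (statistics.mean(s_x) - statistics.mean(s_y)) / statistics.stdev(s_x + s_y)


# Invented 2-D vectors: X leans toward attribute A, Y toward attribute B.
X = [[1.0, 0.0], [0.9, 0.1]]
Y = [[0.0, 1.0], [0.1, 0.9]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]
print(round(weat_effect_size(X, Y, A, B), 3))
```

A positive value means the first target set sits closer to the first attribute set than the second target set does; swapping the target sets flips the sign.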