Mistral's Leanstral 1.5: an open formal-proof model that finds real-world bugs

Mistral has released Leanstral 1.5, an Apache-2.0 model for automated theorem proving in Lean 4. It uses a mixture-of-experts design with 119B total parameters, of which only 6B are active, and was trained in three stages: mid-training, supervised fine-tuning, and reinforcement learning with CISPO. Training ran across two environments: a multiturn prover that iterates against Lean compiler feedback, and a filesystem-based code agent that edits files, runs commands, and queries the Lean language server like a working developer. On benchmarks it saturates miniF2F at 100%, solves 587 of 672 PutnamBench problems, and sets new state-of-the-art marks on the graduate- and PhD-level algebra tests FATE-H (87%) and FATE-X (34%). Its main selling points are cost and scaling: it matches or beats far pricier systems such as Seed-Prover 1.5 at roughly $4 per problem, and its accuracy climbs monotonically as the token budget grows from 25k to 4M tokens per attempt.

The more consequential result for engineers is that a math-focused prover transfers to code verification. Leanstral proved O(log n) time-complexity bounds for a real AVL-tree implementation over a 2.7-million-token run spanning 22 context compactions. It was also used in an automated bug-hunting pipeline in which Aeneas translates Rust to Lean and Leanstral infers intended properties, then tries to prove or disprove them. That pipeline flagged 47 violated properties across 57 repositories; of those, 11 were genuine bugs and 5 had not been reported before. One was an integer-overflow flaw in the zigzag-decoding sign function of the datrs/varinteger library: an edge case at u64::MAX that panics in debug builds and silently returns a corrupted value in release builds, the kind of thing fuzzing typically misses.
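
To make the overflow concrete, here is a minimal Rust sketch of how a zigzag decoder can go wrong at the top of the u64 range. It is not the library's source: the function names `sign_naive_release`, `sign_naive_checked`, and `sign_bitwise` are hypothetical, and the naive `(value + 1) / 2` formula is an assumption about how such a decoder might plausibly be written. The sketch uses `wrapping_add` to imitate release-build behavior and `checked_add` to expose the condition under which a debug build would panic.

```rust
// Illustrative only: these functions are hypothetical and are NOT copied
// from datrs/varinteger. They model one common way to write zigzag decoding.

/// Naive zigzag decode: odd inputs map to negatives via (value + 1) / 2.
/// `wrapping_add` mimics what plain `+` does in a release build, where
/// integer overflow wraps around instead of panicking.
fn sign_naive_release(value: u64) -> i64 {
    if value & 1 != 0 {
        -((value.wrapping_add(1) / 2) as i64)
    } else {
        (value / 2) as i64
    }
}

/// Same formula, but `checked_add` reports the overflow as `None`.
/// This is the condition under which a debug build would panic.
fn sign_naive_checked(value: u64) -> Option<i64> {
    if value & 1 != 0 {
        value.checked_add(1).map(|v| -((v / 2) as i64))
    } else {
        Some((value / 2) as i64)
    }
}

/// Overflow-free zigzag decode using only a shift, a mask, and XOR.
fn sign_bitwise(value: u64) -> i64 {
    ((value >> 1) as i64) ^ -((value & 1) as i64)
}

fn main() {
    // Ordinary inputs: the naive and bitwise decoders agree.
    for v in 0u64..5 {
        assert_eq!(sign_naive_release(v), sign_bitwise(v));
    }

    // Edge case: u64::MAX should decode to i64::MIN.
    println!("checked: {:?}", sign_naive_checked(u64::MAX)); // None (overflow detected)
    println!("release: {}", sign_naive_release(u64::MAX)); // 0 (silently wrong)
    println!("bitwise: {}", sign_bitwise(u64::MAX)); // i64::MIN (correct)
}
```

In this sketch the naive decoder disagrees with the bitwise one at a single input out of 2^64, namely u64::MAX. A prover that reasons over the whole input domain has to confront that case, whereas randomly sampled inputs are very unlikely to hit it.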

The significance is less about leaderboard numbers than about accessibility. Formal verification has historically been expensive and specialist-only; shipping a competitive prover under a permissive license, with free weights on Hugging Face and a free API, lowers the barrier to applying machine-checked correctness proofs to production code. The bug-discovery results suggest formal methods are becoming a practical complement to testing and fuzzing rather than a purely academic exercise.