Needle: A 26M-Parameter Model Distilled from Gemini for On-Device Tool Calling
Cactus Compute has released Needle, a 26-million-parameter model distilled from Gemini 3.1 that specializes in single-shot function calling. Built on a new “Simple Attention Network” architecture with a 12-layer encoder feeding an 8-layer decoder through cross-attention, the model was pretrained on 200B tokens across 16 TPU v6e chips in 27 hours, then post-trained on 2B tokens of function-call data in 45 minutes. Weights, dataset generation code, and tooling are open source.
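For orientation, here is a minimal PyTorch sketch of that encoder-decoder shape. Only the layer counts (a 12-layer encoder feeding an 8-layer decoder through cross-attention) come from the announcement; the hidden size, head count, feed-forward width, and vocabulary below are placeholder guesses, chosen so the total happens to land near 26M parameters, and the actual "Simple Attention Network" internals may differ.

```python
import torch
import torch.nn as nn

# Layer counts (12 encoder, 8 decoder) are from the announcement; every
# dimension below is a guess that happens to total ~26M parameters.
D_MODEL, N_HEADS, D_FF, VOCAB = 256, 8, 1024, 16_000

class NeedleSketch(nn.Module):
    """Encoder-decoder: the prompt is encoded once, the decoder cross-attends to it."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, N_HEADS, D_FF, batch_first=True),
            num_layers=12)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, N_HEADS, D_FF, batch_first=True),
            num_layers=8)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, prompt_ids: torch.Tensor, out_ids: torch.Tensor) -> torch.Tensor:
        memory = self.encoder(self.embed(prompt_ids))   # "prefill": encode the prompt once
        causal = nn.Transformer.generate_square_subsequent_mask(out_ids.size(1))
        hidden = self.decoder(self.embed(out_ids), memory,  # cross-attention to the prompt
                              tgt_mask=causal)
        return self.lm_head(hidden)                     # next-token logits

model = NeedleSketch()
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
# -> 26.1M with these placeholder dimensions
```

One plausible motivation for an encoder-decoder split in this setting, speculatively: the long part of a function-calling prompt (system text plus tool schemas) is encoded in a single pass, while only the short structured call is generated token by token.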
In benchmarks on narrow tool-calling tasks, Needle outperforms small models an order of magnitude larger, including FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M, while running at 6,000 tokens/sec prefill and 1,200 tokens/sec decode on Cactus's runtime. The tradeoff is scope: Needle is not a conversational model, and the team warns that small models can be brittle outside their training distribution.
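Taken at face value, those throughput figures put a typical call comfortably in interactive range. A back-of-envelope check (the 500-token prompt and 40-token call are hypothetical sizes, not figures from the article):

```python
# Rough end-to-end latency from the quoted throughput numbers.
PREFILL_TPS, DECODE_TPS = 6000, 1200   # tokens/sec, as quoted
prompt_tokens, call_tokens = 500, 40   # hypothetical request sizes

latency_s = prompt_tokens / PREFILL_TPS + call_tokens / DECODE_TPS
print(f"{latency_s * 1000:.0f} ms end-to-end")  # -> 117 ms
```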
The release targets on-device AI for phones, watches, and glasses, with a local playground UI for testing and one-click fine-tuning on custom tool schemas. It's positioned as an experimental probe into whether tiny, task-specific models can replace cloud LLM calls for the agentic plumbing layer.
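To make "single-shot function calling" concrete, the task shape looks roughly like the following. The schema and output format here follow the common JSON-style tool-calling convention and are purely illustrative; the summary does not specify Needle's actual I/O format or field names.

```python
# Hypothetical example of the task: one tool schema and one utterance in,
# exactly one structured function call out, with no multi-turn dialogue.
tool_schema = {
    "name": "set_timer",
    "description": "Start a countdown timer on the device.",
    "parameters": {
        "type": "object",
        "properties": {
            "minutes": {"type": "integer", "description": "Duration in minutes"},
        },
        "required": ["minutes"],
    },
}

user_utterance = "remind me to take the bread out in 25 minutes"

# What a single-shot function-calling model is expected to emit:
expected_call = {"name": "set_timer", "arguments": {"minutes": 25}}
```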