Researchers Propose 'Sleep' Cycles to Fix Transformer Long-Context Limits

A new arXiv paper tackles the well-known scaling problem of transformer attention over long contexts by taking a page from biology. The authors propose a consolidation phase in which the model periodically pauses, runs N offline recurrent passes over the accumulated context, and writes the results into persistent fast weights inside its state-space model (SSM) blocks. The key-value cache is then cleared, shifting heavy computation to these 'sleep' windows while keeping wake-time inference latency unchanged.
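
To make the wake/sleep split concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. The class name ToySleepModel, its methods, the learning rate, and the choice of a delta-rule update are all assumptions made for illustration, since the summary above does not specify the actual update rule or architecture. The sketch only shows the general shape of the idea: context accumulates in a cache while awake, N offline passes fold it into a persistent fast-weight matrix, and the cache is then cleared.

```python
class ToySleepModel:
    """Hypothetical toy: a key-value cache plus a persistent fast-weight matrix."""

    def __init__(self, dim, lr=0.5):
        self.dim = dim
        self.lr = lr  # assumed step size for the illustrative update rule
        self.fast_weights = [[0.0] * dim for _ in range(dim)]
        self.kv_cache = []  # (key, value) pairs gathered during "wake" time

    def observe(self, key, value):
        """Wake phase: just store the pair; no consolidation work happens here."""
        self.kv_cache.append((list(key), list(value)))

    def read(self, query):
        """Read from fast weights only: returns W @ query."""
        return [sum(w * q for w, q in zip(row, query)) for row in self.fast_weights]

    def sleep(self, n_passes):
        """Sleep phase: n_passes offline sweeps over the cache, then clear it.

        Uses a delta rule, W += lr * (v - W k) k^T, as a stand-in for a
        "local" update: each weight changes based only on its own input
        component and its own output error.
        """
        for _ in range(n_passes):
            for key, value in self.kv_cache:
                pred = self.read(key)
                for i in range(self.dim):
                    err = value[i] - pred[i]
                    for j in range(self.dim):
                        self.fast_weights[i][j] += self.lr * err * key[j]
        self.kv_cache.clear()


def recall_error(model, key, value):
    """Largest absolute gap between the stored value and the fast-weight readout."""
    return max(abs(v - p) for v, p in zip(value, model.read(key)))


# Usage: store two associations, consolidate, then query with an empty cache.
pairs = [([1.0, 0.0], [0.5, -1.0]), ([0.0, 1.0], [2.0, 0.0])]
model = ToySleepModel(dim=2)
for k, v in pairs:
    model.observe(k, v)
model.sleep(n_passes=8)
print(recall_error(model, *pairs[0]))
```

In this toy setting, more passes shrink the recall error because each sweep moves the fast weights closer to the stored associations. That is a property of the delta rule on this tiny example and should not be read as a reproduction of the paper's results.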

On benchmarks including cellular automata, multi-hop graph retrieval, and a math reasoning task that defeats both standard transformers and SSM-attention hybrids, the sleep-augmented model succeeds where baselines fail. Longer sleep durations yield better results, with the steepest improvements on problems demanding deeper reasoning chains.

The approach reframes the context-length tradeoff: rather than expanding attention windows or stacking retrieval hacks, it treats consolidation as a first-class operation, learned via a local update rule. If the technique generalizes, it points toward architectures that amortize reasoning across explicit offline phases instead of cramming everything into a single forward pass.