Simplified Sparse Attention via Gist Tokens

The paper introduces Simplified Sparse Attention (SSA), a method that uses learned "gist tokens" to compress the key-value (KV) cache, reducing the memory and computation of the attention mechanism while maintaining model quality.

Background

- The key idea: **gist tokens** are a small set of special tokens that summarize a text segment, so that attention in a Transformer operates over a much smaller matrix than the full one. This cuts compute cost substantially while keeping long-context performance.
- **Sparse attention** is an established technique for making Transformers handle long sequences efficiently by attending to only a subset of tokens. The paper proposes a simpler, learnable alternative to handcrafted sparse patterns.
- **Transformer models** (such as GPT, BERT, and Llama) use an attention mechanism that compares every token with every other token. The cost grows as O(n²) in the sequence length n, which becomes prohibitively expensive for very long inputs (e.g., entire books or code repositories).
- The paper claims that this method outperforms prior sparse attention techniques on standard benchmarks and is easier to implement, with implications for scaling the context windows of large language models (LLMs) cost-effectively.
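
To make the O(n²) argument concrete, the sketch below counts query-key pairs for full causal attention and for a gist-style layout. The layout is an assumption made for illustration, not the paper's specification: the input is split into fixed-length segments, each segment is followed by a few gist tokens, and a token attends only to its own segment and to the gist tokens of earlier segments. The names `full_attention_pairs`, `gist_attention_pairs`, `segment_len`, and `gists_per_segment` are hypothetical.

```python
def full_attention_pairs(n: int) -> int:
    """Query-key pairs in causal full attention over n tokens."""
    return n * (n + 1) // 2


def gist_attention_pairs(n: int, segment_len: int, gists_per_segment: int) -> int:
    """Query-key pairs under an assumed gist-token layout.

    Assumption (not from the paper): each segment of `segment_len` tokens is
    followed by `gists_per_segment` gist tokens. Ordinary tokens attend
    causally within their segment and to all earlier gist tokens. Gist tokens
    attend to their own segment, to earlier gists, and causally to each other.
    """
    pairs = 0
    earlier_gists = 0
    for start in range(0, n, segment_len):
        seg = min(segment_len, n - start)
        # Causal attention inside the segment.
        pairs += seg * (seg + 1) // 2
        # Each token in the segment also sees every earlier gist token.
        pairs += seg * earlier_gists
        # Gist tokens read the segment and the earlier gists ...
        pairs += gists_per_segment * (seg + earlier_gists)
        # ... and attend causally among themselves.
        pairs += gists_per_segment * (gists_per_segment + 1) // 2
        earlier_gists += gists_per_segment
    return pairs


if __name__ == "__main__":
    n = 4096
    full = full_attention_pairs(n)
    gist = gist_attention_pairs(n, segment_len=256, gists_per_segment=4)
    print(f"full: {full}, gist: {gist}, ratio: {full / gist:.1f}x")
```

Under these assumptions the saving comes from the cross-segment term: it scales with the number of gist tokens rather than with the number of earlier tokens. With a single segment and no gist tokens the count reduces to the full causal case, which is a quick sanity check on the arithmetic.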