SubQ 1.1 Card: Linear-scaling sparse attention with 98% retrieval at 12M tokens [pdf]

SubQ 1.1 introduces a linear-scaling sparse attention mechanism that maintains 98% retrieval accuracy at 12 million tokens, extending the usable context length of large language models at lower computational cost than full attention.

A brief history of KV cache compression developments
KV cache compression techniques, including Multi-Query Attention (MQA), Grouped-Query Attention (GQA), Multi-head Latent Attention (MLA), and linear-attention hybrids, have evolved to reduce memory overhead in large language models. These developments have quietly enabled the long context windows required for modern agentic LLM applications by making key-value caching more efficient.
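
As a rough illustration of why the number of key-value heads matters, the sketch below estimates per-sequence KV cache size for standard multi-head attention (MHA), GQA, and MQA. The model shape (32 layers, 32 query heads, head dimension 128, 128,000 tokens, 16-bit values) and the GQA group count of 8 are assumptions chosen for illustration, not figures from the linked article. MLA and linear-attention hybrids compress the cache differently and are not modeled here.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV cache size in bytes for one sequence."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, for illustration only.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 128_000

kv_heads_by_scheme = {
    "MHA": N_QUERY_HEADS,  # one KV head per query head
    "GQA": 8,              # assumed: 8 KV heads shared across 32 query heads
    "MQA": 1,              # a single KV head shared by all query heads
}

for scheme, kv_heads in kv_heads_by_scheme.items():
    gib = kv_cache_bytes(N_LAYERS, kv_heads, HEAD_DIM, SEQ_LEN) / 2**30
    print(f"{scheme}: {gib:.2f} GiB per sequence")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads, so GQA with 8 heads needs a quarter of the MHA cache and MQA needs 1/32 of it.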
