TOPIC

A brief history of KV cache compression developments

0.4

KV cache compression techniques, including Multi-Query Attention (MQA), Grouped-Query Attention (GQA), Multi-head Latent Attention (MLA), and linear-attention hybrids, have evolved to reduce memory overhead in large language models. These developments have quietly enabled the long context windows required for modern agentic LLM applications by making key-value caching more efficient.
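As a rough illustration of why sharing key/value heads reduces memory, the sketch below estimates KV cache size for standard multi-head attention, GQA, and MQA. The model shape (32 layers, 32 query heads, head dimension 128, 128K-token context, 16-bit values) is a hypothetical example, not taken from any model mentioned here, and the formula covers only per-head key/value caching; it does not model MLA or linear-attention hybrids.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate the bytes needed to cache keys and values for one sequence."""
    # Leading factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
N_LAYERS = 32
N_QUERY_HEADS = 32
HEAD_DIM = 128
SEQ_LEN = 128 * 1024  # 128K tokens

# MHA: every query head has its own key/value head.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# GQA: query heads share key/value heads in groups (here, 8 KV heads).
gqa = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN)
# MQA: all query heads share a single key/value head.
mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.0f} GiB")
# MHA: 64 GiB
# GQA: 16 GiB
# MQA: 2 GiB
```

Under these assumptions the cache shrinks in direct proportion to the number of key/value heads, which is the saving that MQA and GQA trade against model quality.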

5 items | 2 sources | First seen Jun 16 | Last activity Jun 16

Sources

hn (4), martinalderson-com (1)

GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

A new Transformer implementation called GateGPT achieves 56,000 tokens per second using KV cache on an FPGA running at 80 MHz.

hn | Jun 16 | tech

4.0

SubQ 1.1 Card: Linear-scaling sparse attention with 98% retrieval at 12M tokens [pdf]

SubQ 1.1 introduces a linear-scaling sparse attention mechanism that maintains 98% retrieval accuracy at 12 million tokens, extending usable context length for large language models while reducing compute relative to full attention.

hn | Jun 16 | tech

5.0

Luce KVFlash: 256K context with 72MiB of KV cache on the GPU

Luce KVFlash is a memory-efficient optimization enabling 256K context windows using only 72 MiB of KV cache on the GPU. It reduces memory consumption for long-sequence inference by compressing key-value cache storage.

hn | Jun 16 | tech

7.5


Timeline

June 16, 2026

Subquadratic – Introducing SubQ 1.1 Small
2.0
Subquadratic released SubQ 1.1 Small, a 1.5B open-weight language model using a soft-moe-2x8 architecture. It outperforms larger models like Gemma 2 2.6B and Phi-2 2.8B on several benchmarks. The model uses subquadratic soft-MoE layers (MMA and MMAM) for improved efficiency.
hn | Jun 16, 2026 | #Tech

June 15, 2026

A brief history of KV cache compression developments
5.0
KV cache compression techniques, including Multi-Query Attention (MQA), Grouped-Query Attention (GQA), Multi-head Latent Attention (MLA), and linear-attention hybrids, have evolved to reduce memory overhead in large language models. These developments have quietly enabled the long context windows required for modern agentic LLM applications by making key-value caching more efficient.
martinalderson-com | Jun 15, 2026 | #Tech
