GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

A brief history of KV cache compression developments

5.0

KV cache compression techniques, including Multi-Query Attention (MQA), Grouped-Query Attention (GQA), Multi-head Latent Attention (MLA), and linear-attention hybrids, have evolved to reduce memory overhead in large language models. These developments have quietly enabled the long context windows required for modern agentic LLM applications by making key-value caching more efficient.

GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

Related stories

A brief history of KV cache compression developments