GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz
GateGPT, a new Transformer implementation, runs on an FPGA clocked at 80 MHz and reaches 56,000 tokens per second with a KV cache.
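To put the headline numbers in perspective, the two figures quoted (80 MHz and 56,000 tokens per second) imply an average clock-cycle budget per token. The sketch below is just that division; it assumes the throughput is sustained and measured against that single clock, which the summary does not confirm, and the function name is made up for illustration.

```python
def cycles_per_token(clock_hz: float, tokens_per_second: float) -> float:
    """Average clock cycles available per generated token."""
    return clock_hz / tokens_per_second


# Figures taken from the headline: 80 MHz clock, 56,000 tokens per second.
budget = cycles_per_token(80e6, 56_000)
print(f"{budget:.0f} cycles per token")  # prints "1429 cycles per token"
```

A budget of roughly 1,400 cycles per token says nothing about model size or how the design is pipelined; it is only the ratio of the two published numbers.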
KV cache compression techniques, including Multi-Query Attention (MQA), Grouped-Query Attention (GQA), Multi-head Latent Attention (MLA), and linear-attention hybrids, have evolved to reduce the memory overhead of large language models. By making key-value caching cheaper, they have quietly enabled the long context windows that modern agentic LLM applications require.
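As a rough illustration of why these techniques matter, the sketch below estimates KV cache size per token for standard multi-head attention, GQA, and MQA. The model shape (32 layers, 32 query heads, head dimension 128, 2-byte values) is a hypothetical configuration chosen for round numbers, not a description of GateGPT or any specific model. MLA and linear-attention hybrids store different state and are not covered by this formula.

```python
def kv_cache_bytes_per_token(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumption: fp16/bf16 cache entries
) -> int:
    """Bytes of KV cache added per token across all layers."""
    # Leading factor of 2: one key vector and one value vector per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, for illustration only.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM = 32, 32, 128

variants = {
    "MHA (32 KV heads)": N_QUERY_HEADS,  # one KV head per query head
    "GQA (8 KV heads)": 8,               # 4 query heads share each KV head
    "MQA (1 KV head)": 1,                # all query heads share one KV head
}

for name, n_kv_heads in variants.items():
    size = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads, HEAD_DIM)
    print(f"{name}: {size // 1024} KiB per token")
```

Under these assumptions the cache shrinks from 512 KiB per token with full multi-head attention to 128 KiB with GQA and 16 KiB with MQA, which is the kind of reduction that makes long context windows fit in memory.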