
From Hacker News


Luce KVFlash: 256K context on a GPU with a 72 MiB KV cache

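Luce KVFlash is an inference optimization for large language models that supports a context window of up to 256K tokens while using only 72 MiB of KV cache on the GPU. The technique substantially reduces GPU memory usage, making long-sequence inference feasible on consumer graphics cards while preserving inference speed and model quality.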
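To put the headline numbers in perspective, the short calculation below divides the stated 72 MiB budget by the stated 256K-token context. It is only arithmetic on the two figures in the summary, assuming "256K" means 256 × 1024 tokens. It says nothing about how Luce KVFlash actually lays out or stores its cache, which the summary does not describe.

```python
# Back-of-the-envelope arithmetic on the two figures quoted in the summary.
# Assumption: "256K" means 256 * 1024 tokens and "72 MiB" means 72 * 1024**2 bytes.

MIB = 1024 * 1024


def bytes_per_token(budget_bytes: int, context_tokens: int) -> float:
    """Average GPU-resident KV-cache bytes available per context token."""
    return budget_bytes / context_tokens


budget = 72 * MIB            # 75,497,472 bytes
context = 256 * 1024         # 262,144 tokens
per_token = bytes_per_token(budget, context)

print(f"{per_token:.0f} bytes of GPU KV cache per token")  # 288 bytes
```

An average of 288 bytes per token is far below what a conventional per-layer fp16 cache stores, so the technique must be doing something other than keeping full keys and values for every token on the GPU.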

Related coverage

A brief history of KV cache compression developments
KV cache compression techniques, including Multi-Query Attention (MQA), Grouped-Query Attention (GQA), Multi-head Latent Attention (MLA), and linear-attention hybrids, have evolved to reduce memory overhead in large language models. These developments have quietly enabled the long context windows required for modern agentic LLM applications by making key-value caching more efficient.
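The effect of sharing key/value heads can be illustrated with the standard size estimate for an uncompressed cache: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below applies that formula to a hypothetical model; the layer count, head counts, head dimension and fp16 storage are illustrative assumptions, not figures from any model mentioned here. MLA and linear-attention hybrids change the stored representation itself and are not modelled.

```python
# Rough KV-cache size for attention variants that differ only in the
# number of key/value heads. All model dimensions are hypothetical.

def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Uncompressed cache size: keys + values for every layer and token."""
    # Leading factor 2 = one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


LAYERS, Q_HEADS, HEAD_DIM = 32, 32, 128   # hypothetical model shape
TOKENS = 256 * 1024                       # 256K-token context

variants = {
    "MHA": Q_HEADS,        # one KV head per query head
    "GQA (8 groups)": 8,   # query heads share 8 KV heads
    "MQA": 1,              # all query heads share a single KV head
}

sizes_gib = {
    name: kv_cache_bytes(TOKENS, LAYERS, kv_heads, HEAD_DIM) / 2**30
    for name, kv_heads in variants.items()
}

for name, gib in sizes_gib.items():
    print(f"{name:>15}: {gib:6.1f} GiB")
```

Under these assumptions the cache shrinks in direct proportion to the number of KV heads: 128 GiB for MHA, 32 GiB for GQA with 8 groups, and 4 GiB for MQA.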