Entropy-Guided KV Caching for Efficient LLM Inference
Heekyum Kim and Yuchul Jung
Additional contact information
Heekyum Kim: Department of Computer Engineering, Kumoh National Institute of Technology, Gumi-si 39177, Republic of Korea
Yuchul Jung: Department of AI Engineering, Kumoh National Institute of Technology, Gumi-si 39177, Republic of Korea
Mathematics, 2025, vol. 13, issue 15, 1-14
Abstract:
Large language models (LLMs), built upon Transformer architectures, have demonstrated remarkable performance in a wide range of natural language processing tasks. However, their practical deployment, especially in long-context scenarios, is often hindered by the computational and memory costs of managing the key–value (KV) cache during inference. Optimizing this process is therefore crucial for improving LLM efficiency and scalability. In this study, we propose a novel entropy-guided KV caching strategy that leverages the distribution characteristics of attention scores within each Transformer layer. Specifically, we compute the entropy of the attention weights for each head and use the average entropy of all heads within a layer to assess the layer's contextual importance. Higher-entropy layers, those exhibiting broader attention dispersion, are allocated larger KV cache budgets, while lower-entropy (sink-like) layers are assigned smaller budgets. Instead of selecting different key–value tokens per head, our method selects a common set of important tokens per layer, based on aggregated attention scores, and caches them uniformly across all heads within the same layer. This design preserves the structural integrity of multi-head attention while enabling efficient token selection during the prefilling phase. The experimental results demonstrate that our approach improves cache utilization and inference speed without compromising generation quality. For example, on the Qwen3 4B model, our method reduces memory usage by 4.18% while preserving the ROUGE score, and on Mistral 7B v0.1, it reduces decoding time by 46.6%, highlighting entropy-guided layer analysis as a principled mechanism for scalable long-context language modeling.
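To make the layer-level budgeting concrete, the sketch below (PyTorch-style pseudocode, not the authors' released implementation) computes the Shannon entropy of each head's attention distribution during prefilling, averages it per layer, splits a total KV-cache budget across layers in proportion to those averages, and keeps one shared top-k set of key positions per layer based on attention scores aggregated over heads and query positions. Function names such as allocate_layer_budgets and select_layer_tokens, and details like the minimum per-layer budget, are illustrative assumptions rather than the paper's exact procedure.

```python
import torch


def attention_entropy_per_layer(attn_weights: torch.Tensor) -> float:
    """Mean Shannon entropy of one layer's attention distributions.

    attn_weights: softmaxed attention of shape [num_heads, q_len, kv_len].
    The mean is taken over heads and query positions, giving one scalar
    that reflects how dispersed the layer's attention is.
    """
    eps = 1e-12
    entropy = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)  # [num_heads, q_len]
    return entropy.mean().item()


def allocate_layer_budgets(layer_entropies, total_budget, min_tokens=8):
    """Split a total KV-cache token budget across layers in proportion to their
    mean attention entropy, so low-entropy (sink-like) layers keep fewer tokens.
    The `min_tokens` floor is an assumed safeguard, not taken from the paper."""
    ent = torch.tensor(layer_entropies, dtype=torch.float32)
    weights = ent / ent.sum()
    budgets = (weights * total_budget).floor().long().clamp(min=min_tokens)
    return budgets.tolist()


def select_layer_tokens(attn_weights, budget):
    """Choose one common set of important key positions for the whole layer:
    aggregate the attention each key position receives over all heads and
    query positions, then keep the top-`budget` positions for every head."""
    scores = attn_weights.sum(dim=(0, 1))   # [kv_len]
    budget = min(budget, scores.numel())
    keep = torch.topk(scores, budget).indices
    return torch.sort(keep).values          # keep the retained cache in positional order
```

In use, these routines would run once per layer at the end of the prefilling phase, and the returned indices would be used to slice that layer's cached key and value tensors before decoding begins.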
Keywords: LLM; KV cache; transformer; LLM inference optimization; attention entropy; memory-efficient caching
JEL-codes: C
Date: 2025
Downloads: (external link)
https://www.mdpi.com/2227-7390/13/15/2366/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/15/2366/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:15:p:2366-:d:1708520