Seetharami Seelam, Apoorve Mohan, et al.
ISCA 2023
The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. Inference workloads such as Retrieval-Augmented Generation (RAG) and agents exhibit high cache reusability, making efficient caching critical to reducing redundancy and improving inference speed. We analyze real-world KVC access patterns using publicly available traces and evaluate widely deployed key-value stores such as Redis, as well as state-of-the-art RDMA-based systems (CHIME [1] and Sherman [2]), for KVC metadata management. While CHIME and Sherman outperform Redis in search efficiency, they suffer from highly variable tail latency (p99) during KVC block metadata retrieval because they are not optimized for LLM-specific KVC access patterns. Our work identifies key challenges in this domain, underscores the need for a specialized RDMA-enabled distributed caching system whose metadata management is tailored to LLM workload patterns, and provides insights into designing KVC management systems for scalable, low-latency inference.
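To make the metadata-management workload concrete, below is a minimal sketch of prefix-based KVC block lookup over Redis, in the spirit of the Redis baseline the abstract describes. The block size, key layout, and metadata fields (node, offset, len) are assumptions for illustration, not the paper's actual schema; the per-block GET chain is exactly the retrieval path whose tail latency (p99) the study measures.

```python
# Hypothetical KVC block metadata store on Redis (not the paper's schema).
import hashlib
import redis

BLOCK_SIZE = 16  # assumed tokens per KVC block

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def block_key(tokens):
    """Key a KVC block by a hash of the full token prefix ending at it."""
    digest = hashlib.sha256(",".join(map(str, tokens)).encode()).hexdigest()
    return f"kvc:{digest}"

def register_blocks(prompt_tokens, node, base_offset):
    """Record metadata for every full block of a cached prompt prefix."""
    for end in range(BLOCK_SIZE, len(prompt_tokens) + 1, BLOCK_SIZE):
        r.hset(block_key(prompt_tokens[:end]), mapping={
            "node": node,                          # cache server holding the block
            "offset": base_offset + end - BLOCK_SIZE,
            "len": BLOCK_SIZE,
        })

def longest_cached_prefix(prompt_tokens):
    """Walk block boundaries, returning metadata for the longest cached prefix."""
    hits = []
    for end in range(BLOCK_SIZE, len(prompt_tokens) + 1, BLOCK_SIZE):
        meta = r.hgetall(block_key(prompt_tokens[:end]))
        if not meta:
            break  # first miss ends the reusable prefix
        hits.append(meta)
    return hits
```

Note that each block lookup is a separate round trip, so retrieval latency grows with prefix length; this is one reason RDMA-based index structures look attractive, and why their tail-latency behavior under LLM access patterns matters.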
Pooja Aggarwal, Ajay Gupta, et al.
ICSOC 2020
Kfir Toledo, Pravein Govindan Kannan, et al.
CLOUD 2025
Pratik Mishra, Caner Gözübüyük, et al.
IAAI 2026