Every LLM request carries invisible state: the KV-cache. Hit it, and your response is 10x cheaper and 50x faster. Miss it, and you're recomputing work you just did. Yet Kubernetes' default load balancing is cache-blind, scattering related requests across pods and destroying locality. The result? Your AI workloads are slower and vastly more expensive than they should be.
In this hands-on tutorial, we’ll fix that.
Attendees will deploy a distributed vLLM cluster, benchmark its performance, and visualize how cache-blind routing wastes GPU cycles. Then, we’ll replace the default Service with the Kubernetes Gateway API (Inference Extension) and deploy llm-d, a Kubernetes-native framework for distributed LLM inference with an AI-aware scheduler. By re-running the same benchmarks, you’ll see latency drop and throughput climb as prefix reuse becomes a first-class routing signal. You’ll leave with a working lab, dashboards, and a mental model for building cache-aware routing into any production AI stack.
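The core intuition can be shown without a cluster. Below is a toy simulation, not llm-d's actual scheduler: it compares KV-cache hit rates when requests that share a prompt prefix are routed round-robin (like a default Kubernetes Service) versus pinned to one pod by a hash of their prefix. The pod names, workload shape, and hash-based router are illustrative assumptions.

```python
# Toy simulation of cache-blind vs. cache-aware routing.
# Assumption: each pod "caches" the prompt prefixes it has already served,
# standing in for vLLM's KV-cache prefix reuse.
import hashlib
from itertools import cycle

PODS = ["vllm-0", "vllm-1", "vllm-2", "vllm-3"]

def run(route):
    """Route a synthetic chat workload and return the cache hit rate."""
    caches = {p: set() for p in PODS}   # prefixes each pod has cached
    hits = total = 0
    # 7 chat sessions, 10 follow-up turns each; every turn of a session
    # reuses that session's shared prompt prefix.
    for turn in range(10):
        for session in range(7):
            prefix = f"session-{session}"
            pod = route(prefix)
            total += 1
            if prefix in caches[pod]:
                hits += 1               # prefill skipped: KV-cache hit
            caches[pod].add(prefix)
    return hits / total

def make_round_robin():
    """Cache-blind routing: scatter requests evenly across pods."""
    it = cycle(PODS)
    return lambda _prefix: next(it)

def prefix_affinity(prefix):
    """Cache-aware routing: the same prefix always lands on the same pod."""
    h = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
    return PODS[h % len(PODS)]

print(f"round-robin hit rate:     {run(make_round_robin()):.0%}")
print(f"prefix-affinity hit rate: {run(prefix_affinity):.0%}")
```

With this workload, round-robin smears each session's prefix across all four pods (each pod pays its own cold miss), while prefix affinity misses only once per session; the real scheduler additionally weighs load and queue depth, which this sketch deliberately omits.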
Evaline Ju, Kelly Abuelsaad
KubeCon EU 2026