Tutorial

KV-Cache Wins You Can Feel: Building AI-Aware LLM Routing on Kubernetes

Abstract

Every LLM request carries invisible state: the KV-cache. Hit it, and your response is 10x cheaper and 50x faster. Miss it, and you’re recomputing work you just did. Yet Kubernetes’ default load balancing is cache-blind, scattering related requests across pods and destroying locality. The result? Your AI workloads are slower and vastly more expensive than they should be.

In this hands-on tutorial, we’ll fix that.

Attendees will deploy a distributed vLLM cluster, benchmark its performance, and visualize how cache-blind routing wastes GPU cycles. Then, we’ll replace the default Service with the Kubernetes Gateway API (Inference Extension) and deploy llm-d, a Kubernetes-native framework for distributed LLM inference with an AI-aware scheduler. By re-running the same benchmarks, you’ll watch latency drop and throughput climb as prefix reuse becomes a first-class routing signal. You’ll leave with a working lab, dashboards, and a mental model for building cache-aware routing into any production AI stack, sketched in miniature below.
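
To make the mental model concrete, here is a minimal sketch of the core idea behind cache-aware routing. It is not llm-d's or the Inference Extension's actual scheduler; the pod names, the fixed prefix-block length, and all helper names are illustrative assumptions. The point is simply that requests sharing a prompt prefix should land on the pod whose KV-cache is already warm, with a fall-back to the least-loaded pod on a miss.

    import hashlib
    from collections import defaultdict

    # Assumption: approximate a KV-cache block by a fixed prompt-prefix length.
    BLOCK_CHARS = 512


    def prefix_key(prompt: str) -> str:
        """Hash the leading prompt block; requests sharing it can reuse KV-cache."""
        return hashlib.sha256(prompt[:BLOCK_CHARS].encode()).hexdigest()


    class CacheAwareRouter:
        """Toy router: prefer the pod that last served this prefix (warm cache)."""

        def __init__(self, pods: list[str]):
            self.pods = pods
            self.in_flight = defaultdict(int)        # crude per-pod load signal
            self.prefix_to_pod: dict[str, str] = {}  # prefix hash -> pod with warm cache

        def pick_pod(self, prompt: str) -> str:
            key = prefix_key(prompt)
            pod = self.prefix_to_pod.get(key)
            if pod is None:
                # Cache miss: send to the least-loaded pod, then remember the mapping
                # so follow-up requests with the same prefix hit its warm KV-cache.
                pod = min(self.pods, key=lambda p: self.in_flight[p])
                self.prefix_to_pod[key] = pod
            self.in_flight[pod] += 1
            return pod

        def release(self, pod: str) -> None:
            self.in_flight[pod] -= 1


    # Usage: two requests sharing a long system prompt land on the same pod.
    router = CacheAwareRouter(["vllm-0", "vllm-1", "vllm-2"])
    shared = "You are a helpful assistant. " * 30
    print(router.pick_pod(shared + "Summarize this document."))
    print(router.pick_pod(shared + "Now translate it."))

A round-robin Service has no equivalent of prefix_to_pod, which is exactly why it scatters related requests; the tutorial's AI-aware scheduler adds this locality signal (alongside real load and capacity signals) at the Gateway layer.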