Talk

Cloud Native Theater | Istio Day: Running State of the Art Inference with Istio and LLM-D

Abstract

Large language model workloads are hard to run efficiently: GPU memory is limited, traffic patterns shift quickly, and requests benefit from stateful routing. llm-d is a distributed inference system built to solve these problems with the help of the Gateway API Inference Extension. It introduces several techniques—cache-aware request routing, prefill/decode split execution, locality-aware scheduling, and dynamic worker scoring—that improve throughput and reduce latency for real-world inference traffic.

In this session we'll explain at a high level why these techniques make sense in the first place and how you can take advantage of them with your existing Istio installation. Istio can be your gateway (controller) to efficient inference serving with llm-d.