Large language model workloads are hard to run efficiently: GPU memory is limited, traffic patterns shift quickly, and requests benefit from stateful routing. llm-d is a distributed inference system built to solve these problems with the help of the Gateway API Inference Extension. It introduces several techniques (cache-aware request routing, prefill/decode split execution, locality-aware scheduling, and dynamic worker scoring) that improve throughput and reduce latency for real-world inference traffic.
In this session we'll explain at a high level why these techniques make sense in the first place and how you can take advantage of them with your existing Istio installation. Istio can be your gateway (controller) to efficient inference serving with llm-d.
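To make the routing story concrete, the Gateway API Inference Extension models a group of model-server pods as an InferencePool whose endpoint selection is delegated to an external picker, which is where scoring techniques like cache-aware routing plug in. The following is a minimal, hedged sketch only: the resource names, labels, and model string are hypothetical, and field names follow the extension's alpha API group (`inference.networking.x-k8s.io`) and may differ in your release.

```yaml
# Hypothetical sketch: an InferencePool groups llm-d worker pods and points the
# gateway at an endpoint-picker extension that scores candidate workers.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool              # hypothetical pool name
spec:
  selector:
    app: llm-d-worker           # hypothetical pod label
  targetPortNumber: 8000        # port the model servers listen on
  extensionRef:
    name: llm-d-scheduler       # hypothetical endpoint-picker service
---
# An InferenceModel maps a client-requested model name onto that pool.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct   # hypothetical model name
  criticality: Critical
  poolRef:
    name: llama-pool
```

An Istio gateway that supports the extension can then route HTTPRoute traffic to the pool, letting the picker apply llm-d's worker scoring per request rather than plain round-robin.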
Evaline Ju, Kelly Abuelsaad
KubeCon EU 2026