Power-aware Deep Learning Model Serving with µ-Serve
Haoran Qiu, Weichao Mao, et al.
USENIX ATC 2024
As generative AI adoption surges, organizations face a critical challenge: how to serve large language models (LLMs) efficiently without overprovisioning costly graphics processing unit (GPU) resources.
llm-d is a Kubernetes-native distributed inference platform that scales LLM serving by intelligently routing requests and disaggregating inference into prefill and decode stages across multiple vLLM instances.
The real challenge is scaling LLM inference without wasting GPU resources. It is no longer just about NVIDIA Multi-Instance GPU (MIG) slices or time-slicing: we need smarter, fine-grained sharing techniques that minimize idle capacity and maximize efficiency. This means dynamically allocating GPU memory and compute across multiple models while ensuring cooperation with the Kubernetes scheduler, so that resource optimization never violates Pod requirements. The goal is smaller, smarter scaling strategies that go beyond static partitioning for cost-effective, high-performance inference.
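For context, the static partitioning that fine-grained sharing aims to move beyond might look like the following DeviceClass sketch, which matches only one fixed MIG profile. The driver and attribute names (`gpu.nvidia.com`, `profile`) follow the NVIDIA DRA driver's conventions but are assumptions here, not something this abstract specifies:

```yaml
# Sketch of static partitioning: a DeviceClass that matches only a
# fixed MIG profile (1g.10gb). Driver/attribute names are assumed.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: mig-1g-10gb
spec:
  selectors:
  - cel:
      # CEL expression evaluated against each advertised device
      expression: device.attributes["gpu.nvidia.com"].profile == "1g.10gb"
```

Every claim against such a class receives the same fixed slice regardless of a model's actual footprint, which is precisely the idle capacity that finer-grained sharing aims to eliminate.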
In this lightning talk, we explore the Kubernetes community’s dynamic resource allocation (DRA) feature—why it’s a promising foundation for fine-grained GPU resource management and what extensions are still needed to achieve efficient LLM inference at scale. We’ll highlight capabilities such as PrioritizedList, PartitionableDevices, and the recently introduced ConsumableCapacity, and explain how they enable dynamic resource requests and adjustments. Through real-world examples, we’ll demonstrate how DRA can help serve multiple models of varying sizes efficiently and cost-effectively on Red Hat OpenShift.
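As a hedged sketch of how these capabilities compose, the ResourceClaim below uses PrioritizedList's `firstAvailable` to prefer a dedicated GPU and fall back to a MIG slice; the commented `capacity` stanza shows the shape proposed for ConsumableCapacity. Field names track the alpha DRA APIs and may differ by Kubernetes version, and the device class names are placeholders:

```yaml
# Hypothetical ResourceClaim: field names follow the alpha DRA APIs
# (DRAPrioritizedList / ConsumableCapacity proposals) and may change.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: llm-decode-gpu
spec:
  devices:
    requests:
    - name: gpu
      firstAvailable:            # PrioritizedList: alternatives in order
      - name: full-gpu           # preferred: a whole GPU
        deviceClassName: gpu.example.com
      - name: mig-slice          # fallback: a partitioned MIG device
        deviceClassName: mig.example.com
      # With ConsumableCapacity, a request could instead consume only part
      # of a shared device's capacity (proposed shape, an assumption):
      # capacity:
      #   requests:
      #     memory: 40Gi
```

A Pod would then reference the claim through its `resourceClaims` field, letting the scheduler bind whichever alternative is actually available on a node.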