Talk

Fit-to-Serve: How a New DRA Capability for Dynamic Device Sharing Fits into Distributed LLM Serving

Abstract

llm-d is a community-driven effort to modernize large language model serving at scale—natively within Kubernetes. At its core is a modular architecture that decouples prefill and decode operations. This disaggregated design unlocks precise tuning of compute and network resources, tailored to the unique demands of each phase.

But here’s the twist: how precisely can those resources be defined? A whole GPU? A MIG slice? Maybe something even finer? With a new capability proposed for the Dynamic Resource Allocation (DRA) framework, resource capacities for compute and network devices can now be dynamically requested and adjusted on the fly. At the same time, the core DRA capability enables device selection based on fine-grained attributes—including topology awareness—eliminating the need for clunky hacks or rigid resource pools.
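To make the attribute-based selection concrete, here is a minimal sketch of a DRA ResourceClaim that picks a device via a CEL selector (following the Kubernetes `resource.k8s.io/v1beta1` DRA API; the driver name `gpu.example.com` and the attribute names are hypothetical placeholders, not a real driver):

```yaml
# Sketch only: requests one GPU-class device whose driver-published
# attributes match a fine-grained constraint, instead of asking for
# an opaque count of a fixed resource type.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: decode-gpu
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com
      selectors:
      - cel:
          # Hypothetical attribute names published by the driver.
          expression: device.attributes["gpu.example.com"].profile == "mig-3g.40gb"
```

A scheduler-side CEL expression like this is what replaces hand-rolled labeling schemes and pre-carved resource pools.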

In this talk, we will demonstrate how this new DRA capability makes the llm-d framework more practical and cost-effective, explore the remaining challenges, and share lessons learned along the way.