Power-aware Deep Learning Model Serving with µ-Serve
Haoran Qiu, Weichao Mao, et al.
USENIX ATC 2024
As generative AI adoption surges, organizations face a critical challenge: how to serve large language models (LLMs) efficiently without overprovisioning costly graphics processing unit (GPU) resources.
llm-d is a Kubernetes-native distributed inference platform that scales LLM serving by intelligently routing requests and disaggregating inference into prefill and decode stages across multiple vLLM instances.
The real challenge is scaling LLM inference without wasting GPU resources. It is no longer just about NVIDIA Multi-Instance GPU (MIG) slices or time-slicing: we need smarter, fine-grained sharing techniques that minimize idle capacity and maximize efficiency. This means dynamically allocating GPU memory and compute across multiple models while ensuring cooperation with the Kubernetes scheduler, so that resource optimization never violates Pod requirements. The goal is smaller, smarter scaling strategies that go beyond static partitioning for cost-effective, high-performance inference.
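For context, the static partitioning that fine-grained sharing aims to move beyond might look like the following DeviceClass sketch, which matches only one fixed MIG profile. The driver and attribute names (`gpu.nvidia.com`, `profile`) follow the NVIDIA DRA driver's conventions but are assumptions here, not something this abstract specifies:

```yaml
# Sketch of static partitioning: a DeviceClass that matches only a
# fixed MIG profile (1g.10gb). Driver/attribute names are assumed.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: mig-1g-10gb
spec:
  selectors:
  - cel:
      # CEL expression evaluated against each advertised device
      expression: device.attributes["gpu.nvidia.com"].profile == "1g.10gb"
```

Every claim against such a class receives the same fixed slice regardless of a model's actual footprint, which is precisely the idle capacity that finer-grained sharing aims to eliminate.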
In this lightning talk, we explore the Kubernetes community’s dynamic resource allocation (DRA) feature—why it’s a promising foundation for fine-grained GPU resource management and what extensions are still needed to achieve efficient LLM inference at scale. We’ll highlight capabilities such as PrioritizedList, PartitionableDevices, and the recently introduced ConsumableCapacity, and explain how they enable dynamic resource requests and adjustments. Through real-world examples, we’ll demonstrate how DRA can help serve multiple models of varying sizes efficiently and cost-effectively on Red Hat OpenShift.
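As a hedged sketch of how these capabilities compose, the ResourceClaim below uses PrioritizedList's `firstAvailable` to prefer a dedicated GPU and fall back to a MIG slice; the commented `capacity` stanza shows the shape proposed for ConsumableCapacity. Field names track the alpha DRA APIs and may differ by Kubernetes version, and the device class names are placeholders:

```yaml
# Hypothetical ResourceClaim: field names follow the alpha DRA APIs
# (DRAPrioritizedList / ConsumableCapacity proposals) and may change.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: llm-decode-gpu
spec:
  devices:
    requests:
    - name: gpu
      firstAvailable:            # PrioritizedList: alternatives in order
      - name: full-gpu           # preferred: a whole GPU
        deviceClassName: gpu.example.com
      - name: mig-slice          # fallback: a partitioned MIG device
        deviceClassName: mig.example.com
      # With ConsumableCapacity, a request could instead consume only part
      # of a shared device's capacity (proposed shape, an assumption):
      # capacity:
      #   requests:
      #     memory: 40Gi
```

A Pod would then reference the claim through its `resourceClaims` field, letting the scheduler bind whichever alternative is actually available on a node.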