Technical note
4 minute read

IBM and UIUC develop an orchestration system to serve LLMs more efficiently

Large language models (LLMs) such as OpenAI GPT-4, IBM Granite, Google Gemini, and Meta Llama have enabled novel capabilities in a wide range of AI applications such as chatbots and coding assistants. These base models are further fine-tuned to support specialized tasks, such as copywriting, financial planning, code generation, and document summarization.

To meet enterprise and consumer demands, serving multiple models for various applications with latency-oriented service-level objectives (SLOs) has become increasingly critical. Early work in this area has largely focused on serving interactive requests, such as chatbot traffic, with tight latency SLOs on the order of seconds.

With the recent growth of a much broader range of enterprise use cases, there is also a need to serve batch requests, which have relaxed SLOs on the order of minutes to hours. However, SLO attainment can degrade depending on arrival rates, multiplexing, and configuration parameters. This necessitates an orchestration strategy with appropriate queue management, routing, and autoscaling. Our team at IBM Research, with support from researchers at the University of Illinois Urbana-Champaign, has been working to fill this urgent need with two new projects, QLM and Chiron.

How are latency SLOs defined?

There are two primary latency metrics for LLM inference. The first is time to first token (TTFT), the time required to complete the prefill step and generate the first token. The second is inter-token latency (ITL), the time required to generate each subsequent token during the decode phase. Together, these two latency requirements form the SLO for a request.
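As a minimal sketch of how these metrics can be measured, the snippet below computes TTFT and mean ITL from per-token completion timestamps and checks them against an SLO. The function and field names are illustrative, not part of QLM or Chiron.

```python
from dataclasses import dataclass

@dataclass
class RequestLatency:
    """Latency metrics for a single LLM request (illustrative names)."""
    ttft: float  # time to first token, seconds
    itl: float   # mean inter-token latency, seconds

def measure_latency(arrival_time: float, token_times: list[float]) -> RequestLatency:
    """Compute TTFT and mean ITL from per-token completion timestamps."""
    if not token_times:
        raise ValueError("request produced no tokens")
    ttft = token_times[0] - arrival_time
    if len(token_times) == 1:
        return RequestLatency(ttft=ttft, itl=0.0)
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return RequestLatency(ttft=ttft, itl=sum(gaps) / len(gaps))

def meets_slo(m: RequestLatency, ttft_slo: float, itl_slo: float) -> bool:
    """A request meets its SLO only if both latency targets are satisfied."""
    return m.ttft <= ttft_slo and m.itl <= itl_slo

# Example: a request arriving at t=0.0 whose tokens complete at 0.4s, 0.45s, and 0.5s.
print(meets_slo(measure_latency(0.0, [0.4, 0.45, 0.5]), ttft_slo=0.5, itl_slo=0.1))
```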

Overview of Chiron and QLM

We provide two versions of the system, Chiron and QLM (the name derives from "Queue Management for SLO-Oriented Large Language Model Serving"), depending on the deployment use case. When new instances can be added through resource autoscaling, we use Chiron. When the deployment has fixed capacity, we use QLM.

Chiron

[Figure: Overview of the Chiron system]

The figure above provides an overview of the first approach: Chiron.

Chiron follows a hierarchical design to meet TTFT and ITL SLOs while maximizing throughput in two ways: a local autoscaler scales the batch size of each individual instance, and a global orchestrator scales and orders requests across the interactive, mixed, and batch instances.
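The local half of this hierarchy can be pictured as a control loop that grows the running batch while the ITL SLO has headroom and shrinks it on violations. The sketch below is a hypothetical heuristic to illustrate the idea; the thresholds and back-off factor are assumptions, not Chiron's actual policy.

```python
def adjust_batch_size(current_batch: int, observed_itl: float,
                      itl_slo: float, max_batch: int) -> int:
    """One step of a hypothetical local batch-size autoscaler."""
    if observed_itl > itl_slo:
        # SLO violated: back off quickly to restore per-token latency.
        return max(1, current_batch // 2)
    if observed_itl < 0.8 * itl_slo:
        # Comfortable headroom: admit one more request to raise throughput.
        return min(max_batch, current_batch + 1)
    return current_batch  # near the SLO boundary: hold steady
```

Under this kind of policy, instances serving relaxed-ITL batch requests naturally settle at larger batch sizes than interactive ones, which is where the throughput gain comes from.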

Each request is preferentially routed to its own instance type (interactive requests to interactive instances and batch requests to batch instances), leading to non-uniform request routing in Chiron. If capacity is unavailable on the preferred instance type, the request is routed to a mixed instance. Mixed instances enable multiplexing between interactive and batch requests and drive up overall cluster utilization. For interactive requests, mixed instances absorb unpredictable spikes in request arrivals. For batch requests, mixed instances provide additional running capacity when there is not enough interactive traffic to fill them.
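A minimal sketch of this preferential routing might look like the following, assuming each instance exposes a has_capacity() check (an assumed interface, not Chiron's actual API):

```python
from collections import deque

def route_request(kind: str, pools: dict, global_queue: deque):
    """Route a request preferentially to its own instance type, falling back
    to mixed instances; if nothing has capacity, it waits in the global queue.
    `kind` is 'interactive' or 'batch'; `pools` maps 'interactive', 'batch',
    and 'mixed' to lists of instances (hypothetical objects)."""
    for pool_name in (kind, "mixed"):
        for instance in pools.get(pool_name, []):
            if instance.has_capacity():
                return instance
    global_queue.append(kind)  # no capacity anywhere: wait in the global queue
    return None
```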

To enable this multiplexing between interactive and batch requests while ensuring the immediate execution of interactive requests, mixed instances are preemptible: interactive requests can evict batch requests and send them back to the global queue. To prevent the throughput drop such an eviction would otherwise cause, we enable fast restarts by saving the evicted request's KV cache and migrating it to CPU memory.
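To make the eviction path concrete, here is a hedged sketch of what preemption with a CPU-offloaded KV cache could look like; every method name (pick_batch_victim, to_cpu, admit) is a placeholder rather than a real API.

```python
def preempt_for_interactive(mixed_instance, interactive_request, global_queue):
    """Hypothetical preemption on a mixed instance: evict one running batch
    request, move its KV cache to CPU memory so it can restart quickly,
    and admit the interactive request in its place."""
    victim = mixed_instance.pick_batch_victim()      # choose a batch request to evict
    if victim is not None:
        victim.kv_cache = victim.kv_cache.to_cpu()   # migrate KV cache off the GPU
        global_queue.append(victim)                  # re-queue for a later fast restart
    mixed_instance.admit(interactive_request)        # interactive work starts immediately
```

The idea is that, on restart, the evicted request can reuse its saved KV cache and resume decoding rather than recompute its prefill.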

The global autoscaler is driven by waiting-time estimation over the request queue. As the queue grows, the statistical effects of continuous batching allow Chiron to place a tighter bound on waiting time.
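A simplified version of this waiting-time-driven scaling decision is sketched below: it treats per-instance throughput as roughly steady and asks how many instances are needed to drain the queue before its deadline. Chiron's actual estimator also accounts for the statistical effects of continuous batching mentioned above; the numbers in the example call are made up.

```python
import math

def instances_needed(queued_requests: int, per_instance_throughput: float,
                     seconds_to_deadline: float) -> int:
    """Instances required to drain the queue by its deadline, assuming
    roughly steady per-instance throughput (a simplification)."""
    if seconds_to_deadline <= 0:
        raise ValueError("deadline already passed")
    required_rate = queued_requests / seconds_to_deadline  # requests per second needed
    return math.ceil(required_rate / per_instance_throughput)

def scale_up_by(current_instances: int, queued_requests: int,
                per_instance_throughput: float, seconds_to_deadline: float) -> int:
    """Number of instances the global autoscaler should add (never negative)."""
    needed = instances_needed(queued_requests, per_instance_throughput, seconds_to_deadline)
    return max(0, needed - current_instances)

# Made-up example: 90,000 queued requests, 5 req/s per instance, 30 minutes left.
print(scale_up_by(current_instances=4, queued_requests=90_000,
                  per_instance_throughput=5.0, seconds_to_deadline=30 * 60))  # -> 6
```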

QLM

[Figure: Overview of the QLM system]

The figure above provides an overview of the second approach: QLM, designed for fixed-capacity deployments. In addition to the routing and eviction mechanisms from Chiron, QLM uses model swapping to let multiple models share the same serving instance.
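As a rough illustration of model swapping, a fixed-capacity instance can hold one model at a time and swap in a different model only when the next request needs it; the load/unload hooks below are placeholders, not a real serving API.

```python
class SwappableInstance:
    """Hypothetical serving instance that lets several models share one GPU
    by swapping the loaded model when the incoming request requires it."""

    def __init__(self, load_fn, unload_fn):
        self._load, self._unload = load_fn, unload_fn
        self.current_model = None

    def serve(self, model_name: str, prompt: str) -> str:
        if self.current_model != model_name:      # swap only when the model changes
            if self.current_model is not None:
                self._unload(self.current_model)  # free GPU memory held by the old model
            self._load(model_name)
            self.current_model = model_name
        return f"generated output for {prompt!r} with {model_name}"
```

Because swapping is not free, batching requests for the same model together, as the request groups described next do, keeps swaps infrequent.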

Every incoming request is grouped with other requests that share common performance characteristics (such as model type, SLO value, and token distribution) to form request groups. Request groups are a useful abstraction for applying waiting-time estimation. Requests in a group are then assigned to a virtual queue, which represents the waiting queue for an LLM serving instance in the cluster. The ordering of request groups within a virtual queue determines the execution order of requests on the corresponding LLM serving instance. While requests are assigned to groups in first-come, first-served order, the groups in a virtual queue are reordered by the global scheduler to maximize SLO attainment across all requests being served.
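The sketch below illustrates the grouping and virtual-queue abstractions under simplified assumptions: requests are keyed only by (model, SLO), groups are spread round-robin across instances, and each virtual queue is ordered by SLO tightness as a stand-in for QLM's actual reordering optimization.

```python
from collections import defaultdict

def build_virtual_queues(requests: list[dict], num_instances: int) -> list[list]:
    """Form request groups and assign them to per-instance virtual queues.
    Each request is a dict with 'model' and 'slo_seconds' keys (illustrative)."""
    groups = defaultdict(list)
    for req in requests:                                  # FCFS within a group
        groups[(req["model"], req["slo_seconds"])].append(req)

    virtual_queues = [[] for _ in range(num_instances)]
    for i, key in enumerate(groups):                      # spread groups across instances
        virtual_queues[i % num_instances].append((key, groups[key]))

    for queue in virtual_queues:                          # tighter-SLO groups execute first
        queue.sort(key=lambda item: item[0][1])
    return virtual_queues

# Example: two models, mixed interactive (tight SLO) and batch (loose SLO) traffic.
qs = build_virtual_queues(
    [{"model": "granite", "slo_seconds": 2},
     {"model": "granite", "slo_seconds": 3600},
     {"model": "llama", "slo_seconds": 2}],
    num_instances=2)
```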

[Figure: A sample execution workflow]

In the figure above, we show an example workflow for Chiron and compare it against Llumnix, a state-of-the-art LLM orchestration system. Initially, the workload comprises only interactive requests arriving according to a Gamma distribution with a mean of 30 requests per second and a CV of 4. Both Chiron and Llumnix are over-provisioned in this scenario, with an average of 15 GPUs. Note that we use the tuned version of Llumnix, which has instance-level throughput similar to Chiron's. Five minutes in, the batch request queue is populated with 1 million requests. Llumnix does not queue these batch requests and immediately starts adding instances, reducing per-GPU utilization, until the maximum cluster capacity of 50 instances is reached. Chiron, on the other hand, keeps batch requests in the queue and prefers to multiplex them onto the over-provisioned capacity of 10 GPUs (out of the 15).

As batch requests have a relaxed ITL SLO, Chiron's local autoscaler is able to sustain a higher throughput of 20 requests per second on this over-provisioned capacity. After 50 minutes, Chiron's waiting-time estimation calculates that roughly 200,000 requests remain to be processed, and 10 new instances are added to finish the queue by the deadline. At 65 minutes, all requests are completed by Chiron. Because Llumnix does not adapt the batch size for the newly added instances, it continues to serve requests at reduced throughput. Consequently, by the 65-minute deadline, only 50% of requests sent through Llumnix satisfy their SLOs. Overall, in this scenario, Chiron uses 60% fewer GPU node-hours while meeting all SLOs.

The benefits of QLM and Chiron from multiplexing, dynamic batch sizes, and model swapping translate into reduced serving costs, as shown in the figure below. The workload is sampled from the ShareGPT dataset with an equal split between batch and interactive requests.

[Figure: Serving cost comparison]
