Short paper

FLYT: Transparent and Elastic GPU Provisioning for Multi-Tenant Cloud Services

Abstract

Modern cloud services such as AI inference, video analytics, and scientific computing exhibit highly variable and bursty GPU demand patterns that static provisioning and coarse-grained sharing mechanisms struggle to accommodate efficiently. Existing GPU multiplexing approaches, including NVIDIA MPS and MIG, provide limited flexibility in multi-tenant environments, often leading to resource fragmentation, under-utilization, or unpredictable latency. We present Flyt, a transparent, latency-aware GPU orchestration framework for virtualized cloud services. Flyt enables fine-grained runtime scaling of Streaming Multiprocessors (SMs) and breaks the traditional VM–GPU binding by allowing applications inside a VM to execute on different GPUs over time. This design supports elastic scaling and live inter-node GPU migration without application or guest OS modifications, by virtualizing GPU memory through address translation and enforcing elastic SM execution caps. An evaluation on heterogeneous GPUs using TorchServe and Rodinia benchmark applications demonstrates that Flyt maintains predictable, bounded latency under dynamic workloads while significantly improving GPU utilization compared to static provisioning. In co-located VM–GPU deployments using shared-memory transport, Flyt achieves performance within 12–15% of native execution for most workloads while providing latency isolation and elasticity under contention, demonstrating that elastic SM allocation can maintain latency targets under bursty load without hardware partitioning.