Workshop paper

To Virtualize or Not to Virtualize: Experiences from Building Two Generations of Virtualized Infrastructure for LLM Training

Abstract

Large Language Model (LLM) training workloads share computational characteristics with high-performance computing applications: intensive parallel processing, complex matrix operations, and distributed computation with frequent synchronization, all of which demand specialized hardware to deliver optimal performance.

This talk presents insights from Vela, a cloud-native system architecture introduced in 2021 for LLM training on commercial hardware and open-source software. The Vela architecture combines off-the-shelf hardware, Linux KVM virtualization with PCIe passthrough, and virtualized RDMA over Converged Ethernet (RoCE) networking. The system employs software-defined networking with SR-IOV for GPUDirect RDMA, achieving near-bare-metal performance while retaining the benefits of virtualization.
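To make the host-side building blocks concrete, the sketch below shows how PCIe passthrough and SR-IOV state can be inspected through the Linux sysfs interface. This is an illustrative check, not tooling from the Vela system; the PCI addresses are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Illustrative host-side check (not from the paper): verify that a PCI
device is bound to vfio-pci for KVM passthrough, and list the SR-IOV
virtual functions of a NIC, using only the Linux sysfs interface."""
from pathlib import Path

def driver_of(pci_addr: str) -> str:
    """Return the kernel driver currently bound to a PCI device."""
    link = Path(f"/sys/bus/pci/devices/{pci_addr}/driver")
    return link.resolve().name if link.exists() else "none"

def sriov_vfs(pci_addr: str) -> list[str]:
    """List the PCI addresses of a device's SR-IOV virtual functions."""
    dev = Path(f"/sys/bus/pci/devices/{pci_addr}")
    return sorted(vf.resolve().name for vf in dev.glob("virtfn*"))

if __name__ == "__main__":
    gpu = "0000:17:00.0"   # hypothetical GPU PCI address
    nic = "0000:4b:00.0"   # hypothetical RoCE NIC PCI address
    # For KVM passthrough, the GPU must be bound to vfio-pci on the host.
    print(f"GPU {gpu} driver: {driver_of(gpu)}")
    # Each VM would receive a NIC virtual function for GPUDirect RDMA.
    print(f"NIC {nic} VFs: {sriov_vfs(nic)}")
```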

Based on multiple data center deployments and design iterations, we present two case studies examining what it takes for virtualization-based systems to deliver performance matching (a) bare-metal RoCE and (b) bare-metal InfiniBand for LLM training workloads. The discussion focuses on the virtualization challenges, operational experiences, and runtime optimizations required to reach this performance in cloud-native training infrastructure.
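As an illustration of the kind of runtime tuning involved, the sketch below launches a distributed training job with NCCL environment variables commonly adjusted for RoCE inside a VM. The specific values, device names, and the training script are assumptions for illustration, not the settings used in Vela.

```python
import os
import subprocess

# Illustrative NCCL settings for RoCE in a virtualized environment
# (assumed values, not the paper's actual configuration).
env = dict(
    os.environ,
    NCCL_IB_HCA="mlx5_0",               # hypothetical: RDMA device backed by the SR-IOV VF
    NCCL_IB_GID_INDEX="3",              # hypothetical: GID index selecting RoCEv2
    NCCL_SOCKET_IFNAME="eth0",          # hypothetical: interface for NCCL bootstrap traffic
    NCCL_IB_PCI_RELAXED_ORDERING="1",   # often relevant for virtualized PCIe topologies
)
# "train.py" is a placeholder training entry point.
subprocess.run(["torchrun", "--nproc_per_node=8", "train.py"], env=env, check=True)
```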