Poster

From Device Passthrough to Host Passout: Exploring RAS Risks in {High-Performance, Virtualized} AI-Systems

Abstract

Modern cloud platforms increasingly rely on PCIe device passthrough to meet the performance demands of tenants using accelerators and high-speed storage. Although the functional correctness of PCIe passthrough is well studied, its implications for system reliability, availability, and serviceability (RAS) remain underexplored. In this work, we investigate how exposing the extended configuration space and device-specific registers of passthrough PCIe devices to virtual machines (VMs) can introduce instability in the host system. Through empirical analysis, we identify several failure modes, including device resets and host crashes, that a guest VM can trigger through legal but unexpected accesses to the PCIe configuration space of a passthrough device. These issues undermine reliability and availability guarantees and increase the operational burden for cloud providers.

As our threat model, we consider a cloud environment using the KVM+QEMU virtualization stack, in which at least one PCIe device is passed through to a guest VM. These guest VMs emulate a modern Q35 chipset, which enables PCIe support and gives the guest access to the extended capabilities of the passed-through devices. We assume that the cloud environment follows established best practices for configuring and managing PCIe passthrough devices, as outlined in resources such as the OpenStack Security Guide [1] and the Red Hat Security Guide [2]. Even under these configurations, our initial experiments have revealed critical RAS-related failure modes in more than five PCIe devices across common AI system components, including GPUs, NVMe SSDs, and NICs.
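To make the exposed surface concrete, the following is a minimal, read-only sketch of how the extended capability list in a device's 4 KiB PCIe configuration space can be enumerated. It is an illustration rather than the tooling used in this work: the function name `walk_extended_capabilities` and the synthetic buffer are hypothetical. The sketch relies only on the standard extended capability header layout (capability ID in bits 15:0, version in bits 19:16, next-capability offset in bits 31:20, with the list starting at offset 0x100).

```python
import struct

PCI_CFG_SPACE_SIZE = 256  # legacy configuration space; extended space starts here


def walk_extended_capabilities(config: bytes):
    """Return (offset, cap_id, version) for each PCIe extended capability.

    `config` is a raw dump of a device's configuration space. If only the
    legacy 256 bytes are present, the result is an empty list.
    """
    caps = []
    seen = set()  # guards against malformed, looping capability lists
    offset = PCI_CFG_SPACE_SIZE
    while (offset >= PCI_CFG_SPACE_SIZE
           and offset + 4 <= len(config)
           and offset not in seen):
        seen.add(offset)
        (header,) = struct.unpack_from("<I", config, offset)
        if header in (0, 0xFFFFFFFF):
            break  # no capability here, or the device did not respond
        cap_id = header & 0xFFFF
        version = (header >> 16) & 0xF
        caps.append((offset, cap_id, version))
        offset = (header >> 20) & 0xFFC  # next pointer, dword-aligned
    return caps


if __name__ == "__main__":
    # Synthetic configuration space (hypothetical device): an Advanced Error
    # Reporting capability (ID 0x0001) at 0x100 that links to a Device Serial
    # Number capability (ID 0x0003) at 0x148.
    buf = bytearray(4096)
    struct.pack_into("<I", buf, 0x100, 0x0001 | (2 << 16) | (0x148 << 20))
    struct.pack_into("<I", buf, 0x148, 0x0003 | (1 << 16))
    for off, cap_id, ver in walk_extended_capabilities(bytes(buf)):
        print(f"offset=0x{off:03x} id=0x{cap_id:04x} version={ver}")
```

On a Linux system, the input bytes could instead be read from `/sys/bus/pci/devices/<BDF>/config`, which typically requires root privileges to return more than the standard header. Running the same enumeration inside a guest and on the host would show which extended capabilities the virtualization stack leaves visible to the VM.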

These RAS challenges can be addressed at different layers of the virtualization stack, including the hypervisor, the host operating system, and the PCIe device itself. In this poster, we discuss the pros and cons of tackling these issues at each layer. We also explore why such issues arise in modern systems and how they can be systematically prevented in future designs.