Poster

High-performance storage tier management for container-native AI workloads

Abstract

Running a distributed LLM training job with specific hardware requirements (memory, GPUs) across a large number of nodes poses nontrivial resource management challenges. For example, when a new job starts, some pods may already be assigned to eligible nodes while others are still waiting to be bound. Since the job cannot proceed until all pods are ready, the result is GPU idling. The OpenShift AI deployment on Vela leverages Kueue to address this problem and to improve fair sharing of resources among tenants/namespaces. However, Kueue and similar existing workload management systems focus primarily on compute resources rather than storage, leaving a gap for storage-intensive AI workloads. LADF automates and accelerates data loading for Vela-like Kubernetes-orchestrated high-performance distributed systems. Integrated with Kueue's admission check, LADF intercepts workloads and evaluates data readiness: if the contents of the remote object storage bucket are not yet cached in the GPFS cluster, the STARM workload controller creates a CVO custom resource (CR) to trigger a download via AFM. Only upon successful validation does the workload progress to the pod creation and resource allocation phases.
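The gating logic above (admit only when the bucket is cached, otherwise request a download and keep the workload pending) can be sketched as follows. This is an illustrative toy model, not LADF's actual controller code: the names `Cluster`, `evaluate_readiness`, and `trigger_download` are assumptions, and the real system operates on Kubernetes CRs rather than in-memory objects.

```python
from dataclasses import dataclass, field
from enum import Enum


class CheckState(Enum):
    """Possible outcomes of the data-readiness admission check."""
    READY = "Ready"       # data cached in GPFS; workload may proceed
    PENDING = "Pending"   # download in flight; workload stays gated


@dataclass
class Cluster:
    """Toy stand-in for the GPFS cache state and the CVO CRs that drive AFM."""
    cached_buckets: set = field(default_factory=set)
    cvo_crs: list = field(default_factory=list)

    def trigger_download(self, bucket: str) -> None:
        # In LADF this step corresponds to the STARM controller creating
        # a CVO CR, which AFM then acts on to pull the bucket into GPFS.
        self.cvo_crs.append({"kind": "CVO", "bucket": bucket})


def evaluate_readiness(cluster: Cluster, bucket: str) -> CheckState:
    """Admission check: admit only when the bucket is already cached;
    otherwise request a download (once) and keep the workload pending."""
    if bucket in cluster.cached_buckets:
        return CheckState.READY
    # Idempotent: create at most one CVO CR per bucket.
    if not any(cr["bucket"] == bucket for cr in cluster.cvo_crs):
        cluster.trigger_download(bucket)
    return CheckState.PENDING
```

On a later reconcile, once AFM has finished and the bucket appears in the cache, the check flips to `READY` and the workload is admitted to the pod creation phase; until then it never consumes GPUs.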