Nicholas Nordlund, Vassilis Vassiliadis, et al.
CLOUD 2021
Training-as-a-service platforms let users deploy pre-configured Generative AI training jobs as batch workloads. Because the configuration is immutable, there is little flexibility to adapt dynamically to training progress. Existing approaches invariably involve manually monitoring training progress on a dashboard, and stopping, reconfiguring, and restarting training does not scale with the number of experiments. Relying on pre-configuration wastes computational resources and makes debugging training jobs difficult. We address this gap through our training-control-as-code paradigm, which allows users to run user-defined code that analyzes the training state and intervenes to flag anomalies and reduce resource wastage. Our framework, TrAC, offers a declarative interface for specifying the desired control and reusing it at scale. Using real-world open-source data and models, we provide estimates of the time and resource savings due to TrAC. We also provide a demo video (https://youtu.be/RmhBfFjd1oA) and code (https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/examples/trainercontroller_configs/Readme.md).
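To make the training-control-as-code idea concrete, here is a minimal sketch of the kind of user-defined control the abstract describes, written against Hugging Face's generic TrainerCallback API rather than TrAC's own declarative interface; the controller name, loss threshold, and patience values are illustrative assumptions, and TrAC's actual config schema is documented in the linked Readme.

```python
# Illustrative sketch (not TrAC itself): a user-defined callback inspects the
# training state on every log event and stops the job early when the loss
# plateaus, instead of relying on a pre-configured, immutable run length.
from transformers import TrainerCallback


class LossPlateauController(TrainerCallback):  # hypothetical controller
    def __init__(self, patience: int = 3, min_delta: float = 1e-3):
        self.patience = patience      # tolerated log events without improvement
        self.min_delta = min_delta    # minimum loss decrease that counts
        self.best_loss = float("inf")
        self.stale_logs = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is None or "loss" not in logs:
            return control
        if logs["loss"] < self.best_loss - self.min_delta:
            self.best_loss = logs["loss"]
            self.stale_logs = 0
        else:
            self.stale_logs += 1
        if self.stale_logs >= self.patience:
            # Intervene: end the job early to save compute.
            control.should_training_stop = True
        return control
```

Registered via `Trainer(..., callbacks=[LossPlateauController()])`, such a controller runs inside the job itself, so no dashboard monitoring or stop-reconfigure-restart cycle is needed; TrAC's declarative configs express this same rule-plus-operation pattern in a form that can be reused across many experiments.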
Weichao Mao, Haoran Qiu, et al.
NeurIPS 2023
Gal Amram, Ora Nova Fandina, et al.
ASE 2025
Yue Zhu, Chen Wang, et al.
MASCOTS 2024