Marcelo Amaral, Tatsuhiro Chiba, et al.
CLOUD 2022
Training-as-a-service platforms let users deploy pre-configured Generative AI training jobs as batch workloads. Because the configuration is immutable, jobs have little flexibility to adapt dynamically to training progress. Existing approaches invariably involve manually monitoring training progress on a dashboard, and stopping, reconfiguring, and restarting training does not scale with the number of experiments. Relying on pre-configuration wastes computational resources and makes debugging training jobs difficult. We address this gap through our training-control-as-code paradigm, which allows users to run user-defined code that analyzes the training state and intervenes to flag anomalies and avoid wasted resources. Our framework, TrAC, offers a declarative interface for specifying the desired controls and reusing them at scale. Using real-world open-source data and models, we estimate the time and resource savings TrAC delivers. We also provide a demo video (https://youtu.be/RmhBfFjd1oA) and code (https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/examples/trainercontroller_configs/Readme.md).
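The "user-defined code to analyze the training state and intervene" described in the abstract maps naturally onto Hugging Face's TrainerCallback hooks, which fms-hf-tuning builds on. The sketch below is a minimal, hypothetical illustration of such a control; the class name StopOnStagnantLoss, the stagnation rule, and the patience/min_delta parameters are assumptions of ours, not TrAC's actual rule syntax (the linked Readme documents the real declarative configs).

```python
# Hypothetical sketch of the training-control-as-code idea, expressed with
# the Hugging Face TrainerCallback API. The rule (stop when the logged loss
# has not improved for a window of logging steps) is illustrative only.
from transformers import TrainerCallback


class StopOnStagnantLoss(TrainerCallback):
    """Flag an anomaly and stop training when the logged loss has not
    improved by at least `min_delta` for `patience` consecutive logs."""

    def __init__(self, patience: int = 5, min_delta: float = 1e-3):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.stale_logs = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is None:
            return control
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.stale_logs = 0
        else:
            self.stale_logs += 1
        if self.stale_logs >= self.patience:
            # Intervene instead of letting the job burn compute to completion.
            print(f"[control] loss stagnant at step {state.global_step}; stopping")
            control.should_training_stop = True
        return control


# Usage (sketch): Trainer(model=..., args=..., callbacks=[StopOnStagnantLoss()])
```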
Pranjal Gupta, Karan Bhukar, et al.
ICPE 2025
Abhishek Malvankar, Olivier Tardieu
KubeCon EU 2024
Darya Kaviani, Sijun Tan, et al.
RWC 2025