Demo paper

Training-Control-as-Code: Towards a declarative solution to control training

Abstract

Training-as-a-service platforms facilitate users to deploy pre-configured Generative AI training jobs as batch workloads. The immutability of configuration offers minimal flexibility to dynamically adapt to training progress. Existing approaches invariably involve manually monitoring training progress on a dashboard, and the stop-reconfigure-restart of training does not scale well with number of experiments. Relying on pre-configuration, wastes computational resources and makes debugging of training jobs difficult. We address this gap through our training-control-as-code paradigm, which allows users to run user-defined code to analyze the training state and intervene to flag anomalies and save resource wastage. Our framework TrAC offers a declarative interface to allow for declaring desired control and for reusing it at scale. Using real-world open-source data and models we provide estimates on the savings in time and resource due to TrAC. We also provide demo video: https://youtu.be/RmhBfFjd1oA and code: https://github.com/foundation-model-stack/fms-hf-tuning/ blob/main/examples/trainercontroller configs/Readme.md