Conference paper

From Natural Language to Executable ETL Flows: The IBM DataStage Assistant

Abstract

Modern ETL (Extract, Transform, Load) tools offer graphical, no-code interfaces for workflow creation but still require users to manually identify transformation functions and configure their properties, which is time-consuming and demands prior expertise. We present the research and engineering foundations of the IBM DataStage Assistant, a deployed capability that generates complete multi-stage ETL flows directly from natural language (NL) descriptions. Our framework infers transformation functions, their properties, and transformer expressions, enabling novices to discover relevant functions and allowing experts to bypass manual configuration. The proposed framework achieves a prediction accuracy of 96.4% for flows, 87.0% for properties, and 83.6% for transformer expressions. We also present a document exploration module that uses retrieval-augmented generation (RAG) over product documentation to answer tool-specific questions in NL. Implemented in IBM DataStage, this approach supports iterative, in-environment workflow design and reduces context switching. In initial studies, it achieves time savings of up to 90% for novices and 50% for experts.