Conference paper

Dataset Definition in AI for Chemistry: Insights from Data Science Practitioners in Battery and PFAS Research

Abstract

Understanding how chemistry experts engage with data science workflows is essential for advancing AI-driven research in materials science and environmental chemistry. This study, conducted at IBM Research, explores how chemists, computational scientists, and data scientists define datasets for fine-tuning foundation models, the large-scale machine learning models increasingly used in chemical discovery and analysis. It focuses on two critical domains: battery materials and PFAS (per- and polyfluoroalkyl substances).

We conducted two user studies involving four participants with chemistry and data science expertise: a chemist and materials scientist (P2), two computational chemists (P3 and P4), and a domain-agnostic data scientist (P1) who collaborated closely with the others. Study A used open-ended interviews to explore workflows and challenges, while Study B employed structured interviews to map the dataset definition process for downstream tasks such as molecular screening and predictive modeling.

The studies revealed varied approaches to dataset creation and use. P2 integrates experimental data with computational workflows, often generating custom datasets in the lab. P3 and P4 focus on simulation data and lightweight analytics, while P1 contributes model-centric strategies using literature-based datasets. These datasets are either publicly available or derived from public sources with the help of domain expertise.

Study B identified four key stages in dataset definition: (1) identifying candidate datasets, (2) verifying scientific rigor, (3) validating with domain experts, and (4) creating new datasets when needed. These stages are iterative and require careful tracking of data provenance, transformations, and the decisions made.
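
As an illustration only, the four stages and the tracking they call for can be sketched as a small decision log. This is a minimal sketch under stated assumptions, not the Discovery Workbench implementation: the names `Stage`, `Decision`, and `DatasetDefinitionLog`, the dataset labels, and the rationale strings are all hypothetical and do not come from the studies.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # The four stages reported in Study B.
    IDENTIFY = "identify candidate datasets"
    VERIFY = "verify scientific rigor"
    VALIDATE = "validate with domain experts"
    CREATE = "create new dataset"


@dataclass
class Decision:
    # One recorded decision about one dataset (hypothetical schema).
    stage: Stage
    dataset: str
    accepted: bool
    rationale: str


@dataclass
class DatasetDefinitionLog:
    # Append-only log, so rejected datasets keep their history.
    task: str
    decisions: list = field(default_factory=list)

    def record(self, stage, dataset, accepted, rationale):
        self.decisions.append(Decision(stage, dataset, accepted, rationale))

    def history(self, dataset):
        # All decisions about one dataset, in the order they were made.
        return [d for d in self.decisions if d.dataset == dataset]

    def discarded(self):
        # Datasets whose most recent decision was a rejection.
        latest = {}
        for d in self.decisions:
            latest[d.dataset] = d.accepted
        return [name for name, ok in latest.items() if not ok]


# Hypothetical walk-through: a candidate is identified, fails verification,
# and a new dataset is created from it.
log = DatasetDefinitionLog(task="molecular screening")
log.record(Stage.IDENTIFY, "public_set_a", True, "covers the target property")
log.record(Stage.VERIFY, "public_set_a", False, "measurement units undocumented")
log.record(Stage.CREATE, "curated_set_b", True, "derived from public_set_a")
```

Because the log is append-only, the rationale for rejecting `public_set_a` remains available after the workflow moves on to `curated_set_b`, which reflects the iterative character of the stages.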

Key insights include the importance of dataset traceability, the role of domain expertise in validation, and the challenges of managing hybrid workflows across local and cloud environments. The proposed Discovery Workbench workflow supports researchers by capturing metadata and rationale throughout the dataset lifecycle—even when datasets are discarded or redefined.

This work highlights the central role of dataset definition in AI for chemistry, offering actionable insights for developing tools that promote reproducibility, traceability, collaboration, and scientific rigor in data-driven chemical research.