An open-source family of large encoder-decoder foundation models for chemistry

Abstract

The use of foundation models has extended from natural language processing to molecular modeling. In this context, large-scale pre-training strategies have been applied to chemical language models to enable representation learning across diverse tasks. Here we introduce a family of encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million molecular sequences from PubChem. These models support a range of applications, including property estimation and reaction outcome prediction. We evaluate two model variants across several benchmark datasets and show that they match or exceed existing approaches. We also assess the structure of the learned representations and find that the embedding space supports few-shot learning and separates molecules based on chemically relevant features. This structure appears to result from the decoder-based reconstruction objective used during pre-training. These findings suggest that the proposed models can serve as general-purpose tools for molecular analysis and reasoning with minimal supervision.