Indra Priyadarsini S, Seiji Takeda, et al.
ACS Fall 2025
Most large-scale chemical language models are trained on a single textual molecular representation using self-supervised learning over large unlabeled corpora. These models excel in tasks such as property prediction and molecule generation by learning contextualized representations of input tokens. However, relying solely on one representation may discard structural or semantic information captured by alternative formats and may limit the model's ability to generalize across diverse molecular encodings. To address this limitation, we incorporate multiple textual molecular representations—including molecular formula, IUPAC name, International Chemical Identifier (InChI), SMILES, and SELFIES—into a unified vocabulary to harness the unique strengths of each format. Here, we introduce a large encoder-decoder chemical foundation model based on the Transformer architecture, designed to support multi-representational inputs. The model is pre-trained BERT-style on 117 million molecules per representation, sourced from PubChem, yielding a corpus of approximately 35 billion molecular tokens. This model serves as a foundation for chemical language research, supporting complex tasks including molecular property prediction, classification, and reconstruction. Furthermore, studies of the multi-textual molecular latent space indicate cross-representation alignment and reveal how different textual encodings of the same molecule can converge toward a unified semantic representation. This shared space may facilitate deeper insights into molecular structure, enhance generalization, and support a broad range of downstream applications.
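As a minimal sketch of the multi-representation idea (not the authors' pipeline), the snippet below derives several of the textual views named in the abstract from a single SMILES string, assuming the open-source RDKit and selfies packages; IUPAC names are omitted because standard open-source toolkits do not generate them.

```python
# Minimal sketch, not the authors' preprocessing pipeline: produce multiple
# textual encodings of one molecule, as used for the unified vocabulary.
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
import selfies as sf


def text_views(smiles: str) -> dict:
    """Return several textual representations of a molecule given its SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return {
        "smiles": Chem.MolToSmiles(mol),                 # canonical SMILES
        "selfies": sf.encoder(smiles),                   # SELFIES string
        "inchi": Chem.MolToInchi(mol),                   # InChI
        "formula": rdMolDescriptors.CalcMolFormula(mol), # molecular formula
        # IUPAC name generation is not available in open-source toolkits.
    }


print(text_views("CC(=O)Oc1ccccc1C(=O)O"))  # example: aspirin
```

In a pre-training setup along these lines, each of the returned strings would be tokenized against the shared multi-representation vocabulary before masking.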