M. Sprik, U. Röthlisberger, et al.
Molecular Physics
Foundation models have transformed how many tasks are performed and are an area of active exploration in small-molecule drug discovery. Typically, small-molecule foundation models focus on a single representation of the molecule, such as a SMILES string fed into a text-based model. However, molecules may be represented in numerous ways, including as images, chemically bonded graphs, or three-dimensional structures. Each representation or ‘view’ contains different, potentially complementary information that, if combined, can yield a more accurate and robust model. Here we describe a multi-view foundation model that incorporates several pre-trained representations to achieve this goal. Each view has already been pre-trained on hundreds of millions of molecules. We evaluate the complementarity of the representations in embedding space. We explore multi-modal late-fusion techniques and fine-tune our models on datasets covering a wide variety of downstream tasks. We find that, overall, our multi-view models can outperform models that rely on a single representation.
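As an illustration of what late fusion of pre-trained views can look like, the following is a minimal sketch and not the method used in this work. It assumes each view has already been encoded into a fixed-length embedding per molecule, and it fuses the views by normalising and concatenating them. The function name `late_fusion`, the view names, the embedding sizes, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np


def late_fusion(view_embeddings, weights=None):
    """Fuse per-view embeddings by weighted concatenation (illustrative only).

    view_embeddings: dict mapping a view name to an array of shape
        (n_molecules, d_view), one row per molecule.
    weights: optional dict mapping a view name to a scalar weight.
    """
    names = sorted(view_embeddings)  # fixed order so columns are reproducible
    if weights is None:
        weights = {name: 1.0 for name in names}

    blocks = []
    for name in names:
        emb = np.asarray(view_embeddings[name], dtype=float)
        # L2-normalise each row so no view dominates purely through its scale.
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        emb = emb / np.clip(norms, 1e-12, None)
        blocks.append(weights[name] * emb)

    # Late fusion: each view is encoded independently, then joined at the end.
    return np.concatenate(blocks, axis=1)


# Random arrays stand in for the outputs of pre-trained encoders (assumption).
rng = np.random.default_rng(0)
views = {
    "smiles_text": rng.normal(size=(4, 8)),
    "graph": rng.normal(size=(4, 6)),
    "image": rng.normal(size=(4, 5)),
}
fused = late_fusion(views)
print(fused.shape)  # (4, 19)
```

In a setup like this, the fused vector would be passed to a task-specific head during fine-tuning. Concatenation is only one possible choice; learned weights or an attention-based aggregator could take its place.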