Helgi I. Ingolfsson, Chris Neale, et al.
PNAS
Efficient exploration of synthetically accessible chemical space is a key task in data-driven hit finding and lead exploration. Large databases of accessible compounds can in principle be searched for accessible analogues of a molecule of interest, but their size makes such searches prohibitively expensive. We describe a scalable workflow that uses the latent space of a molecular generative model to produce vector representations of a large-scale database, and leverages technology developed for retrieval-augmented generation in the language modeling domain to enable fast, similarity-based search.
Our pipeline comprises (i) a conditional variational autoencoder (CVAE) trained on a chemical corpus with conditioning signals relevant to design objectives; (ii) embedding of millions to billions of molecules from compound libraries into this latent space; (iii) vector indexing using Milvus to provide fast, approximate nearest-neighbor search; and (iv) an interface that accepts an arbitrary molecule representation and returns a set of molecules from the catalog. Retrieval operates in the learned manifold rather than on hand-engineered fingerprints, so it can prioritize surrogates for the input candidate molecule that are consistent with the generative model’s representations. The embedding and database creation processes are high-throughput and can be readily applied to multiple compound libraries.
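The embed, index, and search steps can be illustrated with a small sketch. It is not the authors' implementation: `toy_encoder` is a hypothetical stand-in for the trained CVAE encoder (it merely hashes character bigrams of a SMILES string), and `LatentIndex` is a hypothetical in-memory class that performs exact brute-force search where the described pipeline uses Milvus for approximate nearest-neighbor search. The Euclidean metric is likewise an assumption.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple


def toy_encoder(smiles: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for a CVAE encoder.

    Counts character bigrams of the SMILES string into `dim` hashed bins.
    A real encoder would return the latent vector of the trained model.
    """
    vec = [0.0] * dim
    padded = f"^{smiles}$"  # mark string start and end
    for a, b in zip(padded, padded[1:]):
        vec[(ord(a) * 31 + ord(b)) % dim] += 1.0
    return vec


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class LatentIndex:
    """Exact in-memory nearest-neighbor index (assumed stand-in for a vector database)."""

    def __init__(self, encoder: Callable[[str], List[float]]) -> None:
        self.encoder = encoder
        self.ids: List[str] = []
        self.vectors: List[List[float]] = []

    def add(self, smiles_list: Sequence[str]) -> None:
        """Embed each catalog molecule and store its vector."""
        for smiles in smiles_list:
            self.ids.append(smiles)
            self.vectors.append(self.encoder(smiles))

    def search(self, query: str, k: int = 3) -> List[Tuple[float, str]]:
        """Return the k catalog entries closest to the query as (distance, id) pairs."""
        q = self.encoder(query)
        scored = ((l2_distance(q, v), s) for s, v in zip(self.ids, self.vectors))
        return heapq.nsmallest(k, scored)


index = LatentIndex(toy_encoder)
index.add(["CCO", "CCN", "c1ccccc1", "CC(=O)O"])
hits = index.search("CCO", k=2)
```

Brute-force search scans every stored vector, so its cost grows linearly with catalog size; an approximate index trades a small loss in recall for much faster queries, which is the role the vector database plays at the scale of millions to billions of molecules.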
We demonstrate two complementary use cases. First, given an input molecule, the system rapidly returns synthetically accessible neighbors that respect the CVAE representation of local structure-property relationships, facilitating analogue mining. Second, starting from proposals generated by a model that is not constrained to produce synthetically accessible samples, we identify readily available nearest-neighbor surrogates that preserve latent space proximity to the original generated molecule, so that practical experiments can be conducted.
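The second use case, mapping generated molecules to purchasable surrogates, might look like the following sketch. All names here (`nearest_surrogate`, `map_to_surrogates`, the `cat-`/`gen-` identifiers) are hypothetical, the latent vectors are made-up two-dimensional values, and the optional `max_distance` cutoff is an added assumption for illustration rather than something the workflow is stated to use.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length latent vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_surrogate(
    query_vec: Sequence[float],
    catalog: Dict[str, Sequence[float]],
    max_distance: Optional[float] = None,
) -> Optional[Tuple[str, float]]:
    """Return (catalog_id, distance) of the closest catalog entry.

    Returns None if the catalog is empty or the closest entry lies
    farther away than `max_distance` (an assumed, optional cutoff).
    """
    best_id, best_dist = None, math.inf
    for cat_id, vec in catalog.items():
        dist = euclidean(query_vec, vec)
        if dist < best_dist:
            best_id, best_dist = cat_id, dist
    if best_id is None:
        return None
    if max_distance is not None and best_dist > max_distance:
        return None
    return best_id, best_dist


def map_to_surrogates(
    generated: Dict[str, Sequence[float]],
    catalog: Dict[str, Sequence[float]],
    max_distance: Optional[float] = None,
) -> Dict[str, Optional[Tuple[str, float]]]:
    """Map each generated molecule to its nearest catalog surrogate, if any."""
    return {
        gen_id: nearest_surrogate(vec, catalog, max_distance)
        for gen_id, vec in generated.items()
    }


# Toy latent vectors; real ones would come from the CVAE encoder.
catalog = {"cat-A": [0.0, 0.0], "cat-B": [1.0, 1.0]}
generated = {"gen-1": [0.1, 0.0], "gen-2": [5.0, 5.0]}
surrogates = map_to_surrogates(generated, catalog, max_distance=1.0)
```

A cutoff of this kind would let a generated molecule with no sufficiently close catalog neighbor be flagged rather than silently replaced by a distant compound.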