Cristina Cornelio, Judy Goldsmith, et al.
JAIR
Protein data is inherently multi-modal, encompassing amino acid sequences, 3D structures, and natural language descriptions found in scientific literature. Protein language models (PLMs) specialize in learning from sequences, while large language models (LLMs) are trained on large text corpora, including scientific papers. We present a cross-modal fusion method that integrates PLMs, LLMs, and 3D structure models by enhancing each model’s input with embeddings from the others. These embeddings are projected into the target model’s latent space using lightweight adapters, combined with a specialized training protocol that aligns modalities and significantly improves performance. This approach enables PLMs to leverage structural context from 3D models, boosting performance on tasks such as antibody-antigen binding prediction. Conversely, LLMs gain biochemical grounding from PLM embeddings, enhancing tasks like domain motif identification. Our integration approach is modular, architecture-agnostic, and scalable to new models and modalities. It advances unified protein representations and enables richer reasoning across molecular and textual domains.
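The adapter-based fusion described in the abstract can be pictured roughly as follows. This is a minimal PyTorch sketch assuming per-residue embeddings from both a PLM and a 3D-structure model; all class names, dimensions, and the gated additive fusion are illustrative placeholders, not the paper's implementation.

```python
# Illustrative sketch only: a lightweight adapter that projects embeddings from a
# 3D-structure model into a protein language model's latent space. All names and
# dimensions are hypothetical, not taken from the paper.
import torch
import torch.nn as nn


class CrossModalAdapter(nn.Module):
    """Projects source-modality embeddings into the target model's hidden size."""

    def __init__(self, src_dim: int, tgt_dim: int, hidden: int = 256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(src_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, tgt_dim),
        )
        self.gate = nn.Parameter(torch.zeros(1))  # start with no structural signal

    def forward(self, plm_tokens: torch.Tensor, struct_emb: torch.Tensor) -> torch.Tensor:
        # plm_tokens: (batch, seq_len, tgt_dim) residue embeddings from the PLM
        # struct_emb: (batch, seq_len, src_dim) per-residue embeddings from a 3D model
        projected = self.proj(struct_emb)  # map into the PLM's latent space
        return plm_tokens + torch.tanh(self.gate) * projected  # gated additive fusion


# Toy usage with random tensors standing in for real model outputs.
adapter = CrossModalAdapter(src_dim=384, tgt_dim=1280)
plm_tokens = torch.randn(2, 128, 1280)
struct_emb = torch.randn(2, 128, 384)
fused = adapter(plm_tokens, struct_emb)  # shape: (2, 128, 1280)
```

The zero-initialized gate is one common way to let the target model start from its original behavior and learn how much cross-modal signal to admit during the alignment phase; the paper's actual training protocol may differ.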
Erik Altman, Jovan Blanusa, et al.
NeurIPS 2023
Pavel Klavík, A. Cristiano I. Malossi, et al.
Philos. Trans. R. Soc. A
Conrad Albrecht, Jannik Schneider, et al.
CVPR 2025