Poster

Cross-modal fusion of protein language models, large language models, and 3D structural models for unified protein representations

Abstract

Protein data is inherently multi-modal, encompassing amino acid sequences, 3D structures, and natural language descriptions found in scientific literature. Protein language models (PLMs) specialize in learning from sequences, while large language models (LLMs) are trained on large text corpora, including scientific papers. We present a cross-modal fusion method that integrates PLMs, LLMs, and 3D structural models by enhancing each model’s input with embeddings from the others. These embeddings are projected into the target model’s latent space by lightweight adapters and aligned across modalities with a specialized training protocol, yielding significant performance gains. This approach enables PLMs to leverage structural context from 3D models, boosting performance on tasks such as antibody–antigen binding prediction. Conversely, LLMs gain biochemical grounding from PLM embeddings, enhancing tasks like domain motif identification. Our integration approach is modular, architecture-agnostic, and scalable to new models and modalities. It advances unified protein representations and enables richer reasoning across molecular and textual domains.
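The adapter mechanism described above can be sketched in a few lines. The following is an illustrative PyTorch sketch, not the authors' exact design: the dimensions, the two-layer projection, and the zero-initialized gate (so fusion starts as an identity on the target stream) are all assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical hidden sizes; real PLM / structure-model dims vary by model.
PLM_DIM, STRUCT_DIM = 1280, 384

class FusionAdapter(nn.Module):
    """Lightweight adapter: projects a source-modality embedding into the
    target model's latent space, then fuses it additively via a learned gate.
    Illustrative sketch only; the paper's adapter may differ."""
    def __init__(self, src_dim: int, tgt_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(src_dim, tgt_dim),
            nn.GELU(),
            nn.Linear(tgt_dim, tgt_dim),
        )
        # Zero-init gate: fusion starts as a no-op and is learned during training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, tgt_tokens: torch.Tensor, src_emb: torch.Tensor) -> torch.Tensor:
        # tgt_tokens: (batch, seq_len, tgt_dim); src_emb: (batch, src_dim)
        fused = self.proj(src_emb).unsqueeze(1)   # (batch, 1, tgt_dim)
        return tgt_tokens + self.gate * fused      # broadcast over the sequence

# Example: inject a pooled 3D-structure embedding into a PLM token stream.
adapter = FusionAdapter(STRUCT_DIM, PLM_DIM)
tokens = torch.randn(2, 100, PLM_DIM)   # PLM token embeddings for 2 proteins
struct = torch.randn(2, STRUCT_DIM)     # per-protein structure embeddings
out = adapter(tokens, struct)           # same shape as tokens: (2, 100, 1280)
```

Because the adapter only adds a small projection on top of frozen backbones, it keeps the approach modular: a new modality just needs its own adapter into each target model's latent space.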