Conference paper

Dynamic Multimodal Fusion for Robust Molecular Representation Learning

Abstract

Molecular representation learning has become a cornerstone of computational chemistry, enabling models to predict molecular properties, facilitate drug discovery, and guide material design. Traditional approaches typically rely on a single molecular modality, such as SMILES strings or molecular graphs, which limits their ability to capture the full complexity of molecular structure and behavior. While pretrained unimodal models have made progress by learning generalized representations from large-scale datasets, their effectiveness is limited by data sparsity, modality-specific biases, and an inability to fully exploit complementary information across multiple molecular views. To address these challenges, we explore a multimodal learning framework that integrates complementary molecular modalities to enhance representation quality. Conventional multimodal fusion methods, including early and late fusion, suffer from limited scalability, dependence on complete multimodal datasets, and difficulty in optimally weighting modality contributions. Naïve fusion approaches such as embedding concatenation often yield high-dimensional, redundant representations and are ill-suited to handling missing data, a common scenario in real-world chemical datasets. Advanced attention-based fusion methods, while more expressive, introduce significant computational overhead and remain sensitive to modality imbalance. In this work, we propose a dynamic multimodal fusion framework that adaptively selects and integrates the most informative features across the available modalities while remaining resilient to incomplete data. To ensure scalability and generalizability, our method leverages pretrained encoders for feature extraction and introduces a fusion mechanism that dynamically modulates intra- and inter-modal interactions.
This allows the model to suppress noise, avoid redundancy, and enhance robustness without depending on fully paired modality inputs. Our preliminary evaluations highlight the promise of dynamic, modality-aware fusion strategies in molecular machine learning, offering a flexible and scalable path forward for real-world chemical property prediction and molecular discovery. By shifting the focus from rigid fusion architectures to flexible, data-aware integration strategies, this work aims to establish a foundation for more reliable and efficient multimodal molecular representation learning.
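As a rough illustration only (not the paper's exact architecture), the core idea of modality-aware fusion with missing-data resilience can be sketched as a learned gating over whichever modality embeddings are present. All names, shapes, and the scoring design below are assumptions for exposition: pretrained encoders are presumed to produce fixed-size embeddings, and missing modalities are excluded from the softmax via a mask.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Illustrative sketch: weight each available modality embedding by a
    learned score, masking out missing modalities (design assumed, not
    the paper's implementation)."""

    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        # One scoring head per modality (hypothetical choice)
        self.scorers = nn.ModuleList(
            [nn.Linear(dim, 1) for _ in range(num_modalities)]
        )

    def forward(self, embeddings: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, num_modalities, dim), e.g. from pretrained encoders
        # mask: (batch, num_modalities), 1 if the modality is present, else 0
        scores = torch.cat(
            [scorer(embeddings[:, i]) for i, scorer in enumerate(self.scorers)],
            dim=-1,
        )  # (batch, num_modalities)
        # Exclude missing modalities so they receive zero fusion weight
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        # Weighted sum over modalities yields a single fused representation
        return (weights * embeddings).sum(dim=1)  # (batch, dim)
```

Because absent modalities are masked before the softmax, a sample with only SMILES embeddings and a sample with both SMILES and graph embeddings can be fused in the same batch without requiring fully paired inputs.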

Related Work