Dzung Phan, Vinicius Lima
INFORMS 2023
High-throughput sequencing generates vast, high-dimensional data with extreme sparsity and noise. These characteristics pose significant challenges for conventional machine learning algorithms, which struggle to extract biologically meaningful patterns for classifying host health states. Optimal transport (OT) offers a powerful framework to address these challenges: it calculates the minimum "cost" of transforming one microbial community profile into another, providing a flexible metric for comparing abundance profiles across disease states. We propose a network-informed OT approach that quantifies similarities between experimental profiles. This study systematically investigates different OT-based distance metrics, including unbalanced OT, structured OT, and Gromov-Wasserstein (GW) distance, to evaluate their effectiveness in detecting disease-associated biological changes.
We apply these methods to both synthetically generated networks and clinical datasets. To biologically inform the OT framework, we develop a custom cost function based on phylogenetic distances between features, which favors the alignment of taxa that are evolutionarily related. This approach leverages computational interaction networks to improve biological interpretability, enabling robust patient stratification in a disease-agnostic manner.
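To make the idea of a phylogeny-informed transport cost concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes a balanced, entropy-regularized OT problem solved with Sinkhorn iterations, which is a stand-in for the unbalanced, structured, and GW variants named above. The four taxa, the `phylo_dist` matrix, the abundance profiles, and the helper names `sinkhorn_plan` and `transport_cost` are all hypothetical and chosen only for illustration.

```python
import math

def sinkhorn_plan(a, b, cost, eps=0.05, n_iter=2000):
    """Entropy-regularized OT plan between histograms a and b (Sinkhorn iterations)."""
    n, m = len(a), len(b)
    # Gibbs kernel: a small cost between two taxa gives a large affinity.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows to match a and columns to match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(plan, cost):
    """Total cost of moving mass according to the plan."""
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))

# Hypothetical pairwise phylogenetic distances for four taxa A, B, C, D.
# A and B are close relatives, C and D are close relatives, and the two
# pairs are distant from each other.
phylo_dist = [
    [0.0, 0.2, 1.0, 1.0],
    [0.2, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]

# Hypothetical relative-abundance profiles (each sums to 1).
reference = [0.6, 0.4, 0.0, 0.0]
near_shift = [0.4, 0.6, 0.0, 0.0]  # 0.2 of the mass moves from A to its relative B
far_shift = [0.4, 0.4, 0.2, 0.0]   # 0.2 of the mass moves from A to the distant C

d_near = transport_cost(sinkhorn_plan(reference, near_shift, phylo_dist), phylo_dist)
d_far = transport_cost(sinkhorn_plan(reference, far_shift, phylo_dist), phylo_dist)

print(f"OT cost, shift within a clade:  {d_near:.4f}")
print(f"OT cost, shift across clades:   {d_far:.4f}")
```

Both perturbed profiles differ from the reference by the same amount of mass, so a metric that ignores phylogeny (for example, total variation) would score them identically. With the tree-based cost, the shift between close relatives is cheaper than the shift across clades, which is the behavior a phylogeny-informed cost function is meant to capture.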