Poster

Optimal Transport on structured data for improved classification of host health from omics data

Abstract

High-throughput sequencing generates vast, high-dimensional data that are extremely sparse and noisy. These characteristics pose significant challenges for conventional machine learning algorithms, which struggle to extract biologically meaningful patterns for classifying host health states. Optimal transport (OT) offers a powerful framework to address these challenges: it calculates the minimum "cost" of transforming one microbial community profile into another, providing a flexible metric for comparing abundance profiles across disease states. We propose a network-informed OT approach that quantifies similarities between experimental profiles. This study systematically investigates different OT-based distance metrics, including unbalanced OT, structured OT, and the Gromov-Wasserstein (GW) distance, to evaluate their effectiveness in detecting disease-associated biological changes.
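
As an illustration of the "minimum cost" idea, the following is a minimal sketch of exact balanced OT between two relative-abundance profiles, solved as a linear program with SciPy. It is not the implementation used in this study. The function name `ot_distance`, the three-taxon profiles, and the ground cost |i - j| are hypothetical choices made only for this example.

```python
import numpy as np
from scipy.optimize import linprog


def ot_distance(a, b, cost):
    """Exact balanced OT between profiles a and b (each must sum to 1).

    Returns the minimal transport cost and the transport plan P, where
    P[i, j] is the mass moved from feature i of a to feature j of b.
    """
    n, m = cost.shape
    # Row-marginal constraints: sum_j P[i, j] = a[i]
    row_con = np.kron(np.eye(n), np.ones((1, m)))
    # Column-marginal constraints: sum_i P[i, j] = b[j]
    col_con = np.kron(np.ones((1, n)), np.eye(m))
    res = linprog(
        cost.ravel(),                      # objective: sum_ij cost[i, j] * P[i, j]
        A_eq=np.vstack([row_con, col_con]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),                  # transported mass is non-negative
        method="highs",
    )
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x.reshape(n, m)


# Hypothetical toy profiles over three taxa (illustrative values only).
a = np.array([0.5, 0.5, 0.0])
b = np.array([0.0, 0.5, 0.5])
# Illustrative ground cost: distance between taxon indices.
idx = np.arange(3)
cost = np.abs(idx[:, None] - idx[None, :]).astype(float)

dist, plan = ot_distance(a, b, cost)
print(f"OT distance: {dist:.3f}")
```

The linear program is practical only for small numbers of features. For high-dimensional omics profiles, entropic regularisation (Sinkhorn iterations) is a common alternative, and the unbalanced variant relaxes the requirement that both profiles carry equal total mass.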

We apply these methods to synthetically generated networks as well as clinical datasets. To biologically inform the OT framework, we develop a custom cost function based on phylogenetic distances between features, which favors the alignment of evolutionarily related taxa. This approach leverages computational interaction networks to improve biological interpretability, enabling robust patient stratification in a disease-agnostic manner.
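
One way to picture a phylogenetically informed cost function is as a matrix of patristic (tree-path) distances between taxa. The sketch below shows that idea only and is not the cost function developed in this study. The toy tree, its branch lengths, and the scaling to [0, 1] are all assumptions made for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

# Hypothetical toy tree: nodes 0-3 are taxa (leaves), 4-6 are internal nodes.
# Each tuple is (parent, child, branch_length); all values are illustrative.
edges = [
    (6, 4, 1.0), (6, 5, 1.0),   # root -> two clades
    (4, 0, 0.5), (4, 1, 0.5),   # clade A: taxa 0 and 1
    (5, 2, 0.5), (5, 3, 0.5),   # clade B: taxa 2 and 3
]
n_nodes, n_taxa = 7, 4

parents, children, lengths = (np.array(x) for x in zip(*edges))
graph = csr_matrix((lengths, (parents, children)), shape=(n_nodes, n_nodes))

# Patristic distance = sum of branch lengths along the unique tree path.
all_dist = shortest_path(graph, directed=False)
phylo_cost = all_dist[:n_taxa, :n_taxa]

# Scale to [0, 1] so costs are comparable across trees (an assumed convention).
phylo_cost = phylo_cost / phylo_cost.max()
print(np.round(phylo_cost, 3))
```

Used as the ground cost in an OT problem, such a matrix makes it cheaper to move abundance between sister taxa (0 and 1 here) than between taxa in different clades (0 and 2 here).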