Leo Liberti, James Ostrowski
Journal of Global Optimization
Given a set of N sequence, the Multiple Sequence Alignment problem is to align these N sequences, possibly with gaps, that brings out the best commonality of the N sequences. The quality of the alignment is usually measured by penalizing the mis-matches and gaps, and rewarding the matches with appropriate weight functions. However for larger values of N, additional constraints are required to give meaningful alignments. We identify a user-controlled parameter, an alignment number K (2 ≤ K ≤ N): this additional requirement constrains the alignment to have at least K sequences agree on a character, whenever possible, in the alignment. We identify a natural optimization problem for this approach called the K-MSA problem. We show that the problem is MAX SNP hard. We give a natural extension of this problem that incorporates "biological relevance" by using motifs (common patterns in the sequences) and give an approximation algorithm for this problem in terms of the motifs in the data. MUSCA is an implementation of this approach and our experimental results indicate that this approach is efficient, particularly on large numbers of long sequences, and gives good alignments when tested on biological data such as DNA and protein sequences.
Leo Liberti, James Ostrowski
Journal of Global Optimization
A. Gupta, R. Gross, et al.
SPIE Advances in Semiconductors and Superconductors 1990
Andrew Skumanich
SPIE Optics Quebec 1993
Martin Charles Golumbic, Renu C. Laskar
Discrete Applied Mathematics