Analysis of biological sequences, including multiple sequence alignment, motif finding, and genome alignment, is a fundamental problem in computational biology because of its importance to wide-ranging applications such as haplotype reconstruction, sequence homology, phylogenetic analysis, and prediction of evolutionary origins. Most formulations of sequence analysis problems, particularly those related to alignment, are considered NP-hard. Existing solutions to the sequence alignment problem, both sequential and parallel, are limited in their applicability and perform poorly on large data sets. Moreover, most of these solutions were designed for aligning short sequences. The genome alignment problem, which involves very long sequences, is significantly harder: very few solutions are capable of constructing genomes from short reads, and those that exist require a significant amount of execution time. This project deals with the design and development of high-performance algorithms and implementations for aligning genomes using innovative sampling and domain decomposition strategies. This approach has not previously been pursued for genome alignment. The proposed algorithms are implemented on hybrid computing platforms consisting of multicore clusters and GPUs.
This project brings together tools and applications from multiple disciplines, including bioinformatics, computational biology, statistics, and high-performance computing. The findings will therefore introduce new tools for biology and biomedical applications. The project will facilitate rapid reconstruction of genomes and mapping of short reads to the corresponding haplotypes.