The project will develop rigorous and fast (practical) graph-theoretic genome-wide algorithms associated with genome sequence data and graphs within an integrated set of research and educational activities designed to intertwine synergistically with statistical modeling; apply these to the solution of real-world problems and long-standing computer science and mathematics questions that impact molecular biology; and make these techniques accessible to students, researchers, and practitioners in the corresponding fields. The project focuses on algorithms for haplotype reconstruction from genome sequence data of various organisms and genomes having different polyploidy number, i.e., number of haplotypes. The mammalian genomes, human included, are diploid genomes. However, the polyploidy number varies across the life spectrum from 1 (bacteria) to more than 100 (adder's-tongue fern). The HapCompass graph-theoretic framework associates genome sequencing read mappings with the genome-wide SNP data in the human genome.The HapCompass algorithm uses local optimization algorithms on the spanning tree cycle basis of HapCompass graphs for objective functions that model various measures of error correction in haplotype reconstruction. Earlier work provided a combinatorial framework for error-correction models for diploid haplotype assembly, a framework embraced by the large literature on the topic in the following decade. The fundamental case of diploid genomes and the HapCompass graph theory, data structures and algorithms set the stage for generalizations to polyploidy and integrative genome-wide haplotype reconstruction in samples of individuals that share haplotypic regions identical by descent. In addition, curriculum development is plan that covers the topics from both a computing and the user perspective. All materials and software, source code, and documentation will be available. Software will rely on the open-source model.
This project intertwines computer science with statistical models and an impact in genomics and molecular biology. Faster and more accurate algorithms for diploid haplotype assembly based on multi-criteria optimization using multiple models of error correction on the spanning-tree cycle bases of compass graphs will be developed. In addition, robust generalizations of the graph theory and algorithms for polyploid genomes that permit a common algorithmic strategy are planned. The team also is developing an optimal linear time algorithm for shared IBD tract identification in the case of phased genotype data, and efficient and exact algorithms for IBD haplotype tract identification for shared IBD tracts in unphased genotype data that are linear in the number of haplotypes and subquadratic in the number of genotypes to enhance efficiency. The final product of the work will be an algorithmic framework for genome-wide haplotype reconstruction created by combining the new algorithms and Clark Consistency Graph data structures and algorithms.