High-throughput sequencing is transforming the field of population genomics. The cost of sequencing an individual's genome has dropped by several orders of magnitude during the past decade, and may reach the so-called $1,000 genome target within the next few years. Much attention has now shifted to sequencing on a population scale, and large amounts of population sequencing data have already been generated. There is therefore an urgent need for new computational methods that work with noisy, high-throughput sequencing data and provide efficient and accurate analysis for important population genomics problems.
The intellectual merits of the work include the development of accurate computational methods capable of analyzing large-scale high-throughput sequencing data for several population genomics problems. Problems of interest include inferring genotypes, correcting sequencing errors, and detecting meiotic recombination, as well as searching for disease-causing rare gene variants and other emerging applications of high-throughput sequencing. A key difference between the proposed research and many existing methods is that the proposed approaches are explicitly designed for processing large amounts of high-throughput sequencing data. One particular focus is on applying combinatorial optimization techniques such as integer linear programming, which is not well known to biologists. Probabilistic models will also be used and integrated with optimization approaches to provide efficient and accurate solutions. The expected project outcomes include efficient algorithms for the above population genomics problems, related open-source software tools, and rigorous methodologies for both theoretical and empirical evaluation of the algorithms.
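As a purely illustrative sketch of what probabilistic genotype inference from noisy reads can look like, the following Python example calls a genotype at a single biallelic site from read counts. It uses a simple textbook-style binomial model and is not the method proposed in this project. The function name `call_genotype`, the default `error_rate` of 0.01, and the assumption that every read carries the same independent error rate are all hypothetical choices made for the example.

```python
import math


def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Pick the most likely genotype (0, 1, or 2 alternate alleles) at one site.

    Assumption: each read is an independent draw, and a read shows the
    alternate allele with probability error_rate (genotype 0), 0.5
    (genotype 1), or 1 - error_rate (genotype 2).
    """
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must lie strictly between 0 and 1")

    # Probability that a single read shows the alternate allele, per genotype.
    alt_prob = {0: error_rate, 1: 0.5, 2: 1.0 - error_rate}

    # Log-likelihood of the observed counts; the binomial coefficient is the
    # same for every genotype, so it is omitted.
    log_likelihood = {
        genotype: alt_count * math.log(p) + ref_count * math.log(1.0 - p)
        for genotype, p in alt_prob.items()
    }

    best = max(log_likelihood, key=log_likelihood.get)
    return best, log_likelihood


if __name__ == "__main__":
    for ref_reads, alt_reads in [(10, 0), (5, 5), (0, 8)]:
        genotype, _ = call_genotype(ref_reads, alt_reads)
        print(ref_reads, alt_reads, genotype)
```

A per-site model of this kind ignores information shared across individuals and neighboring sites, which is one reason population-scale approaches are of interest.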
Part of the contribution of this work to computer science is that the study of algorithms for handling short sequencing reads may advance research on string matching algorithms, a problem of general interest in computer science. Noisy sequencing data naturally motivates approximate string matching and may lead to new string-based problem formulations. Because efficiency is essential, algorithmic string processing techniques may play an important role in the proposed research. Other aspects of the proposed work are related to phylogenetic problems, which have been actively studied in computer science. Theoretical study of these algorithmic problems will be conducted to obtain rigorous results that may be of interest to the computer science research community.
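To make the link between noisy reads and approximate string matching concrete, here is a minimal Python sketch that finds every position where a short read matches a reference sequence with at most a given number of substitutions. It is a naive scan under the Hamming-distance (substitution-only) model, chosen for clarity rather than efficiency, and it does not represent the algorithms to be developed in this project. The function name `find_approximate_matches` and the example sequences are hypothetical.

```python
def find_approximate_matches(reference, read, max_mismatches):
    """Return (start, mismatches) for each placement of `read` on `reference`
    that differs in at most `max_mismatches` positions.

    Assumption: only substitutions are allowed (no insertions or deletions).
    """
    hits = []
    read_length = len(read)
    for start in range(len(reference) - read_length + 1):
        mismatches = 0
        for offset in range(read_length):
            if reference[start + offset] != read[offset]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, abandon this placement
        else:
            # The inner loop finished without exceeding the mismatch budget.
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    reference = "ACGTACGTTACG"
    read = "ACGA"  # differs from "ACGT" by one substitution
    print(find_approximate_matches(reference, read, 1))
```

The scan takes time proportional to the product of the reference and read lengths in the worst case, which is why indexed string processing techniques matter for data at sequencing scale.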
The broader impacts of the project include interdisciplinary collaboration and training, as well as educational impacts. The developed software tools will be made freely available to the multidisciplinary research community, and are expected to enable novel biological applications of high-throughput sequencing. The PI will develop an interdisciplinary undergraduate and graduate educational curriculum at the University of Connecticut. The proposed educational and outreach activities include reaching out to students with various backgrounds and training future researchers with unique interdisciplinary skills.