High-throughput, deep-coverage sequencing enabled by next-generation DNA sequencing systems is revolutionizing life sciences research. However, the prevalence of sequencing errors produced by this technology hinders realization of the full potential of the data, particularly in mixed samples of unknown origin. The overarching goal of the proposed project is to address a class of next-generation sequencing problems that arise in bio-threat detection by developing mathematically and statistically rigorous models to interpret, assemble, and analyze sequencing data, along with computationally efficient algorithms to solve these problems. The researchers propose a mathematical framework in which the original sequence is observed with errors, and will pursue the following specific objectives: 1) Develop generalized hidden Markov models for sequencers and efficient algorithms for parameter estimation. 2) Develop fast and accurate error correction schemes for the reads, jointly with accurate genome assembly, so that the two mutually reinforce each other. 3) Develop methods for identifying bio-threats within a sample mixture via metagenomic analysis. 4) Develop methods for identifying genomic variations in a bio-threat organism with respect to a reference genome. Deep coverage provides redundancy that allows inference along with confidence assessment, so that true variation can be distinguished from sequencing error, which will be essential for novel bio-threat detection.
A novel bio-threat will likely first be detected as a never-before-seen genome in a patient or environmental sample that contains much other genetic material. We may know nothing about the type of organism involved, yet it may be highly similar to other organisms that are not bio-threats. Next-generation sequencing makes it possible to quickly read the genetic content of samples, but the mathematical models and algorithms needed to process the data lag far behind. In particular, next-generation sequencing machines tend to introduce errors at a high rate, producing apparent genetic variation that is very difficult to distinguish from the true variation of a novel bio-threat. The PIs propose to build rigorous statistical models and rapid algorithms to detect and correct errors while reconstructing the genomes contained in the sample. Error correction is needed not only to reconstruct the true genomes but also to speed up the reconstruction algorithms, and it will play a fundamental role in quickly producing accurate, complete sequences.
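To illustrate how deep coverage can separate true variation from sequencing error, the following is a minimal sketch and not the PIs' proposed model (which is based on generalized hidden Markov models). It assumes, purely for illustration, that each read covering a site is independently miscalled with a fixed probability; the function names, the 1% error rate, and the significance threshold are all hypothetical.

```python
from math import comb


def error_tail_probability(depth, alt_count, error_rate):
    """P(X >= alt_count) for X ~ Binomial(depth, error_rate).

    Probability of seeing at least `alt_count` non-reference reads at a
    site covered by `depth` reads if every mismatch were a sequencing
    error. Assumes independent errors at a fixed rate (a simplification).
    """
    return sum(
        comb(depth, k) * error_rate**k * (1 - error_rate) ** (depth - k)
        for k in range(alt_count, depth + 1)
    )


def looks_like_true_variant(depth, alt_count, error_rate=0.01, alpha=1e-6):
    """Flag a site when errors alone are an implausible explanation.

    `error_rate` and `alpha` are illustrative defaults, not values taken
    from the project description.
    """
    return error_tail_probability(depth, alt_count, error_rate) < alpha


# Shallow coverage: 2 of 3 reads disagree with the reference.
print(looks_like_true_variant(depth=3, alt_count=2))
# Deep coverage: 30 of 100 reads disagree with the reference.
print(looks_like_true_variant(depth=100, alt_count=30))
```

Under these assumptions, the tail probability also serves as a rough confidence measure: the smaller it is, the less plausible it is that the observed mismatches arose from sequencing error alone.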