DNA sequencing has evolved rapidly from Sanger sequencing (reads of 700-1000 base pairs, or bp), through massively parallel high-throughput sequencing of short reads (100-200 bp), to recent technologies that generate long (> 5,000 bp) and ultra-long (> 50,000 bp) reads. There is an urgent need for efficient algorithms to analyze long-read datasets across the many biological applications they enable. Long-read technologies have high error rates, but their more favorable error distribution characteristics often permit probabilistic guarantees on the quality of results. This project will produce fundamental research advances in bioinformatics algorithms for long-read sequencing, along with distributable open-source software to facilitate their immediate adoption by the life sciences community. The award will also support interdisciplinary training and undergraduate participation in research.

The project seeks to advance mapping, assembly, and biological applications of long-read sequencing through the design of provably efficient algorithms, formal characterization of the quality of results, and the development of methods that scale to larger datasets and remain robust as long-read sequencing technologies continue to evolve. Problems addressed include (i) split-read mapping of ultra-long reads to a reference genome, (ii) development of data structures based on Bloom filters to achieve space optimization and perfect statistical sensitivity, (iii) algorithms for mapping long reads to a collection of reference genomes represented by compact graph-based structures, (iv) algorithms for partitioning long reads to facilitate identification of haplotypes in diploid assemblies, and (v) long-read-inspired alignment-free algorithms for genome-to-genome comparison, as well as important biological applications enabled by these advances. The research will emphasize the use of real datasets, relevance to problems and applications encountered in practice, and independent validation by collaborators and other experts.
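As a rough illustration of item (ii), the sketch below shows a generic Bloom filter that stores the k-mers of a sequence. It is not the project's data structure: the class name `KmerBloomFilter`, the helper `kmers`, the filter size, the number of hash functions, and the choice of BLAKE2b as the hash are all assumptions made for this example. It demonstrates only the standard Bloom filter property that an inserted item is never reported absent, while an item that was never inserted may occasionally be reported present.

```python
import hashlib


class KmerBloomFilter:
    """Toy Bloom filter over k-mers (illustrative; parameters are assumptions)."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer):
        # Set every bit position associated with this k-mer.
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # True if all positions are set; false positives are possible,
        # false negatives are not.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(kmer)
        )


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"  # made-up sequence for demonstration
    bf = KmerBloomFilter()
    for km in kmers(reference, 4):
        bf.add(km)
    # Every inserted k-mer is guaranteed to be found.
    print(all(km in bf for km in kmers(reference, 4)))
```

The memory footprint is fixed by `num_bits` regardless of how many k-mers are inserted, which is the space trade-off that makes such filters attractive for large sequence collections. The false-positive rate grows as more k-mers are added.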

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Budget Start: 2018-10-01
Budget End: 2021-09-30
Fiscal Year: 2018
Total Cost: $424,992
Name: Georgia Tech Research Corporation
City: Atlanta
State: GA
Country: United States
Zip Code: 30332