Our understanding of how biological systems work is increasingly fueled by data from DNA sequencers. Sequencing has improved dramatically over the past several years, but the datasets produced by sequencers are unwieldy and difficult to interpret. This is especially true when the genome being studied contains many repeated stretches of DNA, as is the case for most mammals and plants. The goal of this project is to develop improved computational and statistical methods for analyzing DNA sequencing data, providing faster, more accurate, and more interpretable results to scientists studying organisms with repetitive genomes. These methods will be implemented as open source software tools made freely available to the research community. A successful project will result in these tools being widely adopted in the biological research community. Repetitive sequences are implicated in cellular regulation processes and associated with human disease. The integrated education plan also seeks to improve software for analyzing sequencing data by teaching computer science students the complete set of skills needed to make usable genomics software in the era of big data genomics. The PI will develop an undergraduate minor in computational biology, a graduate class covering methods for analyzing large sequencing datasets, and a competitive class project. A successful effort will result in more trained computer scientists joining and contributing to the study of computational biology and genomics.

The genomes of plants, mammals and other higher eukaryotes contain many repeated DNA sequences. 80% of the maize genome, for example, is covered by repetitive stretches of DNA. At the same time, computational tools typically model DNA as a string. This has advantages; it allows these tools to borrow methods developed for other strings, such as books and web pages, and apply them to DNA. But for repetitive genomes, the string abstraction fails to capture the prevalence of repeated DNA sequences related to each other through evolution. The PI proposes a broad research agenda based on the idea that analyzing sequencing data derived from repetitive genomes requires special, repeat-aware computational methods. The first project explores accurate and efficient methods for aligning sequence reads to repeat families. The PI proposes methods that exploit similarities between alignment problems to yield solutions that are more accurate than current approaches. The second project explores novel methods for predicting the probability that an alignment reported by a read aligner is correct, i.e., that the aligner correctly identified the read's point of origin. Downstream analysis tools use this quantity to weigh their confidence in evidence derived from the alignment. But estimating this quantity accurately is difficult, and there are no widely applicable approaches available now. The PI proposes a tandem simulation approach, whereby a simulator mimicking properties of a real dataset can provide training examples that in turn allows us to accurately predict these probabilities for real data. These methods address major deficiencies in everyday common genomics analyses, which are made slower and less accurate by repetitive DNA.

The PI will also conduct an integrated set of curriculum building and outreach efforts. These have the goal of bringing computational biology to the attention of more students earlier in their training, and to provide graduate and upper undergraduate students with a strong computational biology curriculum. Specifically, the PI will develop and implement an undergraduate minor in computational biology at Johns Hopkins University. Second, the PI will design a new graduate-level course covering contemporary methods for analyzing very large collections of sequence data. Finally, the PI will develop a competitive project called the Big Sequence Data Pentathlon that tests students' ability to design scalable, usable genomics analysis tools on a parallel computer system.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
1349906
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
2014-05-15
Budget End
2019-04-30
Support Year
Fiscal Year
2013
Total Cost
$545,881
Indirect Cost
Name
Johns Hopkins University
Department
Type
DUNS #
City
Baltimore
State
MD
Country
United States
Zip Code
21218