Next-generation sequencing (NGS), which samples millions of short DNA sequences from a genome, has revolutionized genomics. One area of particular importance is the reconstruction of genomes (haplotypes) from a viral population, a fundamental problem in virology, evolutionary biology, and human health. Although several methods have been developed to take advantage of NGS data, they are limited to populations for which a reference genome is available. This excludes many important cases, such as RNA viruses or certain HIV/HCV viral populations, in which the haplotypes are so divergent that a reference is meaningless. Moreover, most algorithms are not robust to recombination, which is common in many viral populations.

Achieving this project's aims will allow the full potential of NGS data to be realized in virology. In particular, it will advance the understanding of viral population dynamics, give biologists powerful tools to understand disease progression, and enable novel treatment and prevention strategies. The algorithms and software developed will be made freely available through software-sharing platforms such as GitHub or Galaxy. The PIs will offer a strong educational component including (a) graduate and undergraduate classes that use the output of the proposed research, and (b) development of a seminar series. The PIs will also (a) train future generations of scientists and engineers to enhance and use bioinformatic/genomic cyber resources, and (b) facilitate creative, cyber-enabled, boundary-crossing collaborations, including those with industry and international dimensions, to advance the frontiers of science and engineering and broaden participation in STEM fields.

This project's aim is to develop probabilistic De Bruijn graphs, and network flow methods on such graphs, for the reconstruction of viral populations when a reference is not available. Given NGS data, the algorithms should determine the number, sequences, and relative frequencies of the haplotypes. The proposed algorithms are based on a unique combination of established techniques (e.g., maximum likelihood, expectation-maximization, clustering, Lander-Waterman statistics) with novel propositions for probabilistic De Bruijn graphs, machine learning, and network flows that are of interest in other applications. The PI and Co-PIs have complementary backgrounds in virology, machine learning, network flow, and genome reconstruction problems.
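To make the central data structure concrete, the following is a minimal sketch of a count-weighted De Bruijn graph built from short reads. It is not the project's algorithm: the function names `build_de_bruijn` and `edge_probabilities`, the choice of k, and the toy reads are all illustrative assumptions, and normalizing outgoing edge counts is only one simple way a graph could be given "probabilistic" edge weights.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Count every k-mer in the reads as an edge (prefix -> suffix).

    Nodes are (k-1)-mers; the count on an edge is how often its k-mer occurs.
    Hypothetical helper, written for illustration only.
    """
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


def edge_probabilities(edges):
    """Normalize the counts of each node's outgoing edges so they sum to 1.

    Assumption: branch weights are plain relative frequencies, with no
    error model or coverage correction.
    """
    out_total = defaultdict(int)
    for (src, _dst), count in edges.items():
        out_total[src] += count
    return {edge: count / out_total[edge[0]] for edge, count in edges.items()}


# Toy input: three reads from one haplotype, one read from a variant.
reads = ["ACGTA", "ACGTA", "ACGTA", "ACGCA"]
edges = build_de_bruijn(reads, k=3)
probs = edge_probabilities(edges)
```

In this toy example the graph branches at node `CG`, and the branch weights reflect how many reads support each variant. Estimating actual haplotype sequences and frequencies would require further steps, such as the flow decomposition and likelihood-based estimation the abstract mentions.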

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
1421908
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
2014-09-01
Budget End
2017-05-31
Support Year
Fiscal Year
2014
Total Cost
$500,000
Indirect Cost
Name
Pennsylvania State University
Department
Type
DUNS #
City
University Park
State
PA
Country
United States
Zip Code
16802