Over the past decade, there has been an effort to sequence and compare the DNA of large numbers of individuals of a given species, resulting in not just a single reference genome but a population of genomes for that species. Enormous public datasets are now available, including the 1000 Genomes Project, the 100,000 Genomes Project, the 1001 Arabidopsis Genomes Project, the Rice Genome Annotation Project, and the Bird 10,000 Genomes (B10K) Project. Key software tools, called short read aligners, align newly sequenced DNA fragments to one or more reference genomes in order to identify genetic variation within the species. Downstream analysis of this genetic variation is used to find causal relationships between genetic variants and complex diseases and other phenotypes. Existing short read aligners cannot align to a large number of reference genomes, purely because of computational constraints. In practice, aligners therefore use only a small number of reference genomes, which keeps memory and running time manageable. Unfortunately, although individuals of the same species share a large percentage of their genetic sequence, the differences are also important, and aligning to only a small number of genomes can leave some DNA fragments unaligned or poorly aligned. This, in turn, makes it more challenging to find genetic variation between the newly sequenced DNA fragments and the reference genome(s). One way to overcome this challenge is to develop new algorithms and data structures for short read alignment that reduce the computational resources required. This project realizes this vision by developing a novel representation of a population of genomes and creating the algorithms and data structures needed to build, store, and update it. Integrated into this project are the goals of advancing biological science and knowledge of model species and furthering the development of an outreach program that supports first-generation university graduates.
An immediate outcome of the work will be research opportunities for under-served students through the Machen Florida Opportunity Scholars program, an organization that aims to foster the success of first-generation university scholars.

Short read aligners first build an index from one or more reference genomes and subsequently use it to find and extend matched subsequences between sequence reads and the reference(s). The bottleneck in using these read aligners to index thousands of genomes is the space and time needed to construct and store the index. To address the shortcomings associated with using a single reference genome, the concept of graph-based pangenomic aligners has been introduced and widely discussed in the community. Although such methods have been shown to improve accuracy over standard sequence-based aligners, their use has not been fully explored. The challenge that prevents the realization of pangenomic graph alignment is scalability. The goal of the project is to develop algorithms that allow for the construction of a pangenomic reference from datasets gathered from large populations. To achieve this goal, novel means to build, compress, and update a graph that encapsulates the variation found in the population will be created and implemented. This work will require further advancements that have impact beyond the stated application. More specifically, it is unknown how to merge r-indexes, represent a graph model of references in sublinear space, or represent the graph using the r-index. This project will address these open problems and, more broadly, connect two areas of research: succinct data structures and pangenomics. The project will also narrow the gap between compression and mutability. The research community has struggled to balance the two, since highly compressed data structures cannot be altered without being rebuilt. This poses undue constraints when applying these structures to biological datasets that are routinely updated with new data.
This project will make significant developments in this area by developing mutable compressed data structures for the realization of the pangenomic index. Project website: www.christinaboucher.com/pangenomics-iibr
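The abstract does not describe its data structures in detail. As background on why an index such as the r-index suits a population of similar genomes: its space grows with the number of runs r of equal characters in the Burrows-Wheeler transform (BWT) of the text rather than with the text length n. The following is a toy sketch of that effect, not the project's method. The function names, the simulated population (one random genome plus copies carrying a single substitution each), and the naive rotation-sorting construction are all assumptions made for illustration.

```python
from __future__ import annotations

import random


def bwt(text: str) -> str:
    """Burrows-Wheeler transform by sorting all rotations.

    Quadratic in memory, so only suitable for toy inputs; practical
    indexes use far more efficient construction algorithms.
    """
    if "$" in text:
        raise ValueError("text must not contain the terminator '$'")
    s = text + "$"  # '$' sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def toy_population(length: int, copies: int, seed: int = 0) -> list[str]:
    """Hypothetical population: a random genome plus copies with one SNP each."""
    rng = random.Random(seed)
    base = [rng.choice("ACGT") for _ in range(length)]
    genomes = ["".join(base)]
    for _ in range(copies - 1):
        variant = base[:]
        pos = rng.randrange(length)
        # Substitute a base that differs from the current one.
        variant[pos] = rng.choice([b for b in "ACGT" if b != variant[pos]])
        genomes.append("".join(variant))
    return genomes


def length_and_runs(genomes: list[str]) -> tuple[int, int]:
    """Return (n, r) for the BWT of the concatenated genomes.

    Concatenating without separators is a simplification for this sketch.
    """
    transformed = bwt("".join(genomes))
    return len(transformed), count_runs(transformed)


if __name__ == "__main__":
    for copies in (1, 5, 10):
        n, r = length_and_runs(toy_population(200, copies))
        print(f"{copies:2d} genomes: n = {n:5d}, BWT runs r = {r}")
```

Running the script prints n and r as genomes are added. In this simulated setting n grows linearly with the number of genomes, while r grows only with the amount of new variation, which is the property that run-length-compressed indexes exploit.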

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Type: Standard Grant (Standard)
Application #: 2029552
Program Officer: Jean Gao
Project Start:
Project End:
Budget Start: 2020-09-01
Budget End: 2023-08-31
Support Year:
Fiscal Year: 2020
Total Cost: $700,361
Indirect Cost:
Name: University of Florida
Department:
Type:
DUNS #:
City: Gainesville
State: FL
Country: United States
Zip Code: 32611