As DNA sequencing continues to improve in quality and decrease in cost, the volume of genomic data continues to increase at a geometric rate. In particular, multiple genomes from the same or related species are actively being acquired; this has led to the new research field of pangenomics. Pangenomics sheds new light on how organisms adapt to their environment and on which genomic features vary or stay the same within related species. This project addresses some of the computational challenges of efficiently storing and querying large pan-genomic data sets. New and enhanced data structures will be developed that efficiently describe the genomes of closely related lines or species and permit fast information retrieval for common activities such as sequence searching. The investigators will develop software tools that will be made freely available to the general scientific community, and will test and validate these tools using several important species, including Saccharomyces cerevisiae (a yeast species), Arabidopsis thaliana (thale cress, a small flowering plant) and Medicago truncatula (barrel clover, a small legume).

This project will initiate work on a next generation of bioinformatics algorithms that can exploit the increased information content available from multiple accessions and intelligently use the data for unbiased, species-wide analyses. The current trajectory of next-generation sequencing improvements, including falling costs and increased read lengths and throughput, ensures that multiple genomes per species will be routine within the next decade. The researchers will investigate improvements to graph data structures used to represent pan-genomes, such as de Bruijn and string graphs, and will develop associated algorithms for querying pan-genomic data, building on data structures such as the FM-index. Saccharomyces cerevisiae, Arabidopsis thaliana and Medicago truncatula will be the principal model organisms for this study, listed in order of increasing complexity. Multiple genomes of each of these organisms have been sequenced, so they are good candidates for pangenomic study. In particular, this project will complement existing work on symbiotic relationships of M. truncatula. This work will also shed light on the impact of reference bias in genomic analysis and provide alternative routes to avoid it. It will provide a practical method for pan-genomic sampling that yields a single FASTA sequence that minimizes reference bias in downstream analysis. This research will lead to new software tools that will be made available to the community and will be used to engage students in bioinformatics research and educational activities.
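
As a rough illustration of the kind of graph structure mentioned above, the following minimal sketch builds a node-centric de Bruijn graph over several genomes and records, for each k-mer, which genomes contain it (often called a "colored" de Bruijn graph). It is not the project's method: the function names, the toy sequences `line_A` and `line_B`, and the choice of `k = 4` are all assumptions made for illustration, and a practical tool would use a compressed index rather than Python dictionaries.

```python
from collections import defaultdict

def build_pan_dbg(genomes, k):
    """Map each k-mer to the set of genome names that contain it."""
    kmer_colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer_colors[seq[i:i + k]].add(name)
    return kmer_colors

def successors(kmer_colors, kmer):
    """Return the k-mers that follow `kmer` via a (k-1)-base overlap."""
    suffix = kmer[1:]
    return sorted(suffix + base for base in "ACGT" if suffix + base in kmer_colors)

def genomes_supporting(kmer_colors, query, k):
    """Return genomes that contain every k-mer of `query`.

    This is a necessary condition for a genome to contain `query`,
    not a proof of an exact match.
    """
    if len(query) < k:
        raise ValueError("query must be at least k bases long")
    shared = None
    for i in range(len(query) - k + 1):
        colors = kmer_colors.get(query[i:i + k], set())
        shared = set(colors) if shared is None else shared & colors
        if not shared:
            break
    return shared

# Hypothetical toy data: two lines that differ by a single base.
genomes = {"line_A": "ACGTACGGA", "line_B": "ACGTTCGGA"}
graph = build_pan_dbg(genomes, k=4)

print(successors(graph, "ACGT"))                 # the graph branches at the variant
print(genomes_supporting(graph, "GTACGG", k=4))  # k-mers found only in line_A
print(genomes_supporting(graph, "ACGT", k=4))    # k-mer shared by both lines
```

In this toy case the two lines share the k-mers on either side of the variant, so the graph forms a small "bubble" where they diverge; querying against the shared graph rather than a single reference sequence is one way to avoid favoring either line.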

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Type: Standard Grant (Standard)
Application #: 1542262
Program Officer: Peter McCartney
Project Start:
Project End:
Budget Start: 2015-07-15
Budget End: 2018-06-30
Support Year:
Fiscal Year: 2015
Total Cost: $240,526
Indirect Cost:

Name: Montana State University
Department:
Type:
DUNS #:
City: Bozeman
State: MT
Country: United States
Zip Code: 59717