This project develops new software tools for pangenomic analysis, a relatively new area of genomic research that studies large numbers of genome sequences from multiple organisms to understand how organisms adapt their genomes to their environments. As the cost of DNA sequencing continues to decrease, it is now routine for multiple genomes per species to be available for analysis, giving much more information about the species. The approach makes use of a graph-based representation of a pangenome and exploits this representation to efficiently find both shared and unique regions of interest across genomes. Each individual's genomic sequence corresponds to a path in a graph data structure called a De Bruijn graph; these graphs are large and can have millions of nodes and edges. The tools being developed are based on finding frequented regions (FRs) in De Bruijn graphs; these regions are hotspots that often represent features of interest in one or more genomes. Algorithms and software tools will be made available to the greater scientific community to facilitate new pangenomics research. The project will provide support and training for a postdoc and an incoming PhD student at Montana State University (MSU). It will also support a summer intern at the National Center for Genome Resources (NCGR) in the last two years of the project. Aspects of the project will be incorporated into undergraduate and graduate courses at MSU, as well as integrated into several outreach and training activities at NCGR. In addition, MSU has several programs in place to serve American Indian students, and the PIs will actively recruit from and engage this community.
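The graph representation described above can be illustrated with a minimal Python sketch. It is not the project's software or its frequented-region algorithm. It assumes a node-centric De Bruijn graph in which nodes are k-mers and each genome is stored as the list of k-mers it visits. The function names (`build_pangenome_graph`, `shared_and_unique`), the toy sequences, and the choice of k = 3 are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_pangenome_graph(genomes, k):
    """Build a toy node-centric De Bruijn graph from named sequences.

    Nodes are k-mers; an edge joins two k-mers that are consecutive
    in at least one genome. Each genome is recorded as a path.
    """
    edges = defaultdict(set)    # k-mer -> set of successor k-mers
    paths = {}                  # genome name -> list of k-mers (its path)
    support = defaultdict(set)  # k-mer -> names of genomes that visit it
    for name, seq in genomes.items():
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        paths[name] = kmers
        for kmer in kmers:
            support[kmer].add(name)
        for a, b in zip(kmers, kmers[1:]):
            edges[a].add(b)
    return edges, paths, support


def shared_and_unique(support, n_genomes):
    """Split nodes into those visited by every genome and by exactly one."""
    shared = {node for node, names in support.items() if len(names) == n_genomes}
    unique = {node for node, names in support.items() if len(names) == 1}
    return shared, unique


# Two made-up sequences that differ in a short middle stretch.
genomes = {
    "g1": "ACGTACGGA",
    "g2": "ACGTTCGGA",
}

edges, paths, support = build_pangenome_graph(genomes, k=3)
shared, unique = shared_and_unique(support, len(genomes))

print("shared nodes:", sorted(shared))
print("unique nodes:", sorted(unique))
```

Counting how many genome paths pass through each node is only a crude stand-in for the idea of a hotspot. A frequented region as described in the abstract concerns regions of the graph, not single nodes, and real pangenome graphs need far more compact data structures than the dictionaries used here.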
The current trajectory of next-generation sequencing improvements, including falling costs and increased read lengths and throughput, ensures that multiple genomes per species will be routine within the next decade. This project initiates work on a next generation of bioinformatics software that can exploit the increased information content available from multiple accessions and intelligently use the data for unbiased, species-wide analyses. The proposed work will refine algorithms and develop software to address important problems in each of the identified areas. The research team has complementary expertise spanning molecular biology, algorithms, machine learning, and genomics research. Pangenomic biology will be advanced through automatic identification of candidate regions of interest in a pangenome. Methods will be developed to discover regions that are conserved across evolutionary space, regions that are novel, and regions that have diverged due to positive selection. Machine learning techniques will be used to search for interesting genomic regions. Lastly, this work will complement the work being done on the model plant Medicago truncatula, contributing to research on its symbiotic relationships. Results of the project can be found at: www.cs.montana.edu/pangenomics.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.