Modern DNA sequencing technologies have revolutionized the design of experiments investigating the biology of the genome and the genetic basis of traits. Arguably the most powerful application of these technologies has been the creation of exquisitely detailed catalogs describing the landscape of genetic variation in multiple species. However, discovery of genetic variation is merely the beginning. Exploration and analysis of the resulting catalogs are required to catalyze new insights into the relationship between genotype and phenotype. This proposal is motivated by two fundamental limitations inhibiting discovery from genetic variation datasets. First, existing software for mining variation to understand disease and other traits does not scale to large datasets involving thousands of samples. Second, most existing tools focus on human studies, which inhibits the application of modern DNA sequencing to genetic studies of model organisms, livestock genetics, and newly sequenced species. We propose to solve these challenges by building upon our GEMINI framework. Since 2012, we have maintained GEMINI as a powerful software framework for exploring genome variation. GEMINI's strength is that it integrates genetic variation with a diverse set of genome annotations into a database to facilitate variant prioritization. It allows researchers to conduct complex analyses with simple queries based on sample genotypes, phenotypes, inheritance patterns, and genome annotations. GEMINI has quickly become a widely used tool for rare human disease research, leading to discoveries by multiple labs, including our own. Despite its power and popularity, GEMINI has three important limitations. First, it was not designed for studies involving genetic variation from more than a few hundred samples. Second, its focus is the analysis of single-nucleotide (SNP) and insertion-deletion (INDEL) variants; it is blind to structural and copy number variation. Finally, GEMINI can only analyze genetic variation datasets for the human genome; no other species or genome builds are supported. Therefore, this proposal seeks to provide geneticists studying any species with a powerful, flexible, and simple-to-use software system that is fast and scalable enough to support genetic research for many years to come. We will do this by achieving the following Specific Aims: (1) Develop a scalable, high-performance genotype and haplotype query engine to empower large-scale genome studies. (2) Devise new methods for genotyping, integrating, and prioritizing structural variation. (3) Enable scalable, flexible genome analysis in any species and genome build. In summary, by completing these aims, the proposed research will provide geneticists studying any species with a powerful, flexible, and simple-to-use software system that is fast and scalable enough to support genetic research for many years to come.

Public Health Relevance

Arguably the most powerful application of modern DNA sequencing technologies has been the creation of exquisitely detailed catalogs that describe the landscape of genetic variation in multiple species. However, discovery of genetic variation is merely the beginning; exploration and analysis of the resulting catalogs are required to catalyze new insights into the relationship between genotype and phenotype. This proposal seeks to provide geneticists studying any species with a powerful, flexible, and simple-to-use software system that is fast and scalable enough to support genetic research for many years to come.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM124355-04
Application #
9984424
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Ravichandran, Veerasamy
Project Start
2017-08-01
Project End
2021-07-31
Budget Start
2020-08-01
Budget End
2021-07-31
Support Year
4
Fiscal Year
2020
Total Cost
Indirect Cost
Name
University of Utah
Department
Genetics
Type
Schools of Medicine
DUNS #
009095365
City
Salt Lake City
State
UT
Country
United States
Zip Code
84112
Layer, Ryan M; Pedersen, Brent S; DiSera, Tonya et al. (2018) GIGGLE: a search engine for large-scale integrated genome analysis. Nat Methods 15:123-126
Pedersen, Brent S; Quinlan, Aaron R (2018) Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics 34:867-868
Ostrander, Betsy E P; Butterfield, Russell J; Pedersen, Brent S et al. (2018) Whole-genome analysis for effective clinical diagnosis and gene discovery in early infantile epileptic encephalopathy. NPJ Genom Med 3:22
Belyeu, Jonathan R; Nicholas, Thomas J; Pedersen, Brent S et al. (2018) SV-plaudit: A cloud-based framework for manually curating thousands of structural variants. Gigascience 7:
Pedersen, Brent S; Collins, Ryan L; Talkowski, Michael E et al. (2017) Indexcov: fast coverage quality control for whole-genome sequencing. Gigascience 6:1-6