Despite decades of effort, only a small portion of the heritability of genetic disorders can currently be explained. Two explanations for this gap are that the underlying genetic variants are rare and currently unknown, and that we have a poor understanding of the impact of the variants we do have, in particular those residing outside of coding regions. Addressing these issues requires both larger cohorts and more whole-genome functional assays (e.g., RNA-seq, ChIP-seq, ATAC-seq). In recognition of this, projects like the Center for Common Genetic Disorders (CCGD), the Trans-Omics for Precision Medicine (TOPMed) Program, and ENCODE are gathering massive amounts of genetic data across many different individuals and tissues. In aggregate, these data will dramatically improve our power to understand how variation affects genomic architecture. The challenge is that these data are vast, complex, and multidimensional, and current methods cannot operate at this scale.

This proposal addresses this challenge by splitting the data into two distinct types, genotypes and genome annotations, and developing technologies that are optimized to store and search each type independently. These two highly scalable methods, which will be extremely valuable on their own, will then be integrated into a single system that enables queries across variation, gene expression, and regulation. For example, consider the question, "Are there any tissues where de novo variants in cases have a differential enrichment versus those in controls?" This question decomposes into a genotype query that produces two sets of variants, de novos in cases and de novos in controls. These sets then serve as input queries for a genome annotation search across all putative enhancers in all tissues.
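The two-stage decomposition above can be sketched in miniature. This is an illustrative toy, not the actual GQT or GIGGLE API: the hard-coded variant sets stand in for the output of a genotype query, the per-tissue interval lists stand in for an enhancer annotation index, and all names and coordinates are hypothetical.

```python
# Hypothetical sketch of the query decomposition: stage 1 yields case and
# control de novo variant sets; stage 2 searches tissue-specific enhancer
# annotations and compares enrichment. All data below is made up.

def overlaps(variant, interval):
    """True if a variant (chrom, pos) falls inside a (chrom, start, end) interval."""
    chrom, pos = variant
    ichrom, start, end = interval
    return chrom == ichrom and start <= pos < end

def enrichment(variants, enhancers):
    """Fraction of variants overlapping at least one enhancer interval."""
    if not variants:
        return 0.0
    hits = sum(any(overlaps(v, e) for e in enhancers) for v in variants)
    return hits / len(variants)

# Stage 1 (genotype query): two variant sets, hard-coded for illustration.
case_de_novos = [("chr1", 150), ("chr1", 900), ("chr2", 50)]
control_de_novos = [("chr1", 400), ("chr2", 700)]

# Stage 2 (annotation search): putative enhancers per tissue.
tissues = {
    "brain": [("chr1", 100, 200), ("chr2", 0, 100)],
    "liver": [("chr1", 350, 450)],
}

# Report tissues where case enrichment exceeds control enrichment.
for tissue, enhancers in tissues.items():
    e_case = enrichment(case_de_novos, enhancers)
    e_ctrl = enrichment(control_de_novos, enhancers)
    if e_case > e_ctrl:
        print(f"{tissue}: cases={e_case:.2f} controls={e_ctrl:.2f}")
```

At production scale the brute-force interval scan would be replaced by the compressed indexes this proposal develops; the sketch only shows the shape of the query, with a proper statistical test (e.g., Fisher's exact) substituted for the raw fraction comparison.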
This proposal builds upon both my recently published Genotype Query Tools (GQT), a method that achieved vast speedups over existing methods by operating directly on a compressed genotype index, and my past research and training in genome arithmetic algorithms, for which I have published multiple novel algorithms. Up to now I have focused on methods; while the K99 phase of this project will include method development, it will have a distinct focus on the analysis of disease cohorts. This additional training will be the foundation of an independent research program that will unlock the potential of large-scale genomic and functional data sets, enabling fast and fluid integration among phenotype, genotype, and functional data.

Public Health Relevance

Many massive, population-scale studies of genetic variation and gene expression are underway worldwide in an attempt to gain a deeper understanding of the genetic basis of common and rare diseases. While these experiments hold tremendous promise, the reality is that existing algorithms and analysis tools simply do not scale to the size and complexity of the resulting datasets. This proposal aims to develop new indexing and searching algorithms and software that empower rapid data exploration and future discovery.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Career Transition Award (K99)
Project #
1K99HG009532-01
Application #
9295640
Study Section
National Human Genome Research Institute Initial Review Group (GNOM)
Program Officer
Gilchrist, Daniel A
Project Start
2017-05-01
Project End
2019-04-30
Budget Start
2017-05-01
Budget End
2018-04-30
Support Year
1
Fiscal Year
2017
Total Cost
Indirect Cost
Name
University of Utah
Department
Genetics
Type
Schools of Medicine
DUNS #
009095365
City
Salt Lake City
State
UT
Country
United States
Zip Code
84112
Belyeu, Jonathan R; Nicholas, Thomas J; Pedersen, Brent S et al. (2018) SV-plaudit: A cloud-based framework for manually curating thousands of structural variants. Gigascience 7:
Layer, Ryan M; Pedersen, Brent S; DiSera, Tonya et al. (2018) GIGGLE: a search engine for large-scale integrated genome analysis. Nat Methods 15:123-126