Despite decades of effort, only a small portion of the heritability of genetic disorders can currently be explained. Two explanations for this gap are that the underlying genetic variants are rare and currently unknown, and that we have a poor understanding of the impact of the variants we do know about, in particular those residing outside of coding regions. Addressing these issues requires both larger cohorts and more whole-genome functional assays (e.g., RNA-seq, ChIP-seq, ATAC-seq). In recognition of this, projects like the Center for Common Genetic Disorders (CCGD), the Trans-Omics for Precision Medicine (TOPMed) Program, and ENCODE are gathering massive amounts of genetic data across many different individuals and tissues. In aggregate, these data will dramatically improve our power to understand how variation affects genomic architecture. The challenge is that these data are vast, complex, and multidimensional, and current methods cannot operate at this scale. This proposal addresses this challenge by splitting the data into two distinct types, genotypes and genome annotations, and developing technologies that are optimized to store and search each type independently. These two highly scalable methods, which will be extremely valuable on their own, will then be integrated into a single system that enables queries across variation, gene expression, and regulation. For example, consider the question, "Are there any tissues where de novo variants in cases have a differential enrichment versus those in controls?" This question is decomposed into a genotype query that produces two sets of variants: de novos in cases and de novos in controls. The sets then serve as input queries into a genome annotation search across all putative enhancers in all tissues.
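The two-stage decomposition described above can be illustrated with a small sketch. It is not the GQT index or the proposed annotation search: it uses plain linear scans over toy data, and every name in it (`de_novo_variants`, `enhancer_hits`, the sample IDs, the enhancer coordinates) is hypothetical. It also stops at raw per-tissue counts; a real enrichment analysis would follow with a statistical test.

```python
from collections import Counter

def de_novo_variants(genotypes, trios):
    """Genotype query: sites where a child carries an alternate allele
    that is absent from both parents. Genotypes are alt-allele counts."""
    found = set()
    for site, calls in genotypes.items():
        for child, father, mother in trios:
            if calls[child] > 0 and calls[father] == 0 and calls[mother] == 0:
                found.add(site)
    return found

def enhancer_hits(variants, enhancers):
    """Annotation query: per tissue, count variants that fall inside any
    putative enhancer interval (half-open [start, end))."""
    hits = Counter()
    for tissue, intervals in enhancers.items():
        for chrom, pos in variants:
            if any(c == chrom and start <= pos < end for c, start, end in intervals):
                hits[tissue] += 1
    return hits

# Toy inputs (assumed for illustration only).
genotypes = {
    ("chr1", 100): {"c1": 1, "f1": 0, "m1": 0, "c2": 0, "f2": 0, "m2": 0},
    ("chr1", 250): {"c1": 0, "f1": 0, "m1": 0, "c2": 1, "f2": 0, "m2": 0},
    ("chr2", 500): {"c1": 1, "f1": 1, "m1": 0, "c2": 0, "f2": 0, "m2": 0},  # inherited from father
    ("chr2", 900): {"c1": 1, "f1": 0, "m1": 0, "c2": 0, "f2": 0, "m2": 0},
}
case_trios = [("c1", "f1", "m1")]      # (child, father, mother)
control_trios = [("c2", "f2", "m2")]
enhancers = {
    "heart": [("chr1", 50, 150), ("chr2", 850, 950)],
    "liver": [("chr1", 200, 300)],
}

# Stage 1: the genotype query yields two variant sets.
case_set = de_novo_variants(genotypes, case_trios)
control_set = de_novo_variants(genotypes, control_trios)

# Stage 2: each set becomes the input to the annotation search.
case_hits = enhancer_hits(case_set, enhancers)
control_hits = enhancer_hits(control_set, enhancers)

for tissue in enhancers:
    print(tissue, "cases:", case_hits[tissue], "controls:", control_hits[tissue])
```

Keeping the two stages as separate functions mirrors the split between genotype data and genome annotations: each side could be replaced by a purpose-built index without changing the other.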
This proposal builds upon both my recently published Genotype Query Tools (GQT), a method that achieved vast speedups over other methods by operating directly on a compressed genotype index, and my past research and training in genome arithmetic algorithms, for which I have published multiple novel algorithms. Up to now I have focused on methods, so while the K99 phase of this project will include development, it will have a distinct focus on the analysis of disease cohorts. This additional training will be the foundation of an independent research program that will unlock the potential of large-scale genomics and functional data sets, enabling fast and fluid integration of phenotype, genotype, and functional data.

Public Health Relevance

Many massive, population-scale studies of genetic variation and gene expression are underway worldwide in an attempt to gain a deeper understanding of the genetic basis of common and rare diseases. While these experiments hold tremendous promise, the reality is that extant algorithms and analysis tools simply do not scale to the size and complexity of the resulting datasets. This proposal aims to develop new indexing and searching algorithms and software that empower rapid data exploration and future discovery.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Transition Award (R00)
Project #: 5R00HG009532-04
Application #: 9769844
Study Section: Special Emphasis Panel (NSS)
Program Officer: Sen, Shurjo Kumar
Project Start: 2018-08-20
Project End: 2021-06-30
Budget Start: 2019-07-01
Budget End: 2020-06-30
Support Year: 4
Fiscal Year: 2019
Total Cost:
Indirect Cost:
Name: University of Colorado at Boulder
Department: Miscellaneous
Type: Organized Research Units
DUNS #: 007431505
City: Boulder
State: CO
Country: United States
Zip Code: 80303