Algorithms to identify non-coding mutational burden and disease-relevant pathways

Voight, Benjamin

Abstract

Type 2 diabetes mellitus (T2D) is a disease with a complex, polygenic etiology and numerous contributing mechanisms. To devise new therapeutics to combat this epidemic, we need to identify the causal genes and variants for T2D and related cardiometabolic traits implicated by human genetic association. Recently, large-scale DNA biobanks attached to electronic health records have facilitated extensive phenotyping in surprisingly large sample sizes (>500,000 subjects), importantly in diverse ancestries. These data have enabled a dramatic expansion in the number of bona fide associations for T2D and related traits. While the increase in statistical power is certainly welcome, new opportunities to use these data require new computational methods and analytical pipelines. In this renewal, we focus on three areas for new methods development, which we will create and subsequently deploy to accelerate the genetic dissection of cardiometabolic disease. First, the number of associated loci now available permits us to learn directly from the data which non-coding sequences functionally relate to T2D risk. We propose to use machine learning techniques to make predictions for T2D and related causal traits, and to use those predictions to identify and prioritize causal variants and functional elements that are disease-predictive. A second challenge is that the quantity of these data and the pace at which they are produced are outstripping the rate at which even highly expert quantitative scientists can explore them and extract novel insights. To combat this problem, we propose to develop an informatics toolkit with apps to perform important, compute-intensive analyses and visualization with these data, tethered to cloud-based or local computation infrastructure. Finally, one key observation that follows from biobank-based data analysis is that, at each physically distinct associated locus, numerous additional conditionally independent associations segregate nearby. This series of alleles can be identified through existing methods, but their use in causal inference approaches (i.e., Mendelian Randomization) has not been extensively explored. Here, we will evaluate their utility and develop statistical pipelines to use this spectrum of variation to perform new causal inference studies.

Public Health Relevance

Identification of the causal variants and genes underlying type 2 diabetes (T2D) and related cardiometabolic trait associations is a key challenge impeding biological understanding and therapeutic development. New, large-scale data sets from DNA biobanks are poised to help overcome this challenge but require the development of new informatics and statistical methods to take full advantage of the data. In this renewal application, we will develop new methods, machine learning applications, informatics tools, and causal inference statistical approaches to pinpoint causal variants and genes contributing to T2D and related traits.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)
Type: High Priority, Short Term Project Award (R56)
Project #: 2R56DK101478-06
Application #: 10007107
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Blondel, Olivier

Project Start: 2014-09-06
Project End: 2020-08-31
Budget Start: 2019-09-21
Budget End: 2020-08-31
Support Year: 6
Fiscal Year: 2019

Algorithms to identify non-coding mutational burden and disease-relevant pathways
Voight, Benjamin Franklin
University of Pennsylvania, Philadelphia, PA, United States