Type 2 diabetes mellitus (T2D) is a disease with a complex, polygenic etiology and numerous contributing mechanisms. To devise new therapeutics to combat this epidemic, we need to identify the causal genes and variants for T2D and related cardiometabolic traits implicated by human genetic association. Recently, large-scale DNA biobanks linked to electronic health records have facilitated extensive phenotyping in surprisingly large sample sizes (>500,000 subjects), importantly in diverse ancestries. These data have enabled a dramatic expansion in the number of bona fide associations for T2D and related traits. While the increase in statistical power is certainly welcome, the new opportunities these data present require new computational methods and analytical pipelines. In this renewal, we focus on three areas of new methods development; we will create these methods and subsequently deploy them to accelerate the genetic dissection of cardiometabolic disease. First, the number of associated loci now available permits us to learn directly from the data which non-coding sequences functionally relate to T2D risk. We propose to use machine learning techniques to make predictions for T2D and related causal traits, and to use those predictions to identify and prioritize causal variants and functional elements that are disease-predictive. A second challenge is that the quantity of these data and the pace at which they are being produced are outstripping the rate at which even highly expert quantitative scientists can explore them and extract novel insights. To combat this problem, we propose to develop an informatics toolkit with apps that perform important, compute-intensive analyses and visualizations of these data, tethered to cloud-based or local computation infrastructure. Finally, one key observation that follows from biobank-based data analysis is that, at each physically distinct associated locus, numerous additional conditionally independent associations segregate nearby. This series of alleles can be identified through existing methods, but their use in causal inference approaches (i.e., Mendelian randomization) has not been extensively explored. Here, we will evaluate their utility and develop statistical pipelines that use this spectrum of variation to perform new causal inference studies.

Public Health Relevance

Identification of the causal variants and genes underlying type 2 diabetes (T2D) and related cardiometabolic trait associations is a key challenge impeding biological understanding and therapeutic development. New, large-scale data sets from DNA biobanks are poised to help overcome this challenge but require the development of new informatics and statistical methods to take full advantage of the data. In this renewal application, we will develop new methods, machine learning applications, informatics tools, and causal inference statistical approaches to pinpoint causal variants and genes contributing to T2D and related traits.

National Institutes of Health (NIH)
National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)
High Priority, Short Term Project Award (R56)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Blondel, Olivier
University of Pennsylvania
Schools of Medicine
United States