Falling costs of generating genomic data and computational advances in discerning the health-affecting variants therein are bringing personalized molecular medicine closer to reality. Progress has also been made on establishing guidelines (e.g., by the American College of Medical Genetics and Genomics) for the interpretation of sequence variants. However, the crucial step of systematically and accurately interpreting their clinical implications remains an unsolved problem. Clinical interpretation is technically challenging for several reasons, including: 1) the enormous number of variants in individual genomes, which makes it difficult to pinpoint causal variants, 2) limited functional and clinical data at the gene and variant levels, 3) the tedious, low-throughput nature of discovering novel clinical variants with traditional laboratory and clinical approaches, and 4) the insufficient precision of conventional bioinformatics tools, which are limited by their reliance on linear sequence analysis alone. As a result, clinical genomics is still far too costly for routine clinical use. To meet the urgent need for high-precision clinical variant interpretation, our proposal aims to 1) build upon existing clinical knowledge (ClinVar) from ClinGen efforts, 2) utilize rich human variation data in public databases (e.g., ExAC and dbSNP), and 3) leverage existing and upcoming sequencing data from large disease cohorts and small family studies, all in support of developing and employing a cross-cutting computational and experimental strategy for clinical variant discovery at a massive scale across a broad spectrum of human diseases. We hypothesize that variants clustering in 3D spatial proximity to known pathogenic variants have high probabilities of affecting protein function. 
We hypothesize further that many pathogenic variants in databases such as ExAC remain undetected, either because they are recessive or because their rarity limits the statistical power to detect them in association analyses. To test these hypotheses and to establish a database of functionally important variants associated with human diseases, we propose to develop a software system called ClinPath3D to detect and characterize clinically relevant pathogenic variants. It will use protein structures and variant pathogenicity potential to identify 3D spatial pathogenic variant clusters (PVCs) (Aim 1). We will then apply ClinPath3D to interpret rare variants of unknown significance (VUS) from ExAC, dbSNP, and other variant databases, using pathogenic variants obtained from ClinVar as nucleation points for clustering, all with a view toward discerning disease variants in the general population (Aim 2). Finally, we will use large sequencing data sets (CCDG, TOPMed, UK100K) to statistically assess variant enrichment in specific disease cohorts and will further substantiate positive results by experimentally characterizing 50-100 high-priority variants in kinases and 50-100 in transcription factors (Aim 3). Results from these studies will contribute to clinical advancement in two key ways: (1) methodological improvement in identifying pathogenic/functional variants in patient genomes and (2) the building of a comprehensive database of clinically relevant variants across a broad spectrum of disease types.
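The idea of using known pathogenic variants as nucleation points for 3D clustering can be illustrated with a minimal sketch. This is not the ClinPath3D algorithm, which the abstract does not specify. It only shows the proximity step, under several assumptions: each variant is represented by a single residue coordinate in angstroms, the residue labels and coordinates are invented toy data, and the 10-angstrom cutoff is an arbitrary illustrative value. The function name `cluster_around_seeds` is hypothetical.

```python
import math


def cluster_around_seeds(coords, seeds, candidates, cutoff=10.0):
    """Group candidate variants (e.g., VUS) around known pathogenic seeds.

    coords:     dict mapping a residue label to its (x, y, z) position
    seeds:      labels of known pathogenic variants (nucleation points)
    candidates: labels of variants of unknown significance
    cutoff:     maximum 3D distance for a candidate to join a seed
                (assumed value, not taken from the proposal)
    """
    clusters = {seed: [] for seed in seeds}
    for cand in candidates:
        for seed in seeds:
            # Euclidean distance in 3D space, not distance along the sequence
            d = math.dist(coords[cand], coords[seed])
            if d <= cutoff:
                clusters[seed].append((cand, round(d, 2)))
    # Report the closest candidates first within each cluster
    for members in clusters.values():
        members.sort(key=lambda pair: pair[1])
    return clusters


# Toy data: labels and coordinates are invented for illustration only
coords = {
    "R100": (0.0, 0.0, 0.0),     # known pathogenic (seed)
    "K200": (30.0, 0.0, 0.0),    # known pathogenic (seed)
    "G105": (3.0, 4.0, 0.0),     # VUS near R100
    "D210": (0.0, 0.0, 6.0),     # VUS far from R100 in sequence, close in 3D
    "P300": (60.0, 60.0, 60.0),  # VUS far from every seed
}
clusters = cluster_around_seeds(
    coords, seeds=["R100", "K200"], candidates=["G105", "D210", "P300"]
)
print(clusters)
```

In this toy example, residue D210 is distant from R100 in the linear sequence but falls inside its cluster, which is the kind of relationship that linear sequence analysis alone would miss. A real pipeline would also need to weight candidates by pathogenicity potential and to assess cluster significance, neither of which is sketched here.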

Public Health Relevance

Understanding the whole spectrum of pathogenic genetic changes in human diseases will lead to effective diagnosis and treatment strategies for each patient. Development of advanced computational approaches with high accuracy for predicting variants with pathogenic impact will be a prerequisite for successful diagnosis, treatment, and management of human diseases.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Chang, Christine Q
Washington University
Internal Medicine/Medicine
Schools of Medicine
Saint Louis
United States
Jayasinghe, Reyka G; Cao, Song; Gao, Qingsong et al. (2018) Systematic Analysis of Splice-Site-Creating Mutations in Cancer. Cell Rep 23:270-281.e3
Bailey, Matthew H; Tokheim, Collin; Porta-Pardo, Eduard et al. (2018) Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 173:371-385.e18
Huang, Kuan-Lin; Mashl, R Jay; Wu, Yige et al. (2018) Pathogenic Germline Variants in 10,389 Adult Cancers. Cell 173:355-370.e14