The overarching goal of the project is to establish a genomic medicine learning system to accelerate genomic knowledge discovery and application in electronic health records (EHRs). We will integrate deep characteristic phenotypes extracted from EHRs and evolving knowledge of genotype-phenotype associations to optimize the accuracy of variant interpretation and the cost-effectiveness of clinical genome/exome sequencing, and to accelerate the discovery of causal genes by constructing a dynamic genotype-phenotype knowledge network. Prior knowledge on phenotype-gene relationships and phenotypic information about patients can facilitate the identification of disease-causing mutations from thousands of genetic variants in the context of clinical genomic sequencing; however, how best to abstract phenotype information from notes in the EHRs of patients who are diagnosed with or evaluated for monogenic disorders, standardize the computable representation of phenotypes, and utilize it in genomic interpretation remains unclear. Additionally, how to systematically compare phenotypes across diseases to discover new knowledge in human genetics remains a largely untapped area with great promise. To address these challenges, we will develop and validate scalable and portable open-source natural language processing (NLP) methods for automated and accurate abstraction of characteristic phenotype concepts (e.g., "j-shaped sella turcica" and "short stature") from EHR narratives. We will then develop a phenotype-driven scoring system called EHR-Phenolyzer to predict the likely candidate genetic variants associated with the phenotypes for patients with genomic sequencing and a high probability of a monogenic condition. On this basis, we will develop a probabilistic disease diagnosis and knowledge discovery system using rich and deep EHR phenotypes, and evaluate these methods for genomic diagnosis and discovery using large-scale clinical exome sequencing data. Ultimately, these methods will support efficient, effective, and scalable genomic diagnostics, and facilitate the implementation of genome-guided precision medicine in clinical practice.

Public Health Relevance

We will develop novel informatics methods to abstract characteristic phenotypes from electronic health records (EHRs) for patients diagnosed with or evaluated for monogenic disorders, enable the interoperability of computable characteristic phenotypes with existing phenotype-genotype association knowledge such as OMIM and ClinVar, and improve the efficiency and effectiveness of genomic diagnostics.

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Research Project (R01)
Project #
5R01LM012895-03
Application #
9925808
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
Project Start
2018-09-17
Project End
2022-05-31
Budget Start
2020-06-01
Budget End
2021-05-31
Support Year
3
Fiscal Year
2020
Name
Columbia University (N.Y.)
Department
Internal Medicine/Medicine
Type
Schools of Medicine
DUNS #
621889815
City
New York
State
NY
Country
United States
Zip Code
10032