The overarching goal of the project is to establish a genomic medicine learning system to accelerate genomic knowledge discovery and application in electronic health records (EHRs). We will integrate deep characteristic phenotypes extracted from EHRs and evolving knowledge of genotype-phenotype associations to optimize the accuracy of variant interpretation and the cost-effectiveness of clinical genome/exome sequencing, and to accelerate the discovery of causal genes by constructing a dynamic genotype-phenotype knowledge network. Prior knowledge on phenotype-gene relationships and phenotypic information about patients can facilitate the identification of disease-causing mutations from thousands of genetic variants in the context of clinical genomic sequencing; however, how best to abstract phenotype information from notes in the EHRs of patients who are diagnosed with or evaluated for monogenetic disorders, standardize the computable representation of phenotypes, and utilize it in genomic interpretation remains unclear. Additionally, how to systematically compare phenotypes across diseases to discover new knowledge in human genetics remains a largely untapped area with great promise. To address these challenges, we will develop and validate scalable and portable open-source natural language processing (NLP) methods for automated and accurate abstraction of characteristic phenotype concepts (e.g., ?j-shaped sella turcica? and ?short stature?) from EHR narratives. We will then develop a phenotype-driven scoring system called EHR-Phenolyzer to predict the likely candidate genetic variants associated with the phenotypes for patients with genomic sequencing and a high probability of a monogenic condition. On this basis, we will develop a probabilistic disease diagnosis and knowledge discovery system using rich and deep EHR phenotypes, and evaluate these methods for genomic diagnosis and discovery using large- scale clinical exome sequencing data. Ultimately, these methods will support efficient, effective, and scalable genomic diagnostics, and facilitate the implementation of genome-guided precision medicine in clinical practice.
We will develop novel informatics methods to abstract characteristic phenotypes from electronic health records (EHRs) for patients diagnosed with or evaluated for monogenetic disorders, enable the interoperability of computable characteristic phenotypes with existing phenotype-genotype association knowledge such as OMIM and ClinVar, and improve the efficiency and effectiveness of genomic diagnostics.