The long-term goal of our ongoing project, "Discovering and applying knowledge in clinical databases," is to learn from data in the electronic health record (EHR) and to apply that knowledge to relevant problems. The advent of the EHR greatly amplifies the ability to carry out observational research, opening the possibility of covering emerging problems, diverse populations, rare diseases, and chronic diseases in long-term longitudinal studies. Unfortunately, the EHR carries additional challenges. We believe that the biggest challenge comes from the inaccuracy, incompleteness, complexity, and resulting bias inherent in the recording of the health care process. Put another way, EHR data are not simply research data with more noise and some missing values; instead, the EHR carries systematic biases that must be addressed before the data can reach their potential. We propose to characterize the effects of the health care process on EHR data, to enumerate the potential biases, and to provide mechanisms to circumvent them. In effect, we propose to study the EHR as an object of interest in itself, using new models, data mining, existing knowledge bases, and innovative algorithms to better understand EHR biases so that we can identify them and correct or avoid them. We include expertise from two of the nation's major phenotyping projects, eMERGE and OMOP. We hypothesize that we can learn about biases due to the health care process through data mining and knowledge engineering and that we can correct or at least avoid those biases, enabling us to better answer informatics and clinical questions.
Our aims are as follows: (1) Study health care process biases by correlating raw EHR variables with a panel of health care process-related variables (e.g., admission), using lagged correlation to account for temporal effects, and populating a health care process resource with the correlations and observations. (2) Find associations among raw EHR variables using lagged correlation, information theory, Granger causality, and temporally ordered N-tuples of events, correcting for the health care process biases discovered in Aim 1. (3) Facilitate the definition of higher-level clinical phenotype concepts by applying knowledge resources (including eMERGE and OMOP phenotype definitions and ontologies such as our Medical Entities Dictionary and the UMLS) to the results of Aims 1 and 2 to produce semi-automated and automated phenotype query definitions. (4) Develop a high-throughput method to validate phenotype definitions by measuring the ability to uncover known associations, use the generated phenotypes and associations to answer clinical questions, and disseminate the results, including a large knowledge base of correlations that can be used by other researchers to conduct their own studies.
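As an illustration of the lagged correlation named in Aims 1 and 2, the following is a minimal sketch, not the project's actual method. It assumes two equally spaced daily series of the same length, whereas real EHR data are irregularly sampled and would need resampling or interpolation first. The series `admitted` and `lab_value` and the function names are hypothetical, invented for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(process, variable, max_lag):
    """Correlate process[t] with variable[t + lag] for each lag in 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        x = process[: len(process) - lag]  # drop the last `lag` points
        y = variable[lag:]                 # drop the first `lag` points
        result[lag] = pearson(x, y)
    return result


# Hypothetical data: a daily admission indicator, and a lab value that
# rises exactly two days after each admission.
admitted = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
shifted = [0, 0] + admitted[:-2]
lab_value = [5.0 + 3.0 * s for s in shifted]

correlations = lagged_correlation(admitted, lab_value, max_lag=3)
best_lag = max(correlations, key=correlations.get)
print(best_lag, correlations)
```

In this toy setting the correlation peaks at the lag where the lab value follows the admission, which is the kind of temporal signature of the health care process that Aim 1 proposes to catalog.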

Public Health Relevance

This project studies the electronic health record in order to better understand how health care processes cause problems in the data. By avoiding or correcting those problems, we hope to improve reuse of the data for purposes such as clinical research and quality improvement.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
Columbia University (N.Y.)
Internal Medicine/Medicine
Schools of Medicine
New York
United States
Polubriaginof, Fernanda C G; Vanguri, Rami; Quinnies, Kayla et al. (2018) Disease Heritability Inferred from Familial Relationships Reported in Medical Records. Cell 173:1692-1704.e11
Sottile, Peter D; Albers, David; Higgins, Carrie et al. (2018) The Association Between Ventilator Dyssynchrony, Delivered Tidal Volume, and Sedation Using a Novel Automated Ventilator Dyssynchrony Detection Algorithm. Crit Care Med 46:e151-e157
Schuemie, Martijn J; Ryan, Patrick B; Hripcsak, George et al. (2018) Improving reproducibility by using high-throughput observational studies with empirical calibration. Philos Trans A Math Phys Eng Sci 376:
Sottile, Peter D; Albers, David; Moss, Marc M (2018) Neuromuscular blockade is associated with the attenuation of biomarkers of epithelial and endothelial injury in patients with moderate-to-severe acute respiratory distress syndrome. Crit Care 22:63
Vilar, Santiago; Friedman, Carol; Hripcsak, George (2018) Detection of drug-drug interactions through data mining studies using clinical sources, scientific literature and social media. Brief Bioinform 19:863-877
Ta, Casey N; Dumontier, Michel; Hripcsak, George et al. (2018) Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records. Sci Data 5:180273
Grossman, Lisa V; Mitchell, Elliot G; Hripcsak, George et al. (2018) A method for harmonization of clinical abbreviation and acronym sense inventories. J Biomed Inform 88:62-69
Schuemie, Martijn J; Hripcsak, George; Ryan, Patrick B et al. (2018) Empirical confidence interval calibration for population-level effect estimation studies in observational healthcare data. Proc Natl Acad Sci U S A 115:2571-2577
Levine, Matthew E; Albers, David J; Hripcsak, George (2018) Methodological variations in lagged regression for detecting physiologic drug effects in EHR data. J Biomed Inform 86:149-159
Albers, D J; Elhadad, N; Claassen, J et al. (2018) Estimating summary statistics for electronic health record laboratory data for use in high-throughput phenotyping algorithms. J Biomed Inform 78:87-101

Showing the most recent 10 out of 120 publications