Our long-term goal is to optimize the design and conduct of human clinical research using informatics [1]. Eligibility criteria define the study population for every human study. Their clarity, accuracy and precision are crucial to the success of participant recruitment, results dissemination, and evidence synthesis. Our goal for this renewal is to build a data-driven and knowledge-based decision aid for real-life clinical researchers to optimize research eligibility criteria definition. The difference between the semantic representation of an eligibility criterion (e.g., having Type 2 diabetes mellitus) and its operationalization as a clinical variable (e.g., HbA1C ≥ 6.5% or ICD-9 code = "250.00") has been defined as the semantic gap [2], the closing of which is a grand challenge for biomedical informatics [2,3]. Our research has contributed to the in-depth understanding of this semantic gap and how it limits computational reuse and effective communication of eligibility criteria to key stakeholders of clinical research [4-9]. We have developed informatics methods to help bridge this gap by transforming free-text eligibility criteria into semi-structured formats to aid in study cohort identification [10-13], analysis of the population representativeness of related clinical trials [14-19], text mining of common eligibility features and their trends [18,20-24], and identification of questionable exclusion criteria for mental disorder trials [25]. We used several of these methods to develop a visualization system called VITTA [17] that shows how eligibility criteria and the clinical features of clinical trial populations vary across related trials. More importantly, our research has revealed an understudied root cause of the semantic gap: eligibility criteria are often poorly defined, inaccurate, nonspecific, or imprecise, and not easily translatable to the real-world electronic health record (EHR) data representations against which the criteria must be operationalized.
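To make the semantic gap concrete, the following minimal Python sketch shows one way the criterion "having Type 2 diabetes mellitus" could be operationalized against EHR-style data, using only the two clinical variables named above (HbA1C ≥ 6.5% or ICD-9 code "250.00"). The `PatientRecord` structure, its field names, and the `has_type2_diabetes` function are hypothetical and exist only for illustration; they are not the project's data model or method, and a real phenotype definition would likely need more evidence than this.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one patient's EHR data."""
    patient_id: str
    hba1c_percent: List[float] = field(default_factory=list)  # HbA1C lab results, in %
    icd9_codes: List[str] = field(default_factory=list)       # recorded diagnosis codes


# Values taken from the example in the text; everything else is an assumption.
HBA1C_THRESHOLD = 6.5
T2DM_ICD9_CODE = "250.00"


def has_type2_diabetes(record: PatientRecord) -> bool:
    """Operationalize 'having Type 2 diabetes mellitus' as a clinical variable.

    The criterion is treated as met if any HbA1C result is at or above the
    threshold, or if the diagnosis code appears in the record.
    """
    lab_evidence = any(value >= HBA1C_THRESHOLD for value in record.hba1c_percent)
    code_evidence = T2DM_ICD9_CODE in record.icd9_codes
    return lab_evidence or code_evidence


if __name__ == "__main__":
    patients = [
        PatientRecord("p1", hba1c_percent=[5.9, 6.7]),
        PatientRecord("p2", icd9_codes=["250.00"]),
        PatientRecord("p3", hba1c_percent=[5.4], icd9_codes=["401.9"]),
    ]
    for patient in patients:
        print(patient.patient_id, has_type2_diabetes(patient))
```

Even in this toy form, the free-text criterion leaves several choices open (which lab value counts, over what time window, whether a code alone suffices), which is the kind of imprecision the paragraph above describes.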
The advent of Big Patient Data offers an unprecedented opportunity to draw on the characteristics of real-world patients to guide and inform the data-driven precise definition of eligibility criteria [25]. By defining the characteristics of the intended study population, eligibility criteria critically influence the population representativeness of a clinical study, which further influences the tradeoff between patient safety and research results' replicability and generalizability. We hypothesize that by integrating patient data, including clinical and genomic data, with public clinical trial information, we can proactively guide investigators to optimize the precision, recruitment feasibility and representativeness of eligibility criteria. This research will demonstrate a novel data-driven and knowledge-based system to assist researchers with optimizing eligibility criteria, through innovative informatics methods for integrating proprietary and public data for deep phenotyping, target population profiling, and quantification and visualization of population representativeness.

Public Health Relevance

This research will increase the transparency of the population representativeness of clinical research eligibility criteria, reduce selection biases, improve research reliability, and enhance the patient-centeredness of clinical studies and thus mitigate health disparities.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section: Special Emphasis Panel (ZLM1)
Program Officer: Sim, Hua-Chuan
Columbia University (N.Y.)
Internal Medicine/Medicine
Schools of Medicine
New York
United States
Butler, Alex; Wei, Wei; Yuan, Chi et al. (2018) The Data Gap in the EHR for Clinical Research Eligibility Screening. AMIA Jt Summits Transl Sci Proc 2017:320-329
Ta, Casey N; Dumontier, Michel; Hripcsak, George et al. (2018) Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records. Sci Data 5:180273
Grossman, Lisa V; Mitchell, Elliot G; Hripcsak, George et al. (2018) A method for harmonization of clinical abbreviation and acronym sense inventories. J Biomed Inform 88:62-69
Son, Jung Hoon; Xie, Gangcai; Yuan, Chi et al. (2018) Deep Phenotyping on Electronic Health Records Facilitates Genetic Diagnosis by Clinical Exomes. Am J Hum Genet 103:58-73
Weng, Chunhua; Goldstein, Andrew; Yuan, Chi et al. (2018) The ranking of scientists. J Biomed Inform 79:145-146
Si, Yuqi; Weng, Chunhua (2017) An OMOP CDM-Based Relational Database of Clinical Research Eligibility Criteria. Stud Health Technol Inform 245:950-954
Sen, Anando; Ryan, Patrick B; Goldstein, Andrew et al. (2017) Correlating eligibility criteria generalizability and adverse events using Big Data for patients and clinical trials. Ann N Y Acad Sci 1387:34-43
He, Zhe; Langford, Aisha (2017) Comparative Analysis of Geriatric and Adult Drug Clinical Trials on ClinicalTrials.gov. Stud Health Technol Inform 245:1265
Kang, Tian; Zhang, Shaodian; Tang, Youlan et al. (2017) EliIE: An open-source information extraction system for clinical trial eligibility criteria. J Am Med Inform Assoc 24:1062-1071
He, Zhe; Gonzalez-Izquierdo, Arturo; Denaxas, Spiros et al. (2017) Comparing and Contrasting A Priori and A Posteriori Generalizability Assessment of Clinical Trials on Type 2 Diabetes Mellitus. AMIA Annu Symp Proc 2017:849-858

Showing the most recent 10 out of 99 publications