The routine operation of the US healthcare system produces an abundance of electronically stored data that capture the care of patients as it is provided outside of controlled research environments. The potential of these data to inform future treatment choices and to improve care and outcomes for all patients in the very system that generates the data is widely acknowledged. Given these key properties of routine-care data and the abundance of electronic healthcare databases covering millions of patients, it is critical to strengthen the rigor of analyses of such data. Our group has previously developed an analytic approach to reduce bias when analyzing routine-care databases, which has proven effective in more than 50 empirical research studies across a range of topics and data sources. However, this approach currently cannot incorporate free-text information recorded in electronic health records, such as clinical notes and reports. This limitation has left a large amount of rich patient information underutilized for clinical research. We thus aim to adapt and refine a set of established natural language processing algorithms that can identify and extract useful information from the clinical notes and reports in electronic health records, and to incorporate that information into our validated analytic approach for balancing background risks across comparison groups, a key step in ensuring fair evaluation when comparing therapeutic options. To test this integrated and augmented approach, we will implement and adapt it in simulation studies where we can evaluate and improve the performance of the new analytic methods in a controlled but realistic fashion. In addition, we will assess the performance of our new approach in 8 practical studies comparing medical or surgical treatments that are highly relevant to patients.
To ensure the highest level of data completeness and quality, we have linked multiple healthcare utilization (claims) databases, spanning 2007 to 2016, with 3 electronic health record systems, one each in Massachusetts, North Carolina, and Texas. These data will allow testing of our newly integrated approach in a variety of care delivery systems and data environments, which will inform the application of our products in real-world settings.

Public Health Relevance

The project will yield a highly flexible and effective analytic method for reducing confounding bias in studies that use routine-care data to compare the effects of medical or surgical treatments. This method will enable researchers to leverage the large amount of patient information recorded in the clinical notes and reports contained within electronic health records to adjust for differences in background risk between comparison groups. Our proposal can improve the quality of evidence based on electronic healthcare data generated in routine-care settings to better inform patient care and optimal prescribing.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Biomedical Computing and Health Informatics Study Section (BCHI)
Program Officer
Sim, Hua-Chuan
Brigham and Women's Hospital
United States