The goal of cancer pharmacoepidemiology is to identify adverse and/or long-term effects of chemotherapeutic agents and to determine the impact of drugs on cancer risk, prevention, and response to treatment. Pharmacoepidemiology studies strongly influence how optimal treatments are defined and help accelerate translational research. It is therefore imperative that they be conducted efficiently and that they leverage real-world patient data such as electronic health records (EHRs). Massive clinical data from EHRs are already being tapped for research on disease-gene associations, comparative effectiveness, and clinical outcomes. There is, however, a paucity of pharmacoepidemiological studies using comprehensive EHR data because of the inherent challenges of data abstraction, handling, and analysis. The hurdles include heterogeneity of reports, detailed clinical information embedded in narrative text, differing EHR platforms across sites, and missing data, to name a few. In this study, we propose to integrate and extend preexisting tools to build an informatics infrastructure for EHR data extraction, interpretation, management, and analysis to advance cancer pharmacoepidemiology research. We will leverage existing natural language processing (NLP) tools, standardized ontologies, and clinical data management systems to extract and manipulate EHR data for cancer pharmacoepidemiological research. To achieve our goal, we propose four specific aims.
In aim 1, we intend to develop a high-performance, user-centric information extraction framework with advanced features such as active learning (to reduce annotation cost), domain adaptation (to transfer data across multiple sites) and user-friendly interfaces (for non-technical end users).
In aim 2, we plan to improve data harmonization across differing platforms, develop components for seamless data export, and expand methodologies to address impediments inherent to EHR-based data (such as the missing data problem).
In aim 3, we will conduct demonstration projects in cancer pharmacoepidemiology, including pharmacovigilance and pharmacogenomics of chemotherapeutic agents, to evaluate, refine and validate the broad uses of our tools.
Finally, in aim 4, we propose to disseminate the methods and tools developed in this project to the cancer research and pharmacoepidemiology communities.

Public Health Relevance

In this project, we propose to integrate and extend previously developed tools to build an informatics infrastructure for electronic health record (EHR) data extraction, interpretation, management, and analysis, to advance cancer pharmacoepidemiology research. Such methods can efficiently integrate and standardize cancer pharmacoepidemiology-specific information from EHRs across different sites, thus advancing research in this field.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Resource-Related Research Projects--Cooperative Agreements (U24)
Study Section
Special Emphasis Panel (ZCA1)
Program Officer
Friedman, Steve
University of Texas Health Science Center Houston
Sch Allied Health Professions
United States
Lee, Hee-Jin; Wu, Yonghui; Zhang, Yaoyun et al. (2017) A hybrid approach to automatic de-identification of psychiatric notes. J Biomed Inform 75S:S19-S27