Observational epidemiological studies are effective methods for identifying factors that affect the health and illness of populations, as well as for determining optimal treatments for diseases such as cancer. However, conventional epidemiological research usually involves personnel-intensive effort (such as manual review of charts and public records) and can be very time-consuming before conclusive results are obtained. Recently, a large amount of detailed longitudinal clinical data has accumulated in hospitals' Electronic Medical Record (EMR) systems, and it has become a valuable data source for epidemiological studies. However, two obstacles prevent the wide use of EMR data in epidemiological studies. First, most of the detailed clinical information in EMRs is embedded in narrative text, and it is very costly to extract that information manually. Second, EMRs usually have data quality problems such as selection bias and missing data, which require adaptation of conventional statistical methods developed for randomized controlled trials. In this study, we propose an in silico informatics-based approach for observational epidemiological studies using EMR data. We hypothesize that existing EMR data can be used for certain types of epidemiological studies in a very efficient manner with the help of informatics methods. The informatics-based approach will contain two major components. One is a natural language processing (NLP)-based information extraction system that can automatically extract detailed clinical information from EMRs; the other is a set of statistical and informatics methods that can be used to analyze EMR-derived data. If the feasibility of this approach is proven, it will change the standard paradigm of observational epidemiological research, because it can answer an epidemiological question in a very short time at a very low cost.
The specific aim of this study is to develop an automated informatics approach to extract both fine-grained cancer findings and general clinical information from EMRs and use them to conduct cancer-related epidemiological studies. We will perform both case-control and cohort studies related to the prevention and treatment of breast and colon cancers using EMR data. The informatics approach will be validated on EMRs from two major hospitals to demonstrate its generalizability. Epidemiological findings from our study will be compared to reported findings for validation.

Public Health Relevance

According to the American Cancer Society, about 7.6 million people worldwide died from various types of cancer in 2007. It is very important to identify risk factors for cancers and to determine optimal treatments, and epidemiological studies are one way to achieve this. This proposed study will use natural language processing technologies to automatically extract fine-grained cancer information from existing patient electronic medical records and use it to conduct cancer-related epidemiological studies, thus accelerating the accumulation of knowledge in cancer research.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Project (R01)
Study Section: Special Emphasis Panel (ZCA1-SRLB-G (M1))
Program Officer: Li, Jerry
Vanderbilt University Medical Center
Internal Medicine/Medicine
Schools of Medicine
United States
Xu, Hua; Aldrich, Melinda C; Chen, Qingxia et al. (2015) Validating drug repurposing signals using electronic health records: a case study of metformin associated with reduced cancer mortality. J Am Med Inform Assoc 22:179-91
Zhang, Yaoyun; Tang, Buzhou; Jiang, Min et al. (2015) Domain adaptation for semantic role labeling of clinical text. J Am Med Inform Assoc 22:967-79
Wu, Y; Denny, J C; Rosenbloom, S T et al. (2015) A Preliminary Study of Clinical Abbreviation Disambiguation in Real Time. Appl Clin Inform 6:364-74
Zhang, Yaoyun; Soysal, Ergin; Moon, Sungrim et al. (2015) Integrating Multiple On-line Knowledge Bases for Disease-Lab Test Relation Extraction. AMIA Jt Summits Transl Sci Proc 2015:204-8
Liu, Mei; Cai, Ruichu; Hu, Yong et al. (2014) Determining molecular predictors of adverse drug reactions with causality analysis based on structure learning. J Am Med Inform Assoc 21:245-51
Jiang, Min; Wu, Yonghui; Shah, Anushi et al. (2014) Extracting and standardizing medication information in clinical text - the MedEx-UIMA system. AMIA Jt Summits Transl Sci Proc 2014:37-42
Wiley, Laura K; Shah, Anushi; Xu, Hua et al. (2013) ICD-9 tobacco use codes are effective identifiers of smoking status. J Am Med Inform Assoc 20:652-8
Tang, Buzhou; Wu, Yonghui; Jiang, Min et al. (2013) A hybrid system for temporal information extraction from clinical text. J Am Med Inform Assoc 20:828-35
Van Driest, Sara L; Shah, Anushi; Marshall, Matthew D et al. (2013) Opioid use after cardiac surgery in children with Down syndrome. Pediatr Crit Care Med 14:862-8
Fan, Jung-wei; Yang, Elly W; Jiang, Min et al. (2013) Syntactic parsing of clinical text: guideline and corpus development with handling ill-formed sentences. J Am Med Inform Assoc 20:1168-77

Showing the most recent 10 out of 30 publications