Observational epidemiological studies are an effective means of identifying factors that affect the health and illness of populations and of determining optimal treatments for diseases such as cancer. However, conventional epidemiological research usually requires personnel-intensive effort (such as manual review of charts and public records) and can take a long time to yield conclusive results. Recently, large amounts of detailed longitudinal clinical data have accumulated in hospitals' Electronic Medical Record (EMR) systems, which have become a valuable data source for epidemiological studies. However, two obstacles prevent the wide use of EMR data in such studies. First, most of the detailed clinical information in EMRs is embedded in narrative text, and extracting it manually is very costly. Second, EMRs usually have data quality problems, such as selection bias and missing data, which require adaptation of conventional statistical methods developed for randomized controlled trials. In this study, we propose an in silico, informatics-based approach to observational epidemiological studies using EMR data. We hypothesize that, with the help of informatics methods, existing EMR data can be used very efficiently for certain types of epidemiological studies. The approach will have two major components: a natural language processing (NLP) based information extraction system that automatically extracts detailed clinical information from EMRs, and a set of statistical and informatics methods for analyzing EMR-derived data. If this approach proves feasible, it will change the standard paradigm of observational epidemiological research, because it can answer an epidemiological question in a very short time and at very low cost.
The specific aim of this study is to develop an automated informatics approach that extracts both fine-grained cancer findings and general clinical information from EMRs and uses them to conduct cancer-related epidemiological studies. We will perform both case-control and cohort studies related to the prevention and treatment of breast and colon cancers using EMR data. The informatics approach will be validated on EMRs from two major hospitals to demonstrate its generalizability. Epidemiological findings from our study will be compared with previously reported findings for validation.

Public Health Relevance

According to the American Cancer Society, about 7.6 million people worldwide died from various types of cancer in 2007. Identifying risk factors for cancer and determining optimal treatments are therefore very important, and epidemiological studies are one way to do both. The proposed study will use natural language processing technologies to automatically extract fine-grained cancer information from existing electronic medical records and use it to conduct cancer-related epidemiological studies, thus accelerating the accumulation of knowledge in cancer research.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Project (R01)
Study Section: Special Emphasis Panel (ZCA1-SRLB-G (M1))
Program Officer: Li, Jerry
Vanderbilt University Medical Center
Internal Medicine/Medicine
Schools of Medicine
United States
Liu, Mei; Cai, Ruichu; Hu, Yong et al. (2014) Determining molecular predictors of adverse drug reactions with causality analysis based on structure learning. J Am Med Inform Assoc 21:245-51
Tang, Buzhou; Cao, Hongxin; Wu, Yonghui et al. (2013) Recognizing clinical entities in hospital discharge summaries using Structural Support Vector Machines with word representation features. BMC Med Inform Decis Mak 13 Suppl 1:S1
Wiley, Laura K; Shah, Anushi; Xu, Hua et al. (2013) ICD-9 tobacco use codes are effective identifiers of smoking status. J Am Med Inform Assoc 20:652-8
Chen, Yukun; Cao, Hongxin; Mei, Qiaozhu et al. (2013) Applying active learning to supervised word sense disambiguation in MEDLINE. J Am Med Inform Assoc 20:1001-6
Fan, Jung-wei; Yang, Elly W; Jiang, Min et al. (2013) Syntactic parsing of clinical text: guideline and corpus development with handling ill-formed sentences. J Am Med Inform Assoc 20:1168-77
Van Driest, Sara L; Shah, Anushi; Marshall, Matthew D et al. (2013) Opioid use after cardiac surgery in children with Down syndrome. Pediatr Crit Care Med 14:862-8
Chen, Yukun; Mani, Subramani; Xu, Hua (2012) Applying active learning to assertion classification of concepts in clinical text. J Biomed Inform 45:265-72
Jiang, Min; Chen, Yukun; Liu, Mei et al. (2011) A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. J Am Med Inform Assoc 18:601-6
Xu, Hua; AbdelRahman, Samir; Lu, Yanxin et al. (2011) Applying semantic-based probabilistic context-free grammar to medical language processing--a preliminary study on parsing medication sentences. J Biomed Inform 44:1068-75
Xu, Hua; Jiang, Min; Oetjens, Matt et al. (2011) Facilitating pharmacogenetic studies using electronic health records and natural-language processing: a case study of warfarin. J Am Med Inform Assoc 18:387-91
