The long-term aim of this project is to use natural language processing (NLP) to build a high-throughput tool that facilitates cancer research by automatically extracting and organizing clinical and genetic information from the Electronic Medical Record (EMR) and from journal articles. Our research applies advanced NLP techniques to: 1) enable the mining of phenotypic and genotypic data in the EMR; 2) automatically amass knowledge about cancer and biomolecular relationships from journals; 3) develop a Web-enabled visualization tool for researchers that will present diverse views of the knowledge; and 4) develop an infrastructure that will link to the clinical data warehouse at New York Presbyterian Hospital (NYPH) and to GeneWays, a related project that allows researchers to visualize pathways. More specifically, MedLEE (the NLP system we developed that extracts and encodes clinical and environmental information from the EMR) will be extended to extract genetic information contained in the EMR; subsequently, twelve years of patient reports will be processed and the extracted data added to the warehouse. In addition, a new system, PhenoGenes, will be developed based on MedLEE and GeneWays (which contains another NLP system we developed that extracts and codifies biomolecular relations from journal articles). PhenoGenes will capture biomolecular interactions directly associated with the treatment, diagnosis, and prognosis of cancer. It will also generate an XML knowledge base that integrates and organizes the captured information, and a Web-enabled tool that allows users to browse and view the knowledge clustered according to different orientations (e.g., gene, disease, tissue, interaction). The knowledge base will be linked to the GeneWays system so that relevant pathways can be visualized. MedLEE is used operationally at NYPH, and both NLP systems have been demonstrated to be highly effective.
The current project builds upon our experience and success with these systems. The availability of related, compatible clinical and biomolecular NLP systems provides an exceptional opportunity to pave the way for the capture, integration, and organization of phenotypic and genotypic data and knowledge that will be used to radically improve patient care.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM007659-03
Application #: 6912634
Study Section: Special Emphasis Panel (ZLM1-MMR-C (O1))
Program Officer: Ye, Jane
Project Start: 2003-08-01
Project End: 2007-07-31
Budget Start: 2005-08-01
Budget End: 2006-07-31
Support Year: 3
Fiscal Year: 2005
Total Cost: $478,937
Indirect Cost:

Name: Columbia University (N.Y.)
Department: Internal Medicine/Medicine
Type: Schools of Medicine
DUNS #: 621889815
City: New York
State: NY
Country: United States
Zip Code: 10032
Van Vleck, Tielman T; Elhadad, Noémie (2010) Corpus-Based Problem Selection for EHR Note Summarization. AMIA Annu Symp Proc 2010:817-21
Borlawsky, Tara B; Li, Jianrong; Shagina, Lyudmila et al. (2010) Evaluation of an Ontology-anchored Natural Language-based Approach for Asserting Multi-scale Biomolecular Networks for Systems Medicine. AMIA Jt Summits Transl Sci Proc 2010:6-10
Morrison, Frances P; Li, Li; Lai, Albert M et al. (2009) Repurposing the clinical record: can an existing natural language processing system de-identify clinical notes? J Am Med Inform Assoc 16:37-9
Hripcsak, George; Soulakis, Nicholas D; Li, Li et al. (2009) Syndromic surveillance using ambulatory electronic health records. J Am Med Inform Assoc 16:354-61
Wang, Xiaoyan; Hripcsak, George; Markatou, Marianthi et al. (2009) Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. J Am Med Inform Assoc 16:328-37
Wang, Xiaoyan; Hripcsak, George; Friedman, Carol (2009) Characterizing environmental and phenotypic associations using information theory and electronic health records. BMC Bioinformatics 10 Suppl 9:S13
Morrison, Frances P; Sengupta, Soumitra; Hripcsak, George (2009) Using a pipeline to improve de-identification performance. AMIA Annu Symp Proc 2009:447-51
Xu, Hua; Stetson, Peter D; Friedman, Carol (2009) Methods for building sense inventories of abbreviations in clinical notes. J Am Med Inform Assoc 16:103-8
Sam, Lee T; Mendonça, Eneida A; Li, Jianrong et al. (2009) PhenoGO: an integrated resource for the multiscale mining of clinical and biological data. BMC Bioinformatics 10 Suppl 2:S8
Fan, Jung-Wei; Friedman, Carol (2009) Generating quality word sense disambiguation test sets based on MeSH indexing. AMIA Annu Symp Proc 2009:183-7

Showing the most recent 10 out of 53 publications