The long-term goal of our research is to develop resources and natural language processing (NLP) systems for knowledge management in the biomedical domain. As biomedical data stored in disparate resources grow rapidly in both scale and complexity, ontology-based knowledge management is becoming increasingly popular, since it provides explicit descriptions of biomedical entities and an approach to annotating and analyzing the results of biomedical research. Much of the information and knowledge relevant to biomedical research is still recorded in free-text format. In the past decade, NLP has been shown to have the potential to accelerate the biomedical knowledge management process. One critical component of NLP systems is identifying gene/protein names (i.e., gene/protein name identification) and normalizing them to standard representations (i.e., gene/protein name normalization). Gene/protein name identification has been tackled with good performance, but gene/protein name normalization remains challenging. First, there is a lack of standard representations for gene/protein names. Researchers have used structured databases, such as the protein database UniProtKB or the gene resource Entrez Gene, as the reference for names. But it is problematic to associate names with individual records in those databases, since a name in text can be generic and refer to a group of records. Additionally, like other biomedical concepts such as diseases or lab procedures, genes and proteins usually appear in text as short forms abbreviated from their names or descriptions. The prevalent use of short forms is another challenge faced by NLP applications because short forms are highly ambiguous.

Specifically, the proposed research aims to:
1) develop onto-BioThesaurus by enriching BioThesaurus, an existing gene/protein thesaurus, with gene/protein-related ontologies. Hypothesis: aligning gene/protein names to gene/protein-related ontologies can i) detect systematic ambiguity, ii) enable automatic reasoning during gene/protein named entity tagging, and iii) facilitate ontology-based knowledge management;
2) enhance onto-BioThesaurus by harvesting short form knowledge from online resources and text. Hypothesis: harvesting synonyms, especially gene/protein short forms, is critical for resolving the ambiguity, synonymy, and novelty problems in gene/protein name normalization;
3) normalize gene/protein names using onto-BioThesaurus. Hypothesis: using onto-BioThesaurus offers several advantages over traditional gene/protein name normalization (i.e., lowering ambiguity, handling novelty, and linking gene/protein concepts to biomedical ontologies), and we expect improved performance of various lookup and disambiguation methods; and
4) evaluate research methods and distribute research outcomes. Hypothesis: evaluating research methods and distributing research outcomes to the public are critical to advancing basic and applied biomedical science.
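The thesaurus-based lookup mentioned in aim 3 can be illustrated with a minimal sketch. This is not the project's implementation: the record identifiers (REC_1 to REC_4), the "example kinase" names, the short form EKA, and the helper functions normalize_key, build_index, and lookup are all hypothetical, chosen only to show how spelling variants can map to one record while an ambiguous short form maps to several.

```python
import re
from collections import defaultdict


def normalize_key(name: str) -> str:
    """Lowercase and drop non-alphanumerics so 'IL-2', 'IL 2', and 'il2' share one key."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_index(entries):
    """Map each normalized name to the set of record ids that list it."""
    index = defaultdict(set)
    for record_id, names in entries.items():
        for name in names:
            index[normalize_key(name)].add(record_id)
    return index


def lookup(index, mention):
    """Return candidate record ids for a mention; more than one means ambiguity."""
    return sorted(index.get(normalize_key(mention), set()))


# Toy thesaurus: hypothetical record ids and partly invented names (assumption).
TOY_ENTRIES = {
    "REC_1": ["interleukin 2", "IL-2"],
    "REC_2": ["interleukin 2 receptor", "IL-2R"],
    "REC_3": ["example kinase A", "EKA"],
    "REC_4": ["example kinase alpha", "EKA"],
}

index = build_index(TOY_ENTRIES)

for mention in ["il2", "EKA", "unknown protein"]:
    candidates = lookup(index, mention)
    status = "ambiguous" if len(candidates) > 1 else "ok" if candidates else "novel"
    print(mention, candidates, status)
```

In this sketch a mention with several candidates would still need a disambiguation step, and a mention with none corresponds to the novelty problem the abstract describes; neither step is shown here.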

Public Health Relevance

The proposed research is critical for biomedical knowledge management and literature mining. It serves as one of the foundations for any automated application that stores, retrieves, and extracts information from free text in the biomedical domain. Additionally, the proposed research will help biomedical researchers and the general community understand and manage biomedical text through web interfaces and automated systems.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 7R01LM009959-03
Application #: 8448471
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Ye, Jane
Project Start: 2009-09-01
Project End: 2013-09-29
Budget Start: 2011-06-30
Budget End: 2013-09-29
Support Year: 3
Fiscal Year: 2010
Total Cost: $614,000
Indirect Cost:

Name: Mayo Clinic, Rochester
Department:
Type:
DUNS #: 006471700
City: Rochester
State: MN
Country: United States
Zip Code: 55905

Elayavilli, Ravikumar Komandur; Liu, Hongfang (2016) Ion Channel ElectroPhysiology Ontology (ICEPO) - a case study of text mining assisted ontology development. AMIA Jt Summits Transl Sci Proc 2016:42-51
Li, Dingcheng; Okamoto, Janet; Liu, Hongfang et al. (2015) A bibliometric analysis on tobacco regulation investigators. BioData Min 8:11
Ravikumar, Komandur Elayavilli; Wagholikar, Kavishwar B; Li, Dingcheng et al. (2015) Text mining facilitates database curation - extraction of mutation-disease associations from Bio-medical literature. BMC Bioinformatics 16:185
Li, Ding-Cheng; Rastegar-Mojarad, Majid; Okamoto, Janet et al. (2015) A Bibliometric Analysis on Cancer Population Science with Topic Modeling. AMIA Jt Summits Transl Sci Proc 2015:102-6
Ravikumar, K E; Wagholikar, Kavishwar B; Liu, Hongfang (2014) Towards pathway curation through literature mining--a case study using PharmGKB. Pac Symp Biocomput :352-63
Liu, Hongfang; Sohn, Sunghwan; Murphy, Sean et al. (2014) Facilitating post-surgical complication detection through sublanguage analysis. AMIA Jt Summits Transl Sci Proc 2014:77-82
Wu, Stephen T; Juhn, Young J; Sohn, Sunghwan et al. (2014) Patient-level temporal aggregation for text-based asthma status ascertainment. J Am Med Inform Assoc 21:876-84
Moosavinasab, Soheil; Rastegar-Mojarad, Majid; Liu, Hongfang et al. (2014) Towards Transforming Expert-based Content to Evidence-based Content. AMIA Jt Summits Transl Sci Proc 2014:83-90
Li, Ding Cheng; Thermeau, Terry; Chute, Christopher et al. (2014) Discovering associations among diagnosis groups using topic modeling. AMIA Jt Summits Transl Sci Proc 2014:43-9
Zhang, Yuji; Tao, Cui (2014) Network Analysis of Cancer-focused Association Network Reveals Distinct Network Association Patterns. Cancer Inform 13:45-51
