Mining knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. It is therefore important to recognize various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals. Indeed, our previous PubMed log analysis revealed that people search for certain biomedical concepts more often than others and that strong associations exist between different concepts. For example, a disease name often co-occurs with gene/protein and drug names. Our past research has mostly focused on identifying genes and species in PubMed citations. In 2012-2013, we continued our efforts to improve disease name detection in free text. Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text (the task of disease name normalization) than to address other normalization tasks, such as gene normalization, in biomedical text mining research. To this end, we developed DNorm, the first machine-learning approach for disease name normalization, based on our previously developed NCBI Disease corpus. Our technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in very large optimization problems for information retrieval. Compared with traditional approaches based on lexical normalization and matching, DNorm achieved a micro-averaged F-measure of 0.782 and a macro-averaged F-measure of 0.809, increases of 0.121 and 0.098, respectively, over the highest-performing baseline method. We also applied DNorm to the 2013 ShARe/CLEF eHealth Shared Task, a worldwide challenge on recognizing disease names in four different types of clinical notes (e.g., discharge summaries), where we achieved the highest performance in the disease normalization task among all 16 participating international teams.
Besides disease name recognition, we also improved the state of the art in finding mutations in free text. Text mining of mutation information from the literature has become a critical part of the bioinformatics approach to analyzing and interpreting sequence variations in complex diseases in the post-genomic era. It has also been used to assist the creation of disease-related mutation databases. Most existing approaches are rule-based and focus on limited types of sequence variations, such as protein point mutations. Extending their extraction scope therefore requires significant manual effort in examining new instances and developing corresponding rules, so new automatic approaches are greatly needed for extracting different kinds of mutations with high accuracy. We therefore developed tmVar, a text-mining approach based on conditional random fields (CRFs) for extracting a wide range of sequence variants described at the protein, DNA, and RNA levels according to the standard nomenclature developed by the Human Genome Variation Society. In doing so, we cover several important types of mutations that were not considered by traditional methods. Using a novel CRF label model and feature set, tmVar achieves higher performance than a state-of-the-art method on both our corpus (F-measure of 91.4% versus 78.1%) and that method's own gold standard (F-measure of 93.9% versus 89.4%).
As mentioned earlier, one promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process of extracting knowledge from the biomedical literature into structured databases. In this regard, we developed PubTator, a Web-based tool for assisting manual literature curation, and assessed its use through participation in BioCreative 2012, an international workshop that brings together the text mining and biology communities to drive the development of text mining systems that can be integrated into the biocuration workflow and the knowledge discovery process. PubTator differs from the few existing tools in featuring a PubMed-like interface, which many biocurators find familiar, and in being equipped with multiple challenge-winning text mining algorithms to ensure the quality of its automatically computed results. In a formal evaluation with two external user groups, PubTator was shown to improve both the efficiency and the accuracy of manual curation. PubTator is currently being used to assist real-life database curation.
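As a rough illustration of the pairwise learning-to-rank idea behind DNorm described above, the following Python sketch trains a scoring function so that, for each training mention, the correct concept name outscores an incorrect one by a margin. Everything in it is an assumption made for illustration: the bilinear scoring form over bag-of-words token counts, the identity initialization, the learning rate and margin, the toy lexicon, and the names `bag_of_words`, `PairwiseRanker`, and `train` are hypothetical and are not taken from the DNorm implementation.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts (a stand-in for weighted term vectors)."""
    return Counter(text.lower().split())


class PairwiseRanker:
    """Scores a (mention, name) pair as the sum of m_i * W[i, j] * n_j."""

    def __init__(self):
        # Sparse weight matrix keyed by (mention_token, name_token).
        self.weights = {}

    def weight(self, m_tok, n_tok):
        # Unset entries behave like the identity matrix, so the untrained
        # model simply counts tokens shared by the mention and the name.
        return self.weights.get((m_tok, n_tok), 1.0 if m_tok == n_tok else 0.0)

    def score(self, mention, name):
        m, n = bag_of_words(mention), bag_of_words(name)
        return sum(mv * nv * self.weight(mt, nt)
                   for mt, mv in m.items() for nt, nv in n.items())

    def update(self, mention, correct, wrong, lr=0.25, margin=1.0):
        """One margin-ranking step; returns True if the weights changed."""
        if self.score(mention, correct) - self.score(mention, wrong) >= margin:
            return False
        m = bag_of_words(mention)
        # Raise weights toward the correct name, lower them toward the wrong one.
        for sign, name in ((1.0, correct), (-1.0, wrong)):
            for nt, nv in bag_of_words(name).items():
                for mt, mv in m.items():
                    self.weights[(mt, nt)] = self.weight(mt, nt) + sign * lr * mv * nv
        return True

    def rank(self, mention, names):
        """Candidate names sorted from highest to lowest score."""
        return sorted(names, key=lambda name: self.score(mention, name), reverse=True)


def train(ranker, examples, lexicon, epochs=20):
    """examples: (mention, correct_name) pairs; all other lexicon names act as negatives."""
    for _ in range(epochs):
        changed = False
        for mention, correct in examples:
            for wrong in lexicon:
                if wrong != correct:
                    changed |= ranker.update(mention, correct, wrong)
        if not changed:
            break  # every training pair already satisfies the margin


# Toy lexicon and a single training example (illustrative only).
lexicon = ["breast diseases", "breast neoplasms", "lung diseases", "lung neoplasms"]
ranker = PairwiseRanker()
train(ranker, [("breast cancer", "breast neoplasms")], lexicon)
print(ranker.rank("lung cancer", lexicon))
```

In this toy run the mention "lung cancer" never appears in training, yet "lung neoplasms" is ranked first, because training raised the weight linking the mention token "cancer" to the name token "neoplasms" and lowered the one linking it to "diseases". A plain lexical match cannot make that distinction, since both lung-related names share exactly one token with the mention.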

Support Year: 1
Fiscal Year: 2013
Total Cost: $821,852
Name: National Library of Medicine
van Asten, Freekje; Simmons, Michael; Singhal, Ayush et al. (2018) A Deep Phenotype Association Study Reveals Specific Phenotype Associations with Genetic Variants in Age-related Macular Degeneration: Age-Related Eye Disease Study 2 (AREDS2) Report No. 14. Ophthalmology 125:559-568
Allot, Alexis; Peng, Yifan; Wei, Chih-Hsuan et al. (2018) LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Res 46:W530-W536
Peng, Yifan; Wang, Xiaosong; Lu, Le et al. (2018) NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Jt Summits Transl Sci Proc 2017:188-196
Ching, Travers; Himmelstein, Daniel S; Beaulieu-Jones, Brett K et al. (2018) Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 15:
Lee, Kyubum; Famiglietti, Maria Livia; McMahon, Aoife et al. (2018) Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLoS Comput Biol 14:e1006390
Peng, Yifan; Rios, Anthony; Kavuluru, Ramakanth et al. (2018) Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database (Oxford) 2018:
Kwon, Dongseop; Kim, Sun; Wei, Chih-Hsuan et al. (2018) ezTag: tagging biomedical concepts via interactive learning. Nucleic Acids Res 46:W523-W529
Rios, Anthony; Kavuluru, Ramakanth; Lu, Zhiyong (2018) Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics 34:2973-2981
Mao, Yuqing; Lu, Zhiyong (2017) MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank. J Biomed Semantics 8:15
Liu, Xiaoxia; Yang, Zhihao; Lin, Hongfei et al. (2017) DIGNiFI: Discovering causative genes for orphan diseases using protein-protein interaction networks. BMC Syst Biol 11:23

Showing the most recent 10 out of 51 publications