Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. Hence, it is important to be able to recognize various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals. Indeed, our previous analysis of PubMed search logs revealed that people search for certain biomedical concepts more often than others and that there are strong associations between different concepts. For example, a disease name often co-occurs with gene/protein and drug names. Our past research has mostly focused on identifying genes and species in PubMed citations. In 2012-2013, we continued our efforts to improve disease name detection in free text. Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text (the task of disease name normalization) than to address other normalization tasks (e.g. gene normalization) in biomedical text mining research. To this end, we developed DNorm, the first machine-learning approach for disease name normalization, based on our previously developed NCBI Disease corpus. Our technique is based on pairwise learning to rank, which had not previously been applied to the normalization task but has proven successful in very large optimization problems in information retrieval. Compared with traditional approaches based on lexical normalization and matching, DNorm achieved a micro-averaged F-measure of 0.782 and a macro-averaged F-measure of 0.809, increases of 0.121 and 0.098, respectively, over the highest-performing baseline method. Our DNorm system was also applied to the 2013 ShARe/CLEF eHealth Shared Task, a worldwide challenge task of recognizing disease names in 4 different types of clinical notes (e.g. discharge summaries).
We achieved the highest performance in the disease normalization task among all 16 international participating teams. Besides disease name recognition, we also improved the state of the art in finding mutations in free text. Text mining of mutation information from the literature has become a critical part of bioinformatics approaches to analyzing and interpreting sequence variations in complex diseases in the post-genomic era. It has also been used to assist the creation of disease-related mutation databases. Most existing approaches are rule-based and focus on limited types of sequence variations, such as protein point mutations. Thus, extending their extraction scope requires significant manual effort in examining new instances and developing corresponding rules. As such, new automatic approaches are greatly needed for extracting different kinds of mutations with high accuracy. Hence, we developed tmVar, a text-mining approach based on conditional random fields (CRFs) for extracting a wide range of sequence variants described at the protein, DNA and RNA levels according to a standard nomenclature developed by the Human Genome Variation Society. By doing so, we cover several important types of mutations that were not considered by traditional methods. Using a novel CRF label model and feature set, tmVar achieves higher performance than a state-of-the-art method on both our corpus (91.4% versus 78.1% in F-measure) and that method's own gold standard (93.9% versus 89.4% in F-measure). As mentioned earlier, one promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process.
In this regard, we developed PubTator and assessed its use in assisting manual biocuration through participation in BioCreative 2012, an international workshop that brings together the text mining and biology communities to drive the development of text mining systems that can be integrated into the biocuration workflow and the knowledge discovery process. PubTator is a Web-based tool for assisting manual literature curation, the highly expensive endeavor of extracting knowledge from the biomedical literature into structured databases. It differs from the few existing tools by featuring a PubMed-like interface, which many biocurators find familiar, and by being equipped with multiple challenge-winning text mining algorithms to ensure the quality of its automatically computed results. In a formal evaluation with two external user groups, PubTator was shown to improve both the efficiency and accuracy of manual curation. PubTator is currently being used to assist real-life database curation.
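The DNorm results above are reported as both micro-averaged and macro-averaged F-measures. The following is a minimal sketch of how the two can differ. It assumes that micro-averaging pools true-positive, false-positive and false-negative counts over all documents, and that macro-averaging takes the mean of per-document F-measures; the exact evaluation protocol is not described in this summary. The function names and the concept identifiers are hypothetical and serve only as an illustration.

```python
from typing import List, Set, Tuple


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0  # convention chosen for this sketch
    return 2 * precision * recall / (precision + recall)


def counts(predicted: Set[str], gold: Set[str]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one document."""
    tp = len(predicted & gold)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    return tp, fp, fn


def micro_macro_f(docs: List[Tuple[Set[str], Set[str]]]) -> Tuple[float, float]:
    """Compute (micro, macro) F-measure over (predicted, gold) concept sets."""
    per_doc = [counts(pred, gold) for pred, gold in docs]
    # Micro: pool the counts over all documents, then compute one F-measure.
    micro = f_measure(
        sum(c[0] for c in per_doc),
        sum(c[1] for c in per_doc),
        sum(c[2] for c in per_doc),
    )
    # Macro: compute one F-measure per document, then average them.
    macro = sum(f_measure(*c) for c in per_doc) / len(per_doc)
    return micro, macro


# Hypothetical (predicted, gold) disease concept identifiers for three documents.
docs = [
    ({"C1", "C2"}, {"C1", "C2"}),   # perfect match, F = 1.0
    ({"C3"}, {"C3", "C4", "C5"}),   # precision 1, recall 1/3, F = 0.5
    ({"C6", "C7"}, {"C8"}),         # nothing correct, F = 0.0
]
micro, macro = micro_macro_f(docs)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.545 macro=0.500
```

The two averages diverge because micro-averaging gives more weight to documents with many annotations, while macro-averaging weights every document equally.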