Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. Hence, it is important to be able to recognize various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals. Indeed, our previous PubMed log analysis revealed that people search for certain biomedical concepts more often than others and that there are strong associations between different concepts. For example, a disease name often co-occurs with gene/protein and drug names. Our own past research has mostly focused on identifying genes and species in PubMed citations. In 2011-2012, while continuing our efforts to improve gene name recognition, we also turned our attention to disease name detection. Like gene names, disease names are irregular and ambiguous, which makes them difficult to identify through simple dictionary look-up and an interesting challenge for the text-mining community. However, due to the lack of adequate training data, little work has focused on disease name identification. To this end, we created a large-scale disease corpus consisting of 6,900 disease names in 793 PubMed abstracts. Developed by a team of 12 annotators (two people per annotation), our corpus contains rich annotations for every disease occurrence in these abstracts. Furthermore, disease names are categorized into four distinct groups: Specific Disease, Disease Class, Composite Mention and Disease Modifier. When used as gold-standard data for training state-of-the-art machine-learning algorithms, our corpus yielded significantly higher performance than an existing corpus with limited annotations. These characteristics make our disease name corpus a valuable resource for mining disease-related information from biomedical text.
Following named entity recognition, we also continued our research from previous years on automatically identifying relationships between biological entities, as part of an effort to build an end-to-end system that includes both entity recognition and relationship extraction. This year, our research emphasized extracting pharmacogenomics (PGx) information from free text. Specifically, we developed a systematic approach to automatically identify PGx relationships between genes, drugs and diseases from clinical trial records. In our evaluation, we found that our extracted relationships overlap significantly with the knowledge curated from the literature in a PGx database and that most relationships appear on average 5 years earlier in clinical trials than in their corresponding publications, suggesting that clinical trials may be valuable both for validating known PGx information and for capturing new PGx information in a more timely manner. Furthermore, two human reviewers judged a portion of the computer-generated relationships and found an overall accuracy of 74% for our text-mining approach. This work has practical implications for enriching existing knowledge of PGx gene-drug-disease relationships as well as for suggesting crosslinks between clinical trial records and other PGx knowledge bases. As mentioned earlier, one promising application area for text-mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we conducted two separate investigations, one aiming to understand the needs of the curation community and the other to directly improve links between literature and biological data.
Together with colleagues outside of the NIH, we organized the BioCreative 2012 workshop on Interactive Text Mining in the Biocuration Workflow, an international event that brought together the biocuration and text-mining communities to develop and evaluate interactive text-mining tools and systems with improved utility and usability in the biocuration workflow. Specifically, we chaired Workshop Track II, entitled Biocuration Workflows and Text Mining, for which we invited written descriptions of curation workflows from expert-curated databases. We received seven qualified contributions, primarily from model organism databases such as FlyBase. Based on these descriptions, we identified commonalities and differences across the workflows, the common ontologies and controlled vocabularies used, and the current and desired uses of text mining for biocuration. Compared to a similar study in 2009, our 2012 results show that many more databases are now using text mining in parts of their curation workflows. In addition, the Track II participants identified the text-mining aids most desired by biocurators: finding gene names and symbols (gene indexing), prioritizing documents for curation (document triage), and assigning ontology concepts. Our second curation-oriented text-mining investigation focused on directly improving links between literature and biological data. In today's biomedical research, high-throughput experiments and bioinformatics techniques are creating an exploding volume of data that is becoming overwhelming to track for biologists and other researchers who need to access, analyze and process existing data. Much of the available data is being deposited in specialized databases, such as the Gene Expression Omnibus (GEO) for microarrays or the Protein Data Bank (PDB) for protein structures and coordinates.
Data sets are also being described by their authors in publications archived in literature databases such as MEDLINE and PubMed Central. Currently, the curation of links between biological databases and the literature relies mainly on manual labor, which makes it a time-consuming and daunting task. We therefore analyzed the current state of link curation between GEO, PDB and MEDLINE. We found that link curation is heterogeneous across the sources and databases involved, and that overlap between sources is low: less than 50% for PDB and GEO. Furthermore, we showed that text-mining tools can automatically provide valuable evidence to help curators broaden the scope of articles and database entries that they review. As a result, we made recommendations to improve the coverage of curated links, as well as the consistency of information available from different databases, while maintaining high-quality curation.

National Library of Medicine
Névéol, Aurélie; Wilbur, W John; Lu, Zhiyong (2012) Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE. Database (Oxford) 2012:bas026
Li, Jiao; Lu, Zhiyong (2012) Systematic identification of pharmacogenomics information from clinical trials. J Biomed Inform 45:870-8
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A et al. (2012) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 40:D13-25
Wei, Chih-Hsuan; Kao, Hung-Yu; Lu, Zhiyong (2012) SR4GN: a species recognition software tool for gene normalization. PLoS One 7:e38460
Li, Jiao; Lu, Zhiyong (2012) Automatic identification and normalization of dosage forms in drug monographs. BMC Med Inform Decis Mak 12:9
Krallinger, Martin; Vazquez, Miguel; Leitner, Florian et al. (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 12 Suppl 8:S3
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A et al. (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 39:D38-51
Doğan, Rezarta Islamaj; Névéol, Aurélie; Lu, Zhiyong (2011) A textual representation scheme for identifying clinical relationships in patient records. Proc Int Conf Mach Learn Appl 2010:995-998
Huang, Minlie; Névéol, Aurélie; Lu, Zhiyong (2011) Recommending MeSH terms for annotating biomedical articles. J Am Med Inform Assoc 18:660-7
Névéol, Aurélie; Wilbur, W John; Lu, Zhiyong (2011) Extraction of data deposition statements from the literature: a method for automatically tracking research results. Bioinformatics 27:3306-12

Showing the most recent 10 out of 18 publications