Mining useful knowledge from the biomedical literature holds potential for facilitating literature search, biological database curation, and many other scientific tasks. Hence, it is important to be able to recognize various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals. Indeed, our previous PubMed log analysis revealed that users search for certain biomedical concepts more often than others and that there are strong associations between different concepts. For example, in PubMed queries a disease name often co-occurs with gene/protein and drug names. Our own past research has focused on identifying genes and diseases in PubMed citations. In particular, in 2010 we co-organized BioCreative III, an international challenge event that engaged the text mining community in finding gene/protein entities in full-length articles from PubMed Central. Despite these efforts and advances, gene name normalization (mapping a gene name to a database identifier) remains a challenging task. This is partly due to the difficulty of finding the species that corresponds to a recognized gene name and associating the two. The problem arises because species information is often not stated explicitly next to the gene/protein mention, or is missing from the article altogether. Automatic methods are therefore required to infer such information when it is not readily available. To this end, we have developed an open source tool called SR4GN for species recognition and disambiguation in the context of gene normalization. SR4GN significantly extends our previous work with a set of new heuristics for identifying the focus species of an article and for inferring species when such information cannot be found. According to our evaluation on several benchmark datasets, SR4GN achieves state-of-the-art performance and compares favorably to other similar systems.
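The summary does not spell out SR4GN's heuristics, so the following is only a hypothetical sketch of the general idea of falling back to a focus species when none is stated near a gene mention. The function names, the "most frequent species in the article" definition of focus species, and the "first species in the same sentence" rule are all assumptions made for illustration, not the tool's actual rules.

```python
from collections import Counter
from typing import List, Optional


def focus_species(article_species: List[str]) -> Optional[str]:
    """Return the most frequently mentioned species in the article, if any.

    Assumption: "focus species" is approximated by raw mention frequency.
    """
    counts = Counter(article_species)
    return counts.most_common(1)[0][0] if counts else None


def assign_species(sentence_species: List[str],
                   article_species: List[str]) -> Optional[str]:
    """Pick a species for a gene mention.

    Assumed rule: use the first species mentioned in the same sentence;
    if the sentence names none, fall back to the article's focus species.
    """
    if sentence_species:
        return sentence_species[0]
    return focus_species(article_species)


# Hypothetical article in which "mouse" is mentioned most often.
article = ["mouse", "mouse", "human"]
explicit = assign_species(["human"], article)  # species stated in the sentence
inferred = assign_species([], article)         # no species in the sentence
```

In this sketch an explicit species in the sentence always wins, and inference is used only when the local context gives no species at all.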
Another line of our entity recognition research this year concerns normalizing drug names in PubMed Health drug monographs. Specifically, we developed an automatic pipeline that identifies a drug concept in RxNorm (a standardized drug vocabulary) from its ingredient and dose form (the physical form in which a drug is produced and dispensed) as stated in free text. Drug ingredient information was parsed directly from the monograph title. For the dose form, we developed heuristic rules and patterns to extract relevant information from the body of the full-text monograph. Compared with a simple lookup method, our method shows a significant improvement in F-measure. This research is now used to compute a list of drug brand names for each drug monograph in PubMed Health. The results have been deployed and indexed in PubMed Health to help users reach relevant drug pages through brand names (e.g., searching for Tylenol to see the information on acetaminophen).
In 2011, we also explored ways to automatically identify relationships between biological entities, as a step toward an end-to-end system that includes both entity recognition and relationship extraction. We used data from the 4th i2b2 challenge, a corpus of fully de-identified medical records manually annotated with clinical concepts (e.g., medical problems) and relationships (e.g., a treatment improves a medical problem). Machine learning was our main approach for this task. However, unlike the traditional bag-of-words feature representation, we represented a relationship with a scheme of five distinct context blocks determined by the positions of the two potentially related concepts in the text: the introductory, first-concept, connective, second-concept, and conclusive blocks. Experimental results showed that, when used with a support vector machine (SVM), this context-block representation outperformed the traditional bag-of-words model. Our further analysis suggested that the advantage of this representation lies in its ability to automatically capture the relative word positions between concepts, which other studies have also found to be critical.
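The five-block scheme can be illustrated with a minimal sketch. It assumes a whitespace-tokenized sentence and two non-overlapping concept spans given as token offsets; the function names, the `block=token` feature format, and the example sentence are illustrative choices, not the published implementation.

```python
from typing import Dict, List, Tuple

BLOCKS = ("introductory", "first_concept", "connective",
          "second_concept", "conclusive")


def context_blocks(tokens: List[str],
                   span_a: Tuple[int, int],
                   span_b: Tuple[int, int]) -> Dict[str, List[str]]:
    """Split tokens into five blocks around two concept spans.

    Spans are (start, end) token offsets with end exclusive.
    """
    first, second = sorted([span_a, span_b])
    if first[1] > second[0]:
        raise ValueError("concept spans must not overlap")
    return {
        "introductory": tokens[:first[0]],
        "first_concept": tokens[first[0]:first[1]],
        "connective": tokens[first[1]:second[0]],
        "second_concept": tokens[second[0]:second[1]],
        "conclusive": tokens[second[1]:],
    }


def block_features(blocks: Dict[str, List[str]]) -> List[str]:
    """Prefix each lowercased token with its block name, so the same word
    appearing in different blocks yields different features."""
    return [f"{name}={tok.lower()}" for name in BLOCKS for tok in blocks[name]]


# Made-up example: a medical problem ("chest pain") and a treatment
# ("nitroglycerin") in one sentence.
tokens = "The patient 's chest pain improved after nitroglycerin was given".split()
blocks = context_blocks(tokens, span_a=(3, 5), span_b=(7, 8))
features = block_features(blocks)
```

Unlike a plain bag of words, features such as `connective=improved` record where a word sits relative to the two concepts. A feature list of this kind could then be vectorized and passed to an SVM classifier.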

National Library of Medicine
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A et al. (2012) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 40:D13-25
Wei, Chih-Hsuan; Kao, Hung-Yu; Lu, Zhiyong (2012) SR4GN: a species recognition software tool for gene normalization. PLoS One 7:e38460
Li, Jiao; Lu, Zhiyong (2012) Automatic identification and normalization of dosage forms in drug monographs. BMC Med Inform Decis Mak 12:9
Li, Jiao; Lu, Zhiyong (2012) Systematic identification of pharmacogenomics information from clinical trials. J Biomed Inform 45:870-8
Névéol, Aurélie; Wilbur, W John; Lu, Zhiyong (2012) Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE. Database (Oxford) 2012:bas026
Krallinger, Martin; Vazquez, Miguel; Leitner, Florian et al. (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 12 Suppl 8:S3
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A et al. (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 39:D38-51
Doğan, Rezarta Islamaj; Névéol, Aurélie; Lu, Zhiyong (2011) A textual representation scheme for identifying clinical relationships in patient records. Proc Int Conf Mach Learn Appl 2010:995-998
Huang, Minlie; Névéol, Aurélie; Lu, Zhiyong (2011) Recommending MeSH terms for annotating biomedical articles. J Am Med Inform Assoc 18:660-7
Arighi, Cecilia N; Lu, Zhiyong; Krallinger, Martin et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics 12 Suppl 8:S1

Showing the most recent 10 out of 18 publications