Knowledge of protein function serves as a cornerstone of biomedical research and is fundamental to understanding biological systems, the mechanisms of disease and, ultimately, human health. Decades of biomedical research have accumulated a great wealth of such knowledge, available in the form of the biomedical literature. An important task of biomedical informatics is to acquire and represent the knowledge contained in the free text of the literature and to transform it into languages that computational agents can understand, so that the knowledge can be stored, retrieved and used for knowledge discovery. Currently, all protein annotations are assigned manually, which, unfortunately, is extremely labor-intensive and cannot keep pace with the growth of information. Indeed, with the completion of the genome sequences of several model organisms, manual annotation of proteins has already become a major bottleneck between the large number of proteins and the exploding amount of information in the biomedical literature. In this application, we propose to develop methods to facilitate automatic annotation of protein functions based on the functional information buried in the biomedical literature. The proposed methods adapt and extend state-of-the-art probabilistic semantic analysis, information retrieval and machine learning methodologies, which serve as principled approaches to modeling uncertainty in natural language text. The project will develop algorithmic building blocks for a future automatic annotation system such that, when given a brief description of a protein (e.g., a protein name and symbol), the system will be capable of retrieving relevant literature articles about the protein, extracting biological concepts from the articles and mapping the concepts to a controlled vocabulary.
We envision that achieving these goals will result in advances with broader impact, benefiting not only automatic protein annotation but also biomedical literature indexing, one of the important areas of biomedical informatics. Efficient knowledge acquisition and management will enhance biomedical research on the mechanisms of disease and on drug discovery.
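
The three steps named in the abstract (retrieve articles about a protein, extract concepts, map them to a controlled vocabulary) can be illustrated with a toy sketch. This is not the project's method: the abstract names probabilistic semantic analysis, information retrieval and machine learning, whereas the sketch below substitutes a plain bag-of-words cosine similarity for all of them. Every name in it (`retrieve`, `annotate`, `CORPUS`, `VOCABULARY`, the proteins `ABC1` and `XYZ2`) is hypothetical and the data are invented for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(protein, corpus):
    """Step 1 (assumed form): keep documents that mention the protein symbol."""
    symbol = protein.lower()
    return [doc for doc in corpus if symbol in tokenize(doc)]


def annotate(protein, corpus, vocabulary, top_k=1):
    """Steps 2-3 (assumed form): pool the retrieved text as a bag of words,
    then rank controlled-vocabulary terms by similarity to that pool."""
    pooled = Counter(t for doc in retrieve(protein, corpus) for t in tokenize(doc))
    scored = [
        (cosine(pooled, Counter(tokenize(description))), term)
        for term, description in vocabulary.items()
    ]
    scored.sort(reverse=True)
    return [term for score, term in scored[:top_k] if score > 0]


# Toy corpus and vocabulary, invented purely for illustration.
CORPUS = [
    "ABC1 is a kinase that catalyzes protein phosphorylation in the cell cycle.",
    "XYZ2 binds DNA and regulates transcription of target genes.",
    "Phosphorylation by ABC1 kinase activates downstream signaling.",
]
VOCABULARY = {
    "kinase activity": "kinase phosphorylation catalyzes transfer of phosphate",
    "DNA binding": "binds DNA transcription regulation",
}

if __name__ == "__main__":
    print(annotate("ABC1", CORPUS, VOCABULARY))
    print(annotate("XYZ2", CORPUS, VOCABULARY))
```

In a realistic setting each step would be replaced by a stronger component, for example a proper retrieval model in place of the exact-match filter and a topic model or trained classifier in place of the cosine ranking; the sketch only shows how the pieces connect.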

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Biomedical Library and Informatics Review Committee (BLR)
University of Pittsburgh
Schools of Medicine
United States
Lu, Songjian; Lu, Xinghua (2012) Integrating genome and functional genomics data to reveal perturbed signaling pathways in ovarian cancers. AMIA Jt Summits Transl Sci Proc 2012:72-8
Karimzadehgan, Maryam; Zhai, Chengxiang (2012) Integer Linear Programming for Constrained Multi-Aspect Committee Review Assignment. Inf Process Manag 48:725-740
Qin, Tingting; Tsoi, Lam C; Sims, Kellie J et al. (2012) Signaling network prediction by the Ontology Fingerprint enhanced Bayesian network. BMC Syst Biol 6 Suppl 3:S3
Richards, Adam J; Schwacke, John H; Rohrer, Bärbel et al. (2012) Revealing functionally coherent subsets using a spectral clustering and an information integration approach. BMC Syst Biol 6 Suppl 3:S7
Li, Xiaoyun; Bandyopadhyay, Dipankar; Lipsitz, Stuart et al. (2011) Likelihood methods for binary responses of present components in a cluster. Biometrics 67:629-35
Jin, Bo; Chen, Vicky; Chen, Lujia et al. (2011) Mapping annotations with textual evidence using an scLDA model. AMIA Annu Symp Proc 2011:834-42
Cowart, L Ashley; Shotwell, Matthew; Worley, Mitchell L et al. (2010) Revealing a signaling role of phytosphingosine-1-phosphate in yeast. Mol Syst Biol 6:349
Asbury, Thomas M; Mitman, Matt; Tang, Jijun et al. (2010) Genome3D: a viewer-model framework for integrating and visualizing multi-scale epigenomic information within a three-dimensional genome. BMC Bioinformatics 11:444
Richards, Adam J; Muller, Brian; Shotwell, Matthew et al. (2010) Assessing the functional coherence of gene sets with metrics based on the Gene Ontology graph. Bioinformatics 26:i79-87
Jin, Bo; Lu, Xinghua (2010) Identifying informative subsets of the Gene Ontology with information bottleneck methods. Bioinformatics 26:2445-51

Showing the most recent 10 out of 16 publications