A great challenge in the biomedical informatics domain is to develop computational methods that combine existing knowledge and experimental data to derive new knowledge regarding biological systems and disease mechanisms. Most knowledge regarding genes and proteins in biomedical literature is stored in the form of free text that is not suitable for computation, and the manual processes of encoding this body of knowledge into computable form cannot keep up with the rate of knowledge accumulation. The main thrust of the proposed research is to design novel statistical text-mining algorithms to acquire and represent knowledge regarding genes and proteins from free-text literature, and further to combine this acquired knowledge with experimental data to derive new knowledge. We will organize the proposed research to the following specific aims.
Specific Aim 1. Develop ontology-guided semantic modeling algorithms for extracting biological concepts from free text, in which we will design hierarchical probabilistic topic models that are capable of representing biological concepts as a hierarchy and develop novel learning algorithms to infer biological concepts from free-text documents.
Specific Aim 2. Integrate semantic modeling with BioNLP to extract textual evidence supporting protein-function annotations. We will develop information extraction algorithms that will combine the results of hierarchical semantic analysis and BioNLP to identify the text regions that will most likely provide evidence regarding the function of genes/proteins and map the extracted information to a controlled vocabulary.
Specific Aim 3. Develop a framework to unify the procedures of knowledge reasoning and data mining for knowledge discovery. In this aim, we will reason using existing knowledge (represented in the form of an ontology) to reveal functional modules among the genes from the experimental data. We will then further develop algorithms that will reveal relationships between these gene modules by mining system-scaled experimental data. The overall framework will integrate functional reasoning and data mining in an iterative manner to refine the knowledge progressively and to derive rules such as: when genes involved in biological process X are perturbed, genes involved in biological process Y will respond. We will test the framework on the data from yeast-system biology studies and the Cancer Genome Atlas (TCGA) project to gain insights into the cellular systems and disease mechanisms of cancer cells.

Public Health Relevance

In recent decades, biomedical sciences have achieved significant advances;most of the knowledge resulting from research is stored in the form of biomedical literature in the form free-text. This project develop computational approaches to extract knowledge from biomedical literature, represent the knowledge in computable form, and combined the knowledge with experiment data to gain insights into biological systems and disease mechanisms

National Institute of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Project #
Application #
Study Section
Special Emphasis Panel (ZLM1-ZH-C (01))
Program Officer
Ye, Jane
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
University of Pittsburgh
Schools of Medicine
United States
Zip Code
Montefusco, David J; Chen, Lujia; Matmati, Nabil et al. (2013) Distinct signaling roles of ceramide species in yeast revealed through systematic perturbation and systems biology analyses. Sci Signal 6:rs14
Lu, Songjian; Jin, Bo; Cowart, L Ashley et al. (2013) From data towards knowledge: revealing the architecture of signaling systems by unifying knowledge mining and data mining of systematic perturbation data. PLoS One 8:e61134
Jin, Bo; Chen, Vicky; Chen, Lujia et al. (2011) Mapping annotations with textual evidence using an scLDA model. AMIA Annu Symp Proc 2011:834-42