A major challenge in biomedical informatics is to develop computational methods that combine existing knowledge with experimental data to derive new knowledge about biological systems and disease mechanisms. Most knowledge about genes and proteins in the biomedical literature is stored as free text, which is not suitable for computation, and manual encoding of this body of knowledge into computable form cannot keep up with the rate at which knowledge accumulates. The main thrust of the proposed research is to design novel statistical text-mining algorithms that acquire and represent knowledge about genes and proteins from the free-text literature, and then to combine this acquired knowledge with experimental data to derive new knowledge. The proposed research is organized into the following specific aims.
Specific Aim 1. Develop ontology-guided semantic modeling algorithms for extracting biological concepts from free text. We will design hierarchical probabilistic topic models capable of representing biological concepts as a hierarchy, and develop novel learning algorithms to infer those concepts from free-text documents.
Specific Aim 2. Integrate semantic modeling with BioNLP to extract textual evidence supporting protein-function annotations. We will develop information extraction algorithms that combine the results of hierarchical semantic analysis and BioNLP to identify the text regions most likely to provide evidence regarding the function of genes/proteins, and map the extracted information to a controlled vocabulary.
Specific Aim 3. Develop a framework to unify the procedures of knowledge reasoning and data mining for knowledge discovery. In this aim, we will reason over existing knowledge (represented in the form of an ontology) to reveal functional modules among the genes in the experimental data. We will then develop algorithms that reveal relationships between these gene modules by mining systems-scale experimental data. The overall framework will integrate functional reasoning and data mining iteratively, refining the knowledge progressively and deriving rules such as: when genes involved in biological process X are perturbed, genes involved in biological process Y will respond. We will test the framework on data from yeast systems biology studies and The Cancer Genome Atlas (TCGA) project to gain insights into the cellular systems and disease mechanisms of cancer cells.
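The kind of rule named in Specific Aim 3 can be illustrated with a minimal counting sketch. This is not the proposed framework, which relies on ontology-based reasoning and iterative refinement; it only shows one simple way a rule of the form "perturbing process X is followed by a response in process Y" could be scored. The function name `derive_rules`, the thresholds, the gene identifiers, and the toy experiments are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def derive_rules(annotations, experiments, min_support=2, min_confidence=0.6):
    """Score candidate rules "process X perturbed -> process Y responds".

    annotations: dict mapping gene -> set of process names (assumed toy mapping).
    experiments: list of (perturbed_genes, responding_genes) pairs of sets.
    Returns (X, Y, support, confidence) tuples, highest confidence first.
    """
    def processes(genes):
        # Collect every process annotated to at least one gene in the set.
        found = set()
        for gene in genes:
            found |= annotations.get(gene, set())
        return found

    perturbed_count = defaultdict(int)  # experiments in which X was perturbed
    pair_count = defaultdict(int)       # experiments with X perturbed and Y responding

    for perturbed, responding in experiments:
        xs = processes(perturbed)
        ys = processes(responding)
        for x in xs:
            perturbed_count[x] += 1
            for y in ys:
                if y != x:
                    pair_count[(x, y)] += 1

    rules = []
    for (x, y), support in pair_count.items():
        confidence = support / perturbed_count[x]
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Hypothetical toy data: placeholder gene names, not real annotations.
annotations = {
    "geneA": {"DNA repair"},
    "geneB": {"DNA repair"},
    "geneC": {"cell cycle"},
    "geneD": {"cell cycle"},
    "geneE": {"lipid metabolism"},
}
experiments = [
    ({"geneA"}, {"geneC"}),
    ({"geneB"}, {"geneC", "geneD"}),
    ({"geneA"}, {"geneE"}),
    ({"geneE"}, {"geneC"}),
]

rules = derive_rules(annotations, experiments)
for x, y, support, confidence in rules:
    print(f"{x} -> {y}: support={support}, confidence={confidence:.2f}")
```

On this toy input, only the rule from "DNA repair" to "cell cycle" passes both thresholds. A real analysis would need a statistical test against a background model rather than raw counts.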

Public Health Relevance

In recent decades, the biomedical sciences have achieved significant advances; most of the knowledge resulting from this research is stored in the biomedical literature as free text. This project develops computational approaches to extract knowledge from the biomedical literature, represent that knowledge in computable form, and combine it with experimental data to gain insights into biological systems and disease mechanisms.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Special Emphasis Panel (ZLM1-ZH-C (01))
Program Officer
Ye, Jane
University of Pittsburgh
Schools of Medicine
United States
Chen, Vicky; Paisley, John; Lu, Xinghua (2017) Revealing common disease mechanisms shared by tumors of different tissues of origin through semantic representation of genomic alterations and topic modeling. BMC Genomics 18:105
Huang, Tianzhi; Alvarez, Angel A; Pangeni, Rajendra P et al. (2016) A regulatory circuit of miR-125b/miR-20b and Wnt signalling controls glioblastoma phenotypes through FZD6-modulated pathways. Nat Commun 7:12885
Chen, Lujia; Cai, Chunhui; Chen, Vicky et al. (2016) Learning a hierarchical representation of the yeast transcriptomic machinery using an autoencoder model. BMC Bioinformatics 17 Suppl 1:9
Lu, Songjian; Cai, Chunhui; Yan, Gonghong et al. (2016) Signal-Oriented Pathway Analyses Reveal a Signaling Complex as a Synthetic Lethal Target for p53 Mutations. Cancer Res 76:6785-6794
Lu, Songjian; Mandava, Gunasheil; Yan, Gaibo et al. (2016) An exact algorithm for finding cancer driver somatic genome alterations: the weighted mutually exclusive maximum set cover problem. Algorithms Mol Biol 11:11
Lu, Songjian; Lu, Kevin N; Cheng, Shi-Yuan et al. (2015) Identifying Driver Genomic Alterations in Cancers by Searching Minimum-Weight, Mutually Exclusive Sets. PLoS Comput Biol 11:e1004257
Ogoe, Henry A; Visweswaran, Shyam; Lu, Xinghua et al. (2015) Knowledge transfer via classification rules using functional mapping for integrative modeling of gene expression data. BMC Bioinformatics 16:226
Cai, Chunhui; Chen, Lujia; Jiang, Xia et al. (2014) Modeling signal transduction from protein phosphorylation to gene expression. Cancer Inform 13:59-67
Lu, Songjian; Lu, Xinghua (2013) Using graph models to find transcription factor modules: the hitting set problem and an exact algorithm. Algorithms Mol Biol 8:2
Mowrey, David; Cheng, Mary Hongying; Liu, Lu Tian et al. (2013) Asymmetric ligand binding facilitates conformational transitions in pentameric ligand-gated ion channels. J Am Chem Soc 135:2172-80

Showing the most recent 10 out of 21 publications