Literature and Data Driven Hypothesis Generation for High Throughput Experiments

Microarray gene expression analyses are widely used in biomedical research today. Thousands of genes can be assayed in a single experiment, and differences in their expression levels observed across some experimental condition of interest, such as diseased versus healthy tissue. The difficulty is that there is natural variation in gene expression levels, as well as experimental variation between samples and microarrays. In consequence, it is hard to know which observed differences are biologically significant and which are just the result of random fluctuations. It is generally accepted that this problem is best addressed by integrating other sources of biological knowledge, such as co-occurrence in the literature, in the Gene Ontology, or in pre-defined gene sets. However, most techniques still produce only a ranked list of genes or gene clusters, and these still require biological interpretation. A biomedical scientist knows well what to do if a single gene, or a set of genes on a known pathway, is shown to be differentially expressed. The difficulty with interpreting the results of high throughput experiments is that the human effort required does not scale to hundreds of genes and, even worse, human expertise cannot be as deep across such a large set of genes as for a particular gene under careful investigation. Most standard computational approaches use bulk manipulation of candidate genes, performing analyses that no biomedical scientist would conduct if a single gene were at hand. The goal of this project is to emulate computationally, for thousands of candidate genes, what a biomedical scientist would want to do for one gene. This involves bringing to bear biological knowledge, as found in the literature and in public databases, to develop biologically sound hypotheses that could explain the observed differential expression.
Specifically, we will develop techniques to generate putative pathways dynamically, bootstrapping from observed differential expression data, based upon external evidence of relationships from the literature and from interaction databases. In a separate project, not part of this proposal, we have developed techniques for extraction of gene and protein interaction information from biomedical literature, including important information such as the type of interaction and the experimental conditions. We will exploit this extracted information resource, which currently includes the full text of all articles in PubMed Central. The expected output of our algorithm will be a small number of hypothesized pathways that the scientist can choose to evaluate further experimentally.

Public Health Relevance

Bench biology has been transformed through the recent development of high throughput techniques, which permit the scientist to perform thousands of experiments in parallel at low cost. But this has in turn caused interpretation of experimental results to become a bottleneck. This project uses computational techniques to glean biological knowledge from the literature and from public databases to address this challenge.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM010138-02
Application #: 7828233
Study Section: Special Emphasis Panel (ZLM1-AP-E (M3))
Program Officer: Ye, Jane
Project Start: 2009-07-01
Project End: 2012-06-30
Budget Start: 2010-07-01
Budget End: 2012-06-30
Support Year: 2
Fiscal Year: 2010
Total Cost: $591,848
Indirect Cost:

Name: University of Michigan Ann Arbor
Department: Engineering (All Types)
Type: Schools of Engineering
DUNS #: 073133571
City: Ann Arbor
State: MI
Country: United States
Zip Code: 48109

Farfán, Fernando; Ma, Jun; Sartor, Maureen A et al. (2012) THINK Back: KNowledge-based Interpretation of High Throughput data. BMC Bioinformatics 13 Suppl 2:S4
Ma, Jun; Sartor, Maureen A; Jagadish, H V (2011) Appearance frequency modulated gene set enrichment testing. BMC Bioinformatics 12:81