The function of millions of proteins remains unknown, and automated protein function prediction systems have a poor record of performance. We will test hypotheses about protein functional sites by validating high-throughput predictions derived from computational biology techniques through a novel automated system that will mine the literature for targeted information relevant to those predictions. The impact of our work will be to enable large-scale, validated annotation of protein function and, in turn, to facilitate progress in drug discovery for the treatment of disease. High-throughput experiments and bioinformatics techniques are creating an exploding volume of data with which we hope to transcribe the genetic blueprints of life. Targeted experiments are required to validate biomedical discoveries from these sources. Fortunately, the information needed to confirm or refute a prediction is often already available in an existing publication, and the biologist can use this supporting evidence for validation. However, the sheer volume of predictions from high-throughput methods exceeds the capacity of researchers to perform even the necessary literature searches. This gap in capacity must be addressed using automated literature mining methods that perform comparably to a human expert; indeed, development of such methods is a grand challenge of modern biology. We will mine the full-text literature to validate computational predictions of functional sites in proteins. The innovations in our approach include: (1) using computational predictions as the context for a literature search; (2) information extraction of protein functional sites from full-text journal publications; (3) high-throughput text mining; and (4) using primary information in protein databases to evaluate the methods. Understanding of protein function is a critical bottleneck in the progress of biomedical research.
It is time to truly integrate the biological literature into the protein function prediction problem. By doing so, we will enable a critical advance in high-throughput protein function prediction.

Public Health Relevance

The goal of this research is to test hypotheses about protein functional sites by validating high-throughput predictions derived from computational biology techniques. Our approach is to develop a revolutionary system that will automatically mine the literature for targeted information relevant to those predictions. We will produce reliable protein functional site predictions that can in turn be exploited for in silico high-throughput drug design.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 1R01LM010120-01
Application #: 7724794
Study Section: Special Emphasis Panel (ZLM1-AP-E (M3))
Program Officer: Ye, Jane
Project Start: 2009-07-01
Project End: 2011-06-30
Budget Start: 2009-07-01
Budget End: 2010-06-30
Support Year: 1
Fiscal Year: 2009
Total Cost: $721,448
Indirect Cost:

Name: University of Colorado Denver
Department: Pharmacology
Type: Schools of Medicine
DUNS #: 041096314
City: Aurora
State: CO
Country: United States
Zip Code: 80045

Verspoor, Karin; Mackinlay, Andrew; Cohn, Judith D et al. (2013) Detection of protein catalytic sites in the biomedical literature. Pac Symp Biocomput:433-44
Verspoor, Karin M; Cohn, Judith D; Ravikumar, Komandur E et al. (2012) Text mining improves prediction of protein functional sites. PLoS One 7:e32171
Wall, Michael E; Raghavan, Sindhu; Cohn, Judith D et al. (2011) Genome majority vote improves gene predictions. PLoS Comput Biol 7:e1002284
Lu, Zhiyong; Kao, Hung-Yu; Wei, Chih-Hsuan et al. (2011) The gene normalization task in BioCreative III. BMC Bioinformatics 12 Suppl 8:S2
Verspoor, Karin; Roeder, Christophe; Johnson, Helen L et al. (2010) Exploring species-based strategies for gene normalization. IEEE/ACM Trans Comput Biol Bioinform 7:462-71
Cohen, K Bretonnel; Johnson, Helen L; Verspoor, Karin et al. (2010) The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics 11:492