Colorado State University is awarded a grant to develop machine learning methods for predicting protein function. Protein function annotations support the everyday work of biologists in many areas, from biomedical discovery to the study of plant drought resistance and the design of bacteria useful in biofuel production. Assigning function to the proteins in sequenced genomes is a major undertaking. With new organisms being sequenced daily, experimentally determining the function of every protein in those organisms is not practical, so function must be assigned computationally to proteins that have not been studied in the lab.
Computational scientists have worked on function prediction for over two decades. Yet the basic methodology has changed little in that time: it remains "annotation transfer" from proteins of known function, identified with a sequence comparison method such as BLAST. Several properties of the problem make it difficult to apply state-of-the-art machine learning methods: the large number of potential functions (thousands of possible terms), the fact that a protein can have multiple functions, and the hierarchical relationships between terms in the Gene Ontology (GO), the standard system of keywords used to describe protein function.
In this work, annotating proteins with GO terms will be modeled explicitly as a hierarchical classification problem using "kernel methods for structured outputs", a methodology suited to modeling complex prediction problems. It will allow the PIs to integrate a variety of genomic information: sequence data, gene expression, protein-protein interactions, and information mined from the biological literature. The award will lead to a function prediction method with state-of-the-art accuracy.
The project will have broad impact by providing the GOstruct method to the bioinformatics and biology communities as downloadable software and an online function prediction server. It will also contribute to education by incorporating the tool into new courses on programming for biologists and on kernel methods.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Type: Standard Grant (Standard)
Application #: 0965768
Program Officer: Peter H. McCartney
Project Start: (not listed)
Project End: (not listed)
Budget Start: 2010-06-01
Budget End: 2014-05-31
Support Year: (not listed)
Fiscal Year: 2009
Total Cost: $523,303
Indirect Cost: (not listed)
Name: Colorado State University-Fort Collins
Department: (not listed)
Type: (not listed)
DUNS #: (not listed)
City: Fort Collins
State: CO
Country: United States
Zip Code: 80523