Thanks to continuing developments in DNA sequencing technology, we now know the exact genetic makeup ("genome") of thousands of different organisms, encoding millions of different proteins, and these numbers continue to grow rapidly. But simply knowing the chemical specification (the "sequence") of these proteins is only a first step: the ultimate goal is to discover how genes and proteins function to support the diversity of life, and also how some of them can be used for commercial and biotechnology applications. This research project will expand the capability of scientists and their students to advance their analyses from sequences to functions, by bringing together multiple state-of-the-art approaches. Each of these approaches draws on both computational expertise (necessary to address a problem of this magnitude) and broad biological expertise.
The general approach in this project is to classify proteins into families of related proteins and, wherever possible, to describe how each family relates to function. These relations may be very complex, and scientific accuracy will require the application of multiple, diverse methods. To accomplish this aim, the project will expand InterPro, a widely used resource that already contains (though with limited integration) three of the leading databases for protein family and functional classification: PANTHER, Pfam and TIGRFAM. A fourth classification resource, the Structure-Function Linkage Database (SFLD), will also be incorporated into InterPro. These four databases use complementary methodologies to represent and describe protein relationships; the project will integrate these methodologies to address the problem of protein function classification with unprecedented accuracy, precision and ease of use. The products of this work will be used to improve sequence analysis tools that support the scientific community, as well as to provide enhanced educational materials, and will be broadly accessible over the web at http://ebi.ac.uk/interpro.