Technological advances have enabled researchers to determine the chemical composition (sequence) of millions of different proteins from thousands of organisms. However, making use of this information in applications such as medicine and agriculture requires additional work to determine the functions of these proteins, that is, what they do at the molecular and organism levels. The first step in determining the function of a new protein is to compare its sequence to those of other proteins whose functions have been analyzed experimentally. Under previous funding, computational methods were developed for reconstructing the evolutionary history of each group of related proteins, and this history was used to suggest functions that have been conserved during evolution and to provide the starting point for protein function analysis. The next step in this work is to enable application to the millions of protein sequences available, and to extend the method to include more detailed information about protein function. This project will develop and test a practical, production-grade implementation of our method, and apply it to UniProt, the world's largest database of protein sequences. UniProt is publicly available, so the results will be broadly available to, and usable by, scientists and non-scientists alike. Educational materials will be developed to help make the results more accessible to students and non-scientists.
The UniProt protein knowledgebase aims to maximize the utility of protein sequence data to the scientific community by representing not only the sequences themselves, but also annotations: metadata describing information that can be inferred about those sequences, such as predicted protein function. The current approach to large-scale annotation of UniProt, called UniRule, relies on ad hoc rules to define sets of proteins that should be annotated similarly. While these rules implicitly use information about evolutionary relationships (e.g., membership in a protein family), they do not model function evolution explicitly and are therefore limited in the specificity of annotations they can express. This project implements an explicit evolutionary approach to large-scale sequence annotation, building upon previous work 1) on evolutionary modeling of gain and loss of protein functions (represented as terms from the Gene Ontology) in gene families, and 2) on software to reconstruct the evolutionary history of any arbitrary protein sequence by placing it in the context of a phylogenetic tree. Production-level implementation of this approach within the UniProt resource will integrate the large-scale annotation systems already used in the UniProt and Gene Ontology projects, and result in increased specificity and coverage of annotations in the UniProt knowledgebase. The project will significantly improve annotations on tens of millions of sequences in the UniProt knowledgebase, benefiting the large UniProt user base. It will also provide the annotations for Gene Ontology-based analyses of the large number of fully sequenced genomes in UniProt, making such analyses more broadly available. A new online educational module on protein function evolution will be developed, together with a curated learning path that starts with this module and includes other available online modules.
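To illustrate the general idea of modeling gain and loss of functions along a gene-family tree, the following is a minimal sketch and not the project's actual software. The Node class, the propagate function, and the TERM_A and TERM_B labels are all hypothetical names introduced here for illustration. The sketch assumes that each branch of the tree records the function terms gained and lost on it, and that a protein inherits every term gained on the path from the root that has not since been lost.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a gene-family tree (hypothetical structure, for illustration only)."""
    name: str
    gained: set = field(default_factory=set)    # terms gained on the branch leading to this node
    lost: set = field(default_factory=set)      # terms lost on the branch leading to this node
    children: list = field(default_factory=list)


def propagate(node, inherited=frozenset()):
    """Return {leaf name: set of terms} for every leaf at or below `node`."""
    # Terms at this node: what the ancestor had, plus gains, minus losses.
    current = (set(inherited) | node.gained) - node.lost
    if not node.children:
        return {node.name: current}
    result = {}
    for child in node.children:
        result.update(propagate(child, current))
    return result


# Placeholder terms stand in for Gene Ontology terms; they are not real identifiers.
root = Node(
    "root",
    gained={"TERM_A"},
    children=[
        Node(
            "clade1",
            gained={"TERM_B"},
            children=[Node("leaf1"), Node("leaf2", lost={"TERM_A"})],
        ),
        Node("leaf3"),
    ],
)

annotations = propagate(root)
for leaf, terms in sorted(annotations.items()):
    print(leaf, sorted(terms))
```

In this toy tree, leaf2 inherits TERM_B from its clade but not TERM_A, because a loss is recorded on its own branch. A rule that annotated all family members identically could not express that distinction, which is the kind of specificity an explicit evolutionary model is meant to provide.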
The results will be available in the UniProt resource (uniprot.org), and all software and annotation metadata will be available at pantree.org.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.