Genome and metagenome projects have revealed the genetic sequences of millions of proteins, whose biological interpretation requires an understanding of their function. One of the most successful approaches for predicting proteins' functions is the integration of all available functional data with evolutionary relationships in a reconciled phylogenetic tree. This method, known as phylogenomics, has been heralded as highly accurate and conceptually elegant, but its application has been limited by its dependence on painstaking analyses by domain experts. We will enhance, assess, and apply a statistical method for predicting protein function using phylogenomic principles. Our approach, known as SIFTER (Statistical Inference of Function Through Evolutionary Relationships), presently exists as a prototype. In this proposal, we will enhance the core algorithms to take account of domain architecture, to be more consistently statistical in their approach, and to accommodate a larger range of possible functions for proteins. We will improve the key internal parameters of the molecular evolution model and improve the interpretability of the results. We will make the program capable of accepting more typical protein sequences for analysis and of using a wider range of information (including database annotations and sequence and structure motifs) as evidence of function. Ultimately, SIFTER will be capable of incorporating other function prediction approaches within its phylogenetic context. The performance of SIFTER will be rigorously assessed using well-studied families. We will collaborate with major protein databases to deploy SIFTER for medium-scale application in protein annotation. Experimental validation will be essential to truly test SIFTER's performance and, in the process, to enrich our biological understanding of several protein families. We will use SIFTER to make an optimal selection of Nudix proteins for experimental characterization.
In addition to assaying these proteins, we will make blind predictions of the molecular function of proteins being characterized by structural genomics centers, and we will then biochemically characterize promising candidate proteins provided to us. The completed SIFTER system should provide a significant improvement over current approaches for protein function prediction, of direct relevance to nearly all molecular biologists. The significance of this work for public health is clear and immediate: it will unlock protein function information encoded in genome sequences. These methods will allow understanding of proteins implicated in disease and necessary for health, in humans as well as model organisms. Application of SIFTER will also permit detailed understanding of the proteins of pathogens and commensal microbiota. These methods will be a foundation for the further study of any protein identified through genome projects.