A principal need in structural genomics is a methodology for automated protein structure comparison and classification. We previously developed a tool, Simplicial Neighborhood Analysis of Protein Packing (SNAPP), for identifying recurrent sequence-structure motifs in a collection of protein structures. We propose the systematic application of statistical geometry and geometric pattern matching techniques to identify protein family-specific packing patterns (family signatures). We further propose to use these signatures to compare and classify known 3D protein structures. Finally, we aim to demonstrate that some of these structural patterns can be mapped onto the underlying protein sequences to form sequence-specific patterns, and can therefore also be used for sequence annotation and classification. We employ a computational geometry technique known as Delaunay tessellation, which partitions a protein structure into a unique set of quadruplet contacts (tetrahedra of four neighboring residues). This representation reduces tertiary structure to a natural basis set of motifs that may be characteristic of protein structural and functional classes. A broader definition of motifs can be obtained by applying frequent common subgraph mining to collections of protein graphs representing known structural and functional families. To discover motifs specific to structural and functional families and to apply them to protein classification and annotation, this proposal is structured around the following Specific Aims:
Aim 1: Develop novel algorithms to identify protein family-specific packing motifs based on frequent common subgraph mining of protein graph families;
Aim 2: Identify specific amino acid packing motifs in diverse protein families and define them as sequence-specific signatures;
Aim 3: Develop methodologies for protein annotation based on family-specific packing motifs.
This project benefits from the collaborative efforts of four investigators with complementary expertise in structural bioinformatics (Tropsha), computational geometry (Snoeyink), data mining (Wang), and high-performance computing (Prins). The proposed methodologies are expected to be robust and efficient enough to be applied to large, post-genomic scale databases of protein structures and sequences. The proposed studies are expected to lead to the discovery of previously unknown patterns of amino acid residues that are important for protein structure and function. Functional annotation of orphan proteins will expand our knowledge of the human proteome. Because proteins are the most common therapeutic targets, a better understanding of protein structure-function relationships should facilitate the discovery of novel targets for drug therapy and thereby contribute to the improvement of human health.
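For illustration only, the quadruplet contacts produced by Delaunay tessellation can be sketched in a few lines of Python. This is a minimal brute-force sketch, not the SNAPP implementation: it assumes each residue is reduced to a single 3D point, that the points are in general position (no five on a common sphere), and it uses the empty-circumsphere property directly, which is practical only for small inputs. The function names (`det3`, `circumsphere`, `delaunay_quadruplets`, `quadruplet_compositions`) and the example coordinates and residue labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def circumsphere(a, b, c, d):
    """Return (center, radius squared) of the sphere through four 3D points,
    or None if the points are (nearly) coplanar."""
    # Equidistance from a and p gives 2 (p - a) . x = |p|^2 - |a|^2 for p in b, c, d.
    rows = [[2.0 * (p[i] - a[i]) for i in range(3)] for p in (b, c, d)]
    rhs = [sum(p[i] ** 2 - a[i] ** 2 for i in range(3)) for p in (b, c, d)]
    det = det3(rows)
    if abs(det) < 1e-12:
        return None
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / det)
    radius_sq = sum((center[i] - a[i]) ** 2 for i in range(3))
    return center, radius_sq


def delaunay_quadruplets(points, eps=1e-9):
    """Return index 4-tuples whose circumsphere contains no other input point."""
    quads = []
    for quad in combinations(range(len(points)), 4):
        sphere = circumsphere(*(points[i] for i in quad))
        if sphere is None:
            continue
        center, radius_sq = sphere
        empty = all(
            sum((points[j][i] - center[i]) ** 2 for i in range(3)) >= radius_sq - eps
            for j in range(len(points)) if j not in quad
        )
        if empty:
            quads.append(quad)
    return quads


def quadruplet_compositions(points, residues):
    """Count order-independent residue compositions over all Delaunay quadruplets."""
    return Counter(
        tuple(sorted(residues[i] for i in quad))
        for quad in delaunay_quadruplets(points)
    )


# Hypothetical toy input: five residue points with one-letter amino acid labels.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (2, 2, 2)]
residues = ["A", "L", "V", "I", "G"]
compositions = quadruplet_compositions(coords, residues)
```

Counting compositions across many structures in a family is one simple way to look for recurrent quadruplets; for realistic protein sizes an optimized tessellation routine would be needed in place of this exhaustive search.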