Despite rapid progress in structural bioinformatics, the current toolbox for analyzing, classifying, and organizing biomolecules, individually or in groups, lacks a rigorous and unifying mathematical and statistical framework. We have recently developed such a framework, based on elastic shape analysis (ESA), for comparing protein and RNA structures. Under this framework, the formal geodesic distance between any two protein/RNA structures can be computed rapidly. Probability distributions can also be built for families of protein/RNA structures and used to classify structures in a principled way through statistical hypothesis testing. In addition, sequence information can be naturally incorporated, so that structures can be compared in the joint sequence-structure space. We have also developed novel algorithms for matching and analyzing protein surfaces. We propose to develop these methodologies significantly further for important applications in structural biology, including the study of chromosome structures by combining 3D structure and sequence-level information. The proposed research will make significant contributions in the following areas: (1) This proposal will fill an important gap in structural biology: the lack of a rigorous mathematical and statistical framework for biomolecular structure comparison; (2) Our proposed unifying framework will allow natural incorporation of sequence information for structure comparison; (3) Our approach can uncover distinct clusters at the deepest level of the current classification scheme (i.e., SCOP family), enabling a finer classification of biomolecular structures.
Preliminary results indicate that, by using carefully measured structural similarity, we will obtain representative sets of proteins of higher quality than those produced by current sequence-similarity-based methods; (4) The probabilistic models designed for protein/RNA backbone structures and surfaces will capture the flexible nature of protein structures through the use of ensembles of conformations, while maintaining high computational efficiency. These models will also enable effective characterization of family-specific variations among proteins, an important task that none of the existing methods performs well; (5) Protein/RNA structures will be organized in network-based data structures using probabilistic approaches. This new organization will effectively integrate sequence, backbone structure, and surface information, facilitating the discovery of novel insights; and (6) These new developments will be rapidly generalized to the study of chromosome structures. The proposed research will enable the development of tools that are also applicable in other areas of shape analysis, including medical image analysis, computer vision, and pattern recognition. Our work will help increase communication between the fields of protein structure analysis and shape analysis, stimulate more cross-over development in methodology, and transform research activities in both fields.
Analyzing, classifying, and organizing biomolecules are fundamental to understanding their sequence-structure-function relationships. In this project, we aim to develop rigorous and unifying mathematical and statistical frameworks for these tasks and apply them to the study of proteins, RNAs, and chromosomes.