Large-scale DNA sequencing efforts such as the Human Genome Project generate massive amounts of protein sequence data. However, the number of new protein sequences is growing exponentially, far outpacing the rate at which protein structures are solved by experimental methods. Once a protein sequence has been determined, deducing its unique three-dimensional native structure is a daunting task. Experimental methods for determining detailed protein structure, such as x-ray diffraction and nuclear magnetic resonance analysis, are highly labor intensive. Computational approaches offer a considerably faster and cheaper alternative. However, despite substantial efforts, the protein folding problem remains largely unsolved.
Prediction of the three-dimensional structure of a protein benefits greatly from information about its secondary structure, solvent accessibility, and the non-local bonds that stabilize the structure. Predicting these components is therefore vital to understanding the structure and function of a protein. In this context, the investigators focus on protein secondary structure prediction, beta-sheet topology prediction, and contact map prediction. They study a Bayesian beta-sheet model to characterize the non-local interactions in beta-sheets. Starting from the secondary structure and solvent accessibility predictions, the model scores possible beta-sheet architectures by considering the grouping of beta-strands into beta-sheets, the spatial arrangement of beta-strands in each sheet, the interaction types of the beta-strands, and the pairing patterns of the amino acid residues on interacting segment pairs.
The investigators also improve the accuracy of secondary structure predictions by modeling the non-local interactions in beta-sheets. This is achieved by a two-stage approach. First, suboptimal segmentations of secondary structure are generated from a hidden semi-Markov model. Second, the score of each segmentation is updated using the beta-sheet model. The final prediction is then computed by applying a voting procedure.
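
The rescoring and voting steps described above can be illustrated with a minimal Python sketch. It is not the investigators' implementation. The function names, the additive combination of log-scores, the `weight` parameter, the softmax-weighted per-residue vote, and the toy strand-partner penalty are all assumptions made for illustration; the abstract does not specify how scores are combined or how votes are counted. Candidate segmentations are assumed to be given as equal-length label strings (H = helix, E = strand, L = loop) paired with log-scores from the hidden semi-Markov model.

```python
import math
from collections import defaultdict
from itertools import groupby


def count_strands(labels):
    """Count maximal runs of 'E' (beta-strand segments) in a label string."""
    return sum(1 for label, _ in groupby(labels) if label == "E")


def toy_sheet_score(labels):
    """Hypothetical stand-in for a beta-sheet model (assumption).

    Penalizes a segmentation with exactly one strand, since a strand
    needs at least one partner to form a sheet.
    """
    return -2.0 if count_strands(labels) == 1 else 0.0


def rescore_and_vote(candidates, sheet_score, weight=1.0):
    """Rescore candidate segmentations, then vote per residue.

    candidates: list of (labels, log_score) pairs of equal length.
    sheet_score: function mapping a label string to a log-score term.
    """
    # Update each segmentation score with the sheet term (assumed additive).
    rescored = [(labels, s + weight * sheet_score(labels))
                for labels, s in candidates]

    # Turn log-scores into normalized voting weights (stable softmax).
    top = max(s for _, s in rescored)
    raw = [math.exp(s - top) for _, s in rescored]
    total = sum(raw)

    # Each residue position takes the label with the largest total weight.
    length = len(rescored[0][0])
    prediction = []
    for i in range(length):
        votes = defaultdict(float)
        for (labels, _), w in zip(rescored, raw):
            votes[labels[i]] += w / total
        prediction.append(max(votes, key=votes.get))
    return "".join(prediction)


candidates = [
    ("LLEEELLLLLLL", -1.0),  # one isolated strand, best initial score
    ("LLEEELLLEEEL", -2.0),  # two strands that could pair
]
without_sheet = rescore_and_vote(candidates, lambda labels: 0.0)
with_sheet = rescore_and_vote(candidates, toy_sheet_score)
```

In this toy input, the segmentation with a single unpaired strand dominates the vote when only the initial scores are used, while the sheet term shifts the vote toward the segmentation whose two strands could pair. A realistic sheet model would score strand groupings, arrangements, and residue pairings rather than a simple strand count.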