This proposal's objective is to develop a new class of statistical models to advance scientific knowledge of protein tertiary structure and to extend template-based modeling to protein loop regions. As an advance in basic science, the improved modeling of protein structure will broadly impact biomedical fields. The following specific aims will be accomplished.
The first aim (Random Partition Models Indexed by Pairwise Information) is to develop probability models for partitions that are explicitly non-exchangeable, utilizing available pairwise information to influence the clustering of data. Four distributions are proposed, each using the pairwise information by modifying identities from the Chinese Restaurant Process, a popular probability model for clustering. Hierarchical clustering uses pairwise distances, but current methods for protein structure modeling do not. The proposed method provides a means to incorporate this type of information into Bayesian nonparametric models for protein structure.
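To make the idea of indexing a partition model by pairwise information concrete, the following is a minimal sketch and not one of the four proposed distributions, which are not specified here. It assumes a sequential seating scheme in the style of the Chinese Restaurant Process in which the weight for joining an existing cluster is the summed similarity `exp(-decay * d_ij)` to that cluster's members rather than the cluster size. The function name `sample_partition` and the parameters `alpha` and `decay` are hypothetical, introduced only for illustration.

```python
import math
import random


def sample_partition(distances, alpha=1.0, decay=1.0, rng=None):
    """Draw one partition by seating items one at a time.

    distances: symmetric n x n list of lists of pairwise distances (assumed input).
    alpha:     weight for opening a new cluster, as in the standard CRP; must be > 0.
    decay:     hypothetical rate turning a distance into a similarity.
    Returns a list of cluster labels, one per item, numbered from 0.
    """
    rng = rng or random.Random()
    labels = []
    for i in range(len(distances)):
        n_clusters = max(labels) + 1 if labels else 0
        # Weight of each existing cluster: summed similarity of item i
        # to the items already assigned to that cluster.
        weights = [0.0] * n_clusters
        for j, c in enumerate(labels):
            weights[c] += math.exp(-decay * distances[i][j])
        # Last slot: weight alpha for starting a new cluster.
        weights.append(alpha)
        labels.append(rng.choices(range(n_clusters + 1), weights=weights)[0])
    return labels


if __name__ == "__main__":
    # Toy distances: items 0 and 1 are close, item 2 is far from both.
    D = [[0.0, 0.1, 5.0],
         [0.1, 0.0, 5.0],
         [5.0, 5.0, 0.0]]
    print(sample_partition(D, alpha=1.0, decay=1.0, rng=random.Random(0)))
```

When every distance is zero, each similarity equals 1, the cluster weight reduces to the cluster size, and the sketch coincides with the standard Chinese Restaurant Process. With unequal distances, the distribution over partitions generally depends on the order in which items are seated, which is one way a model can be non-exchangeable.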
The second aim (Template-Based Modeling of Loop Conformation Space Using Partition Models) applies the proposed random partition models to loop modeling. This proposal will improve our previous estimation approach by accounting for the influences of individual amino acids as well as for influences from neighboring residues. New methods based on the random partition models will provide rigorous statistical modeling at and between residue positions, allowing one to limit and precisely sample the conformational space. This will in turn allow for a clearer understanding of the roles of loops in catalytic sites and protein signaling.
The final aim (New Paradigm for Protein Packing and Higher-Order Structure Using Partition Models) applies the statistical modeling to estimate the propensities of a new model of protein packing called the "ball/socket." Statistical modeling of the amino acid propensities within the "ball/socket" motifs and between patterns of motifs will allow insights into the rules governing packing, filling a substantial gap in current understanding of protein structure. The statistical model estimating these propensities will exploit the known pairwise information by using the proposed random partition models. Such analysis is currently not available to the scientific community.

Public Health Relevance

More accurate modeling of protein structure from sequence will greatly aid the biomedical community in better understanding disease states. Moreover, producing accurate models of protein structure directly from sequence leverages the vast amounts of genetic information produced by the many genome projects. Accurate protein structure modeling also informs drug discovery by prioritizing targets.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Special Emphasis Panel (ZGM1-CBCB-5 (BM))
Program Officer
Wehrle, Janna P
Brigham Young University
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States
Li, Qiwei; Dahl, David B; Vannucci, Marina et al. (2014) Bayesian model of protein primary sequence for secondary structure prediction. PLoS One 9:e109832
Joo, Hyun; Tsai, Jerry (2014) An amino acid code for β-sheet packing structure. Proteins 82:2128-40
Day, Ryan; Joo, Hyun; Chavan, Archana C et al. (2013) Understanding the general packing rearrangements required for successful template based modeling of protein structure from a CASP experiment. Comput Biol Chem 42:40-8