The secondary and tertiary structure of a protein has long been thought to be determined by its primary amino acid sequence. Yet the rules relating structure to sequence have resisted explanation. Our approach to this problem begins with the growing database of crystallographically determined protein structures and looks for statistical patterns that might give clues to this unexplained relationship. We have developed two approaches for secondary structure prediction from local sequence. One approach incorporates contributions from all 400 (=20x20) amino acid residue pairs to predict alpha-helix, beta-sheet or coil states. After extensive computation, we have shown this approach to attain up to 66% prediction accuracy. The other approach uses a nonparametric statistical technique to estimate the probability of each structural state given its local sequence and attains nearly the same result. Further improvement in prediction accuracy seems to be limited primarily by the size of the database, not by the flexibility of the prediction model.

Computer graphics representations of a protein are a valuable and beautiful tool for understanding the 3-dimensional relationships between its components. Yet such representations are not ideally suited for understanding the general patterns of structure underlying an entire database of proteins. We are investigating several alternative representations of protein structure (contact maps, contact graphs, C-alpha distance plots). Such representations can be analyzed visually to construct a novel classification of proteins in the Protein Data Bank. Successful prediction of protein class, domain type or motif from sequence can almost certainly aid secondary structure prediction, but first an optimized, objective mathematical definition of protein class is required. This project is a first step in this direction.

The organization of the bases in DNA on long-distance scales is often thought to be essentially random. Yet some investigators (Nature 356:168, 1992) have recently noted correlations in base usage at ranges much larger than the length of a single gene. We extended the length range over which correlations can be observed by looking at the first fully sequenced yeast chromosome. Apparent correlations exist at ranges of up to 65 kilobases and are statistically significant up to 8 kilobases. These correlations may reflect some feature of the evolution of the chromosome or of DNA packing constraints. By identifying these correlations, one expects to identify previously neglected features of genome organization and improve prediction of genetically coded structures.
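
As an illustration only, a correlation in base usage at a given range can be sketched as the autocovariance of a binary indicator along the sequence. The abstract does not state which statistic was used, so the function name `base_autocorrelation`, the purine (A/G) indicator, and the plain autocovariance estimator below are all assumptions made for this sketch, not the project's method.

```python
def base_autocorrelation(seq, lag, bases="AG"):
    """Autocovariance, at separation `lag`, of an indicator that is 1 where
    the base belongs to `bases` and 0 elsewhere.

    Hypothetical helper for illustration: the purine indicator ("AG") and
    this estimator are assumptions, not taken from the project description.
    """
    if lag < 1 or lag >= len(seq):
        raise ValueError("lag must satisfy 1 <= lag < len(seq)")

    # Encode the sequence as 0/1 values (1 = base is in `bases`).
    x = [1.0 if b in bases else 0.0 for b in seq.upper()]

    # Pair every position i with position i + lag.
    n = len(x) - lag
    head = x[:n]
    tail = x[lag:]

    mean_head = sum(head) / n
    mean_tail = sum(tail) / n
    mean_product = sum(h * t for h, t in zip(head, tail)) / n

    # Positive values: like bases tend to recur at this separation.
    # Values near zero: no detectable correlation at this separation.
    return mean_product - mean_head * mean_tail
```

Evaluating such a function over a range of lags gives a correlation profile. Deciding where the values stop being statistically significant would additionally require a null model, for example repeated shuffles of the same sequence, which this sketch does not include.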

Agency: National Institutes of Health (NIH)
Institute: Center for Information Technology (CIT)
Type: Intramural Research (Z01)
Project #: 1Z01CT000226-03
Application #: 3774962
Study Section:
Project Start:
Project End:
Budget Start:
Budget End:
Support Year: 3
Fiscal Year: 1993
Total Cost:
Indirect Cost:

Name: Center for Information Technology
Department:
Type:
DUNS #:
City:
State:
Country: United States
Zip Code: