The long-term goal of our research is to understand the flow of information from the genome to the phenotype of organisms. In this proposal, we will attempt to use Bayesian networks and near-optimal sequence alignments to represent protein secondary structures and motifs. A Bayesian network describes the likelihood of amino acids at each position in a motif as well as the dependence of the amino acids at one position on the amino acids at other positions. Hence, Bayesian networks can simultaneously describe both the conservation of amino acids at single positions and the conservation of correlations between two positions. Conserved amino acids result from evolutionary selection for a specific amino acid or type of amino acid at one position in a protein structure. These positions often have important functional or structural requirements. Correlated changes between amino acids generally result from interactions between the side chains of pairs of amino acids in a protein's structure. The types of correlations we have represented with Bayesian networks include electrostatic charges, hydrophobicity, hydrogen-bond donors and acceptors, and inversely correlated packing volumes, among others. These Bayesian networks can be used to 1) discover side-chain interactions within protein motifs and 2) search sequence databases for motifs showing both correlations and conserved amino acids. Near-optimal alignments between two sequences can display regions that have been more highly or less highly conserved, using the information contained in only two sequences. The most highly conserved regions correspond to the most highly structured regions, and the most highly variable regions correspond to loops, coils, and other hypervariable regions. We propose to use near-optimal alignments to display conserved secondary structures of proteins and hypervariable regions. We will use secondary-structure-specific amino acid substitution matrices to provide specificity.
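As an illustration of how a network of this kind can capture both single-position conservation and a pairwise correlation, here is a minimal Python sketch. It is not the proposal's implementation: the function names, the toy three-residue motif, the restriction to at most one parent position per position, and the pseudocount are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

N_AMINO_ACIDS = 20  # size of the standard amino acid alphabet


def fit_motif_network(aligned, parents, pseudocount=1.0):
    """Count amino acids per position, conditioned on a parent position.

    aligned: equal-length motif instances (hypothetical toy data).
    parents: dict mapping a position to the one position it depends on;
             positions missing from the dict are modeled as independent.
    """
    tables = []
    for pos in range(len(aligned[0])):
        counts = defaultdict(Counter)
        parent = parents.get(pos)
        for seq in aligned:
            # The context is the parent's residue, or "" if no parent.
            context = seq[parent] if parent is not None else ""
            counts[context][seq[pos]] += 1
        tables.append(counts)
    return {"tables": tables, "parents": parents, "pseudocount": pseudocount}


def log_prob(model, seq):
    """Log-probability of seq: sum of log P(residue | parent residue)."""
    pc = model["pseudocount"]
    total = 0.0
    for pos, counts in enumerate(model["tables"]):
        parent = model["parents"].get(pos)
        context = seq[parent] if parent is not None else ""
        row = counts.get(context, Counter())
        n = sum(row.values())
        # Pseudocounts keep unseen residues from getting probability zero.
        p = (row[seq[pos]] + pc) / (n + pc * N_AMINO_ACIDS)
        total += math.log(p)
    return total


# Toy motif: positions 0 and 2 always carry opposite charges (K vs E).
toy = ["KAE", "KGE", "EAK", "EGK", "KAE", "EAK"]
with_dependency = fit_motif_network(toy, parents={2: 0})
independent = fit_motif_network(toy, parents={})

for name, model in [("dependent", with_dependency), ("independent", independent)]:
    print(name, log_prob(model, "KAE"), log_prob(model, "KAK"))
```

In this toy data, K and E are equally frequent at position 2, so the independent model cannot tell the charge-paired "KAE" from the like-charged "KAK". The model with an edge from position 0 to position 2 scores "KAE" higher, which is the kind of correlation the abstract attributes to side-chain interactions.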
The goals of this proposal are to 1) build a database of Bayesian networks that represent protein motifs, 2) test these networks for their ability to detect motifs using test sets and cross-validation methods, 3) compare these networks with other methods for searching protein databases, 4) build an integrated set of Bayesian networks to predict protein secondary structure, 5) compare the prediction of protein secondary structure with existing methods, 6) build a near-optimal sequence alignment workbench, and 7) predict structured and unstructured regions in proteins from near-optimal alignments.
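One common way to find near-optimal alignments between two sequences is to combine a forward and a backward dynamic-programming pass and keep every residue pair that lies on some alignment scoring within a tolerance of the optimum. The Python sketch below illustrates that idea under simplifying assumptions that do not come from the proposal: global alignment, a linear gap penalty, and a plain match/mismatch score standing in for the secondary-structure-specific substitution matrices. The function names and the `delta` parameter are hypothetical.

```python
def nw_matrix(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch table: F[i][j] is the best score for a[:i] vs b[:j]."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F


def near_optimal_pairs(a, b, delta=0, match=1, mismatch=-1, gap=-2):
    """Return the optimal score and all residue pairs (i, j) that lie on
    some global alignment scoring within delta of that optimum."""
    F = nw_matrix(a, b, match, mismatch, gap)
    # Aligning the reversed sequences gives the best scores for suffixes.
    R = nw_matrix(a[::-1], b[::-1], match, mismatch, gap)
    best = F[len(a)][len(b)]
    pairs = []
    for i in range(len(a)):
        for j in range(len(b)):
            s = match if a[i] == b[j] else mismatch
            # Best alignment forced to pair a[i] with b[j]:
            # best prefix score + this pair + best suffix score.
            through = F[i][j] + s + R[len(a) - i - 1][len(b) - j - 1]
            if through >= best - delta:
                pairs.append((i, j))
    return best, pairs


best, pairs = near_optimal_pairs("ACDE", "ACDE", delta=0)
print(best, pairs)
```

Positions that keep a single pairing as `delta` grows are aligned robustly, while positions that acquire many alternative pairings are aligned ambiguously. That contrast is one way to picture the conserved and hypervariable regions described above.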

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 3R01LM005716-05S1
Application #: 6146063
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Project Start: 1994-09-01
Project End: 1999-12-06
Budget Start: 1999-09-01
Budget End: 1999-12-06
Support Year: 5
Fiscal Year: 1999
Total Cost:
Indirect Cost:

Name: Stanford University
Department: Biochemistry
Type: Schools of Medicine
DUNS #: 800771545
City: Stanford
State: CA
Country: United States
Zip Code: 94305
Liu, X; Brutlag, D L; Liu, J S (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput :127-38
Schmidler, S C; Liu, J S; Brutlag, D L (2000) Bayesian segmentation of protein secondary structure. J Comput Biol 7:233-48
Wu, T D; Nevill-Manning, C G; Brutlag, D L (2000) Fast probabilistic analysis of sequence function using scoring matrices. Bioinformatics 16:233-44
Wu, T D; Nevill-Manning, C G; Brutlag, D L (1999) Minimal-risk scoring matrices for sequence analysis. J Comput Biol 6:219-35
Singh, A P; Latombe, J C; Brutlag, D L (1999) A motion planning approach to flexible ligand binding. Proc Int Conf Intell Syst Mol Biol :252-61
Wu, T D; Schmidler, S C; Hastie, T et al. (1998) Modeling and superposition of multiple protein structures using affine transformations: analysis of the globins. Pac Symp Biocomput :509-20
Brutlag, D L (1998) Genomics and computational molecular biology. Curr Opin Microbiol 1:340-5
Nevill-Manning, C G; Wu, T D; Brutlag, D L (1998) Highly specific protein sequence motifs for genome analysis. Proc Natl Acad Sci U S A 95:5865-71
Nevill-Manning, C G; Sethi, K S; Wu, T D et al. (1997) Enumerating and ranking discrete motifs. Proc Int Conf Intell Syst Mol Biol 5:202-9
Singh, A P; Brutlag, D L (1997) Hierarchical protein structure superposition using both secondary structure and atomic representations. Proc Int Conf Intell Syst Mol Biol 5:284-93

Showing the most recent 10 out of 13 publications