Multiple Representations of Biological Sequences

Brutlag, Douglas

Abstract

The long-term goal of our research is to understand the flow of information from the genome to the phenotype of organisms. In this proposal, we will attempt to use Bayesian networks and near-optimal sequence alignments to represent protein secondary structures and motifs. A Bayesian network describes the likelihood of amino acids at each position in a motif as well as the dependence of amino acids in one position on the amino acids at other positions. Hence, Bayesian networks can simultaneously describe both the conservation of amino acids at single positions and the conservation of correlations between two positions. Conserved amino acids result from evolutionary selection for a specific amino acid or type of amino acid at one position in a protein structure. These positions often have important functional or structural requirements. Correlated changes between amino acids generally result from interactions between the side chains of pairs of amino acids in a protein's structure. The types of correlations we have represented with Bayesian networks include electrostatic charges, hydrophobicity, hydrogen-bond donors and acceptors, and inversely correlated packing volumes, among others. These Bayesian networks can be used 1) to discover side-chain interactions within protein motifs and 2) to search sequence databases for motifs showing both correlations and conserved amino acids. Near-optimal alignments between two sequences can display regions that have been more highly or less highly conserved, using the information contained in only two sequences. The most highly conserved regions correspond to the most highly structured regions, and the most highly variable regions correspond to loops, coils, and other hypervariable regions. We propose to use near-optimal alignments to display conserved secondary structures of proteins and hypervariable regions. We will use secondary-structure-specific amino acid substitution matrices to provide specificity.
The goals of this proposal are to 1) build a database of Bayesian networks that represent protein motifs, 2) test these networks for their ability to detect motifs using test sets and cross-validation methods, 3) compare these networks with other methods for searching protein databases, 4) build an integrated set of Bayesian networks to predict protein secondary structure, 5) compare the prediction of protein secondary structure with existing methods, 6) build a near-optimal sequence alignment workbench, and 7) predict structured and unstructured regions in proteins from near-optimal alignments.
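
As an illustration only, the idea of scoring a motif with both per-position conservation and a dependence between two positions can be sketched in a few lines of Python. This is not the project's software: the three-residue motif instances, the network structure (position 2 depending on position 0), the pseudocount value, and all function names are assumptions made up for the example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def marginal(seqs, pos, pseudo=1.0):
    """Estimate P(residue at pos) with additive pseudocounts."""
    counts = Counter(s[pos] for s in seqs)
    total = len(seqs) + pseudo * len(ALPHABET)
    return {a: (counts[a] + pseudo) / total for a in ALPHABET}

def conditional(seqs, child, parent, pseudo=1.0):
    """Estimate P(residue at child | residue at parent) with pseudocounts."""
    table = {}
    for p in ALPHABET:
        rows = [s for s in seqs if s[parent] == p]
        counts = Counter(s[child] for s in rows)
        total = len(rows) + pseudo * len(ALPHABET)
        table[p] = {a: (counts[a] + pseudo) / total for a in ALPHABET}
    return table

def log_likelihood(seq, marginals, conditionals, parents):
    """Sum log-probabilities over positions; parents maps child -> parent."""
    score = 0.0
    for pos, residue in enumerate(seq):
        if pos in parents:
            score += math.log(conditionals[pos][seq[parents[pos]]][residue])
        else:
            score += math.log(marginals[pos][residue])
    return score

# Hypothetical aligned motif instances: position 1 is conserved (G), while
# positions 0 and 2 carry opposite charges (K with E, or E with K).
motifs = ["KGE", "EGK", "KGE", "EGK", "KGE", "EGK"]
parents = {2: 0}  # assumed structure: position 2 depends on position 0

marginals = {i: marginal(motifs, i) for i in range(3)}
conditionals = {2: conditional(motifs, 2, 0)}

paired = log_likelihood("KGE", marginals, conditionals, parents)
unpaired = log_likelihood("KGK", marginals, conditionals, parents)
independent = log_likelihood("KGK", marginals, {}, {})

print(f"network score, KGE: {paired:.3f}")
print(f"network score, KGK: {unpaired:.3f}")
print(f"independent-position score, KGK: {independent:.3f}")
```

In this toy data set, a model that treats positions independently assigns KGE and KGK the same score, because K and E are equally frequent at positions 0 and 2. The model with one dependency scores the charge-complementary KGE higher than KGK, which is the kind of correlated signal the abstract describes.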

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM005716-04
Application #: 2519669
Study Section: Biomedical Library and Informatics Review Committee (BLR)

Project Start: 1994-09-01
Project End: 1999-08-31
Budget Start: 1997-09-01
Budget End: 1998-08-31
Support Year: 4
Fiscal Year: 1997

Institution

Name: Stanford University
Department: Biochemistry
Type: Schools of Medicine
DUNS #: 800771545

City: Stanford
State: CA
Country: United States
Zip Code: 94305

Related projects


NIH 1999 R01 LM	Multiple Representations of Biological Sequences Brutlag, Douglas L. / Stanford University
NIH 1998 R01 LM	Multiple Representations of Biological Sequences Brutlag, Douglas L. / Stanford University
NIH 1997 R01 LM	Multiple Representations of Biological Sequences Brutlag, Douglas L. / Stanford University
NIH 1996 R01 LM	Multiple Representations of Biological Sequences Brutlag, Douglas L. / Stanford University
NIH 1995 R01 LM	Multiple Representations of Biological Sequences Brutlag, Douglas L. / Stanford University
NIH 1994 R01 LM	Multiple Representations of Biological Sequences Brutlag, Douglas L. / Stanford University

Publications

Liu, X; Brutlag, D L; Liu, J S (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput :127-38

Schmidler, S C; Liu, J S; Brutlag, D L (2000) Bayesian segmentation of protein secondary structure. J Comput Biol 7:233-48

Wu, T D; Nevill-Manning, C G; Brutlag, D L (2000) Fast probabilistic analysis of sequence function using scoring matrices. Bioinformatics 16:233-44

Wu, T D; Nevill-Manning, C G; Brutlag, D L (1999) Minimal-risk scoring matrices for sequence analysis. J Comput Biol 6:219-35

Singh, A P; Latombe, J C; Brutlag, D L (1999) A motion planning approach to flexible ligand binding. Proc Int Conf Intell Syst Mol Biol :252-61

Wu, T D; Schmidler, S C; Hastie, T et al. (1998) Modeling and superposition of multiple protein structures using affine transformations: analysis of the globins. Pac Symp Biocomput :509-20

Brutlag, D L (1998) Genomics and computational molecular biology. Curr Opin Microbiol 1:340-5

Nevill-Manning, C G; Wu, T D; Brutlag, D L (1998) Highly specific protein sequence motifs for genome analysis. Proc Natl Acad Sci U S A 95:5865-71

Nevill-Manning, C G; Sethi, K S; Wu, T D et al. (1997) Enumerating and ranking discrete motifs. Proc Int Conf Intell Syst Mol Biol 5:202-9

Singh, A P; Brutlag, D L (1997) Hierarchical protein structure superposition using both secondary structure and atomic representations. Proc Int Conf Intell Syst Mol Biol 5:284-93

Showing the most recent 10 out of 13 publications
