A new Gibbs sampling algorithm is described that detects motif-encoding regions in sequences and optimally partitions them into distinct motif models; this is illustrated using a set of immunoglobulin fold proteins. This algorithm extends previous work in this area in three ways: 1) The requirement to specify the number of motifs in each sequence has been relaxed. 2) The length of the motif is now determined automatically by the algorithm. 3) A non-parametric test for the significance of the alignment has been developed. When applied to sequences sharing a single motif, the sampler can be used to classify regions into related submodels, as is illustrated using helix-turn-helix DNA-binding proteins. This feature permits the algorithm to simultaneously align the sequences and classify segments into submodels. Other statistically based procedures are described for searching a database for sequences matching motifs found by the sampler. When applied to a set of thirty-two very distantly related bacterial integral outer membrane proteins, the sampler revealed that they share a subtle, repetitive motif. The broad conservation and structural location of these repeats suggest that they play important functional roles.
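
For orientation, the following is a minimal sketch of a basic Gibbs site sampler. It is not the algorithm described in this project. It assumes a fixed motif width and exactly one site per sequence, which are the two restrictions the abstract says are relaxed. The function names (`column_profile`, `gibbs_site_sampler`), the pseudocount, the iteration count, and the toy DNA sequences in the usage note are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def column_profile(segments, width, alphabet, pseudocount=1.0):
    """Per-column residue frequencies of the aligned segments, with pseudocounts."""
    total = len(segments) + pseudocount * len(alphabet)
    columns = []
    for j in range(width):
        counts = Counter(seg[j] for seg in segments)
        columns.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return columns


def gibbs_site_sampler(seqs, width, iterations=200, seed=0):
    """Sample one motif start position per sequence (fixed width, one site each)."""
    rng = random.Random(seed)
    residues = "".join(seqs)
    alphabet = sorted(set(residues))

    # Background frequencies estimated from all residues, with a pseudocount of 1.
    bg_counts = Counter(residues)
    background = {
        a: (bg_counts[a] + 1.0) / (len(residues) + len(alphabet)) for a in alphabet
    }

    # Start from a random site in every sequence.
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]

    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build the motif model from every sequence except the current one.
            others = [
                seqs[k][positions[k]:positions[k] + width]
                for k in range(len(seqs))
                if k != i
            ]
            model = column_profile(others, width, alphabet)

            # Weight each candidate start by its model-to-background likelihood ratio.
            weights = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for j in range(width):
                    residue = seq[start + j]
                    ratio *= model[j][residue] / background[residue]
                weights.append(ratio)

            # Draw the new site in proportion to the weights (the Gibbs step).
            positions[i] = rng.choices(range(len(weights)), weights=weights)[0]

    return positions
```

A call such as `gibbs_site_sampler(["ACGTTGACCA", "TTGACAGGCA", "CCATTGACGT"], 5)` returns one start position per sequence. Because each step samples rather than maximizes, a single run is not guaranteed to settle on the best alignment, so several runs with different seeds are usually compared.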

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM000067-03
Application #
6162804
Study Section
Special Emphasis Panel (CBB)
Support Year
3
Fiscal Year
1997
Name
National Library of Medicine
Country
United States