The need for rigorous statistics in sequence analysis is now generally conceded, particularly in light of the success of the BLAST suite of programs at NCBI. Insertions and deletions in proteins pose statistical problems in sequence matching, and these problems are at present only partially solved. Classifying proteins into protein families has been shown to improve detection of distant homologs in the protein database, because it provides a broader picture of motif conservation in a particular protein. Several approaches to protein classification are presently available. Andy Neuwald has pursued a strategy that uses Gibbs sampling to analyze the motifs in a protein family, but until recently the Gibbs sampler could not take advantage of gapping information. This information can be described as follows. The distances between different sequence elements in a protein motif usually reflect loops between conserved secondary structure elements in the protein, and these loops are known to often have a tight, well-defined length distribution. The gaps between the motif elements can be included in match scores, but an assessment of the statistical significance of the resulting scores has largely been lacking up to now. With improved combinatorial computational techniques, we discovered some new relationships between distant members of a protein family called the "AAA+" family. This discovery indicates that current methods of similarity detection are overlooking information in the gap lengths between protein motifs.
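
The idea of including gap lengths in a match score can be illustrated with a minimal sketch. It is not the project's method: it assumes a log-odds score for each gap length, with a pseudocount-smoothed empirical distribution as the foreground and a uniform distribution as the background. The function names `gap_log_odds` and `combined_score`, the `pseudocount` parameter, and the `max_gap` cutoff are all hypothetical, and the sketch says nothing about the statistical significance of the resulting scores.

```python
import math
from collections import Counter


def gap_log_odds(observed_gaps, max_gap, pseudocount=1.0):
    """Return a log2-odds table indexed by gap length 0..max_gap.

    Foreground: empirical gap-length frequencies from aligned family
    members, smoothed with a pseudocount (an assumption of this sketch).
    Background: uniform over 0..max_gap (also an assumption).
    """
    counts = Counter(observed_gaps)
    n_lengths = max_gap + 1
    total = len(observed_gaps) + pseudocount * n_lengths
    background = 1.0 / n_lengths
    return [
        math.log2(((counts[g] + pseudocount) / total) / background)
        for g in range(n_lengths)
    ]


def combined_score(element_scores, gaps, gap_tables):
    """Add the gap log-odds terms to the summed motif element scores.

    element_scores: one score per motif element, in sequence order.
    gaps: observed distance between each pair of consecutive elements.
    gap_tables: one log-odds table per inter-element gap.
    """
    if len(gaps) != len(element_scores) - 1 or len(gap_tables) != len(gaps):
        raise ValueError("need exactly one gap and one table per adjacent pair")
    total = sum(element_scores)
    for gap, table in zip(gaps, gap_tables):
        if not 0 <= gap < len(table):
            # Gap lengths outside the modeled range are treated as impossible.
            return float("-inf")
        total += table[gap]
    return total


# Hypothetical loop lengths, tightly clustered around 5 residues.
table = gap_log_odds([5, 5, 5, 6, 4, 5], max_gap=20)
typical = combined_score([10.0, 8.0], [5], [table])
atypical = combined_score([10.0, 8.0], [15], [table])
```

In this sketch a candidate whose motif elements are separated by a typical loop length scores higher than one with the same element scores and an atypical spacing, which is the information that a gap-blind method discards.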

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM000081-02
Application #
6111080
Study Section
Special Emphasis Panel (CBB)
Project Start
Project End
Budget Start
Budget End
Support Year
2
Fiscal Year
1998
Total Cost
Indirect Cost
Name
National Library of Medicine
Department
Type
DUNS #
City
State
Country
United States
Zip Code