This project is a continuing study of which similarities can be expected to occur purely by chance when two protein or DNA sequences are compared. A related, subsidiary question concerns the definition of scoring systems that are optimal for distinguishing biologically meaningful patterns from chance similarities. Advances this year include:

a) The publication of a scoring system for molecular sequence comparison that is sensitive to similarities at all evolutionary distances, including an analysis of its statistics. This work was completed mainly in the previous year but was published this year. It details how a single "amino acid substitution matrix" is best adapted to detecting similarities at a single evolutionary distance, and describes how multiple matrices may be used to cover the complete range of detectable similarities. The statistics of this multiple-matrix comparison method are studied (Altschul, 1993).

b) Statistics for the sum of the scores of high-scoring segment pairs. In collaboration with Samuel Karlin, I have described the statistical behavior of Sr, the sum of the scores of the r highest-scoring distinct segment pairs (Karlin & Altschul, 1993). These results are the first rigorous approach to the statistics of scored alignments with gaps. A program to calculate the distribution of Sr, which involves a double integral, has been developed with the assistance of Warren Gish and John Spouge.

c) The development of Poisson and sum statistics for consistent high-scoring segment pairs. Comparison of protein or DNA sequences frequently yields multiple high-scoring segment pairs. A combined assessment of these segment pairs is generally appropriate only when they can be joined, with the introduction of gaps, into a single consistent alignment. This requires a modification of the sum statistics just described, and of the Poisson probability for finding at least r distinct segment pairs with score at least S. The imposition of consistency at once weeds out many "chance" alignments and increases the reported significance of the true ones. The statistics of consistent segment pairs have now been described (Karlin & Altschul, 1993), and they have been incorporated into the BLAST programs.
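
As a rough illustration of the Poisson probability mentioned in item c), the sketch below computes the chance of finding at least r distinct segment pairs, each with score at least S, when the number of such pairs is treated as Poisson-distributed with mean K·m·n·exp(-λS). This is the unmodified form, without the consistency adjustment described above. The function names are hypothetical, and the values used for K and λ are illustrative assumptions, not parameters taken from this project.

```python
import math


def expected_hsp_count(score, m, n, K, lam):
    """Expected number of distinct segment pairs scoring at least `score`
    when comparing random sequences of lengths m and n.
    K and lam are scoring-system parameters (assumed known)."""
    return K * m * n * math.exp(-lam * score)


def poisson_at_least(r, score, m, n, K, lam):
    """Probability of at least r distinct segment pairs with score >= `score`,
    under a Poisson model for the number of such pairs."""
    a = expected_hsp_count(score, m, n, K, lam)
    # P(X >= r) = 1 - sum_{i=0}^{r-1} exp(-a) * a^i / i!
    below_r = sum(math.exp(-a) * a**i / math.factorial(i) for i in range(r))
    return 1.0 - below_r


if __name__ == "__main__":
    # Illustrative (assumed) values: two sequences of length 250,
    # K = 0.13 and lam = 0.32, score threshold S = 40.
    for r in (1, 2, 3):
        p = poisson_at_least(r, score=40, m=250, n=250, K=0.13, lam=0.32)
        print(f"r = {r}: P = {p:.3g}")
```

For very small expected counts the subtraction from 1.0 loses precision; a production implementation would likely sum the upper tail directly instead.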

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000014-03
Application #: 3759301
Support Year: 3
Fiscal Year: 1994
Name: National Library of Medicine
Country: United States