This project is a continuing study of which similarities can be expected to occur purely by chance when two protein or DNA sequences are compared. A subsidiary, related question concerns the definition of scoring systems that are optimal for distinguishing biologically meaningful patterns from chance similarities. Work this year includes:

a) The implementation of a more accurate method to assess statistical significance in the context of the BLAST and PSI-BLAST database search programs. The assignment of E-values in the BLAST family of programs has depended upon the use of a standard composition for database sequences. As a result, alignments involving sequences with similarly biased compositions can receive inappropriately low E-values. A new approach re-estimates the relevant statistical parameters for each pair of sequences that yields a seemingly significant alignment, and the new parameters lead to a revised estimate of statistical significance. This can have a major effect on the output of PSI-BLAST, where the inclusion of a false positive during one iteration can corrupt all further results. The new approach has been implemented and tested for both BLAST and PSI-BLAST, and is now available on the NCBI web site. A substantial decrease in the number of false positive results is apparent.

b) The implementation of a fast and accurate method for extracting maximum-likelihood estimates of statistical parameters for local alignment scores. Based upon ideas introduced by Waterman & Vingron, and further developed by Olsen, Bundschuh & Hwa, a new island method for estimating statistical parameters for local alignment score distributions has been described and implemented. Compared with the direct method previously in most common use, the new method has several advantages:
i) It renders explicit the tradeoff between parameter estimate bias and stochastic error, and allows this tradeoff to be easily controlled;
ii) It allows parameter estimates to be obtained for sequence comparisons of arbitrary length, including the infinite-length limit;
iii) It accurately estimates the tail behavior of score distributions for comparisons of short sequences.
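
The maximum-likelihood step in an island-style estimate can be sketched as follows. This is a minimal illustration, not the NCBI implementation: the function name `island_estimates` and its arguments are hypothetical, scores are assumed to be integers with unit spacing, and island peak scores at or above a cutoff c are assumed to follow a geometric tail with an expected count of K·A·exp(-λc) in a search area A. Under those assumptions the estimates are λ = ln(1 + 1/(mean score - c)) and K = N·exp(λc)/A, where N is the number of islands retained. Refinements used in practice are not shown.

```python
import math


def island_estimates(peak_scores, cutoff, search_area):
    """Estimate (lambda, K) from island peak scores at or above `cutoff`.

    Assumptions (illustrative only): integer scores with unit spacing, so the
    excess of a peak score over the cutoff is treated as geometric, and
    `search_area` is the total area (sum of m * n) of the comparisons that
    produced the islands.
    """
    kept = [s for s in peak_scores if s >= cutoff]
    if not kept:
        raise ValueError("no island reaches the cutoff")

    mean_excess = sum(kept) / len(kept) - cutoff
    if mean_excess <= 0:
        raise ValueError("every retained island scores exactly the cutoff")

    # Maximum-likelihood estimate of lambda for a geometric tail.
    lam = math.log(1.0 + 1.0 / mean_excess)
    # Match the observed island count to the expected K * A * exp(-lam * cutoff).
    k = len(kept) * math.exp(lam * cutoff) / search_area
    return lam, k


# Hypothetical peak scores; the islands scoring 3 and 4 fall below the cutoff.
lam, k = island_estimates([3, 5, 6, 7, 6, 4], cutoff=5, search_area=128)
print(f"lambda = {lam:.4f}, K = {k:.4f}")
```

In this sketch the cutoff is the knob referred to in point i): a higher cutoff retains fewer islands, which increases stochastic error, but relies less on low-scoring islands, where the assumed tail form may hold less well.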

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000014-09
Application #: 6432746
Study Section: (CBB)
Support Year: 9
Fiscal Year: 2000
Name: National Library of Medicine
Country: United States
Stojmirovic, Aleksandar; Gertz, E Michael; Altschul, Stephen F et al. (2008) The effectiveness of position- and composition-specific gap costs for protein similarity searches. Bioinformatics 24:i15-23
Yu, Yi-Kuo; Gertz, E Michael; Agarwala, Richa et al. (2006) Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res 34:5966-73
Yu, Yi-Kuo; Altschul, Stephen F (2005) The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics 21:902-11
Yu, Yi-Kuo; Wootton, John C; Altschul, Stephen F (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci U S A 100:15688-93
Altschul, S F; Bundschuh, R; Olsen, R et al. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res 29:351-61
Schaffer, A A; Wolf, Y I; Ponting, C P et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000-11