This project is a continuing study of which similarities can be expected to occur purely by chance when two protein or DNA sequences are compared. A related, subsidiary question concerns the definition of scoring systems that are optimal for distinguishing biologically meaningful patterns from chance similarities. Advances this year include a study of the distribution of optimal scores from local alignments allowing gaps. It showed empirically that the characteristic value of this distribution grows linearly with the logarithm of the search space size and does not require a log-log term. As a result, the two relevant statistical parameters can be determined from a random simulation for a single search space size. These parameters were estimated for a number of frequently used amino acid substitution matrices and a wide range of gap penalties. It was also shown that the statistics for the sum of the scores of the best locally optimal segment pairs (Karlin & Altschul, 1993) may be extended to alignments allowing gaps. Together, these advances enabled a modification of the BLAST database search programs that permits gaps and reports accurate statistical significances. The work was done in collaboration with Warren Gish (Washington University, St. Louis) and is described in a paper soon to appear in Methods in Enzymology.
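
The idea of determining both statistical parameters from a simulation at a single search space size can be illustrated with a minimal sketch. It assumes that optimal local alignment scores follow an extreme value (Gumbel) distribution whose characteristic value is ln(K·m·n)/λ for sequences of lengths m and n, so that the expected number of chance alignments scoring at least S is K·m·n·exp(-λS). The function names, the method-of-moments fit, and the numerical values are illustrative assumptions, not the estimation procedure used in the study. Real scores would come from aligning random sequences; here they are drawn directly from a Gumbel distribution with known parameters so the fit can be checked.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def estimate_lambda_k(scores, m, n):
    """Method-of-moments estimate of (lambda, K) from optimal scores
    collected at one search space size m * n.

    Assumes a Gumbel distribution with scale 1/lambda and
    characteristic value u = ln(K * m * n) / lambda.
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))  # Gumbel variance = pi^2 / (6 lam^2)
    u = mean - EULER_GAMMA / lam           # Gumbel mean = u + gamma / lam
    k = math.exp(lam * u) / (m * n)        # invert u = ln(K m n) / lam
    return lam, k


def expected_chance_hits(score, m, n, lam, k):
    """Expected number of chance alignments scoring at least `score`."""
    return k * m * n * math.exp(-lam * score)


def sample_gumbel_scores(count, m, n, lam, k, seed=0):
    """Stand-in for a random-sequence simulation: draws scores from the
    assumed Gumbel distribution by inverting its CDF."""
    rng = random.Random(seed)
    u = math.log(k * m * n) / lam
    scores = []
    while len(scores) < count:
        p = rng.random()
        if p == 0.0:  # avoid log(0); practically never happens
            continue
        scores.append(u - math.log(-math.log(p)) / lam)
    return scores


if __name__ == "__main__":
    # Hypothetical "true" parameters and sequence lengths, for illustration only.
    true_lam, true_k, m, n = 0.25, 0.05, 250, 250
    scores = sample_gumbel_scores(20000, m, n, true_lam, true_k)
    lam, k = estimate_lambda_k(scores, m, n)
    print(f"lambda ~ {lam:.3f}, K ~ {k:.3f}")
    # Apply the fitted parameters to a much larger search space.
    print(expected_chance_hits(60, 300, 50_000_000, lam, k))
```

Because the characteristic value is assumed to depend on the search space only through ln(m·n), parameters fitted at one size can be reused at another, which is what the last line of the sketch does. Note that K is estimated far less precisely than λ by this simple fit, since it depends exponentially on the estimated characteristic value.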

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000014-04
Application #: 5203616
Support Year: 4
Fiscal Year: 1995
Name: National Library of Medicine
Country: United States

Stojmirovic, Aleksandar; Gertz, E Michael; Altschul, Stephen F et al. (2008) The effectiveness of position- and composition-specific gap costs for protein similarity searches. Bioinformatics 24:i15-23
Yu, Yi-Kuo; Gertz, E Michael; Agarwala, Richa et al. (2006) Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res 34:5966-73
Yu, Yi-Kuo; Altschul, Stephen F (2005) The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics 21:902-11
Yu, Yi-Kuo; Wootton, John C; Altschul, Stephen F (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci U S A 100:15688-93
Altschul, S F; Bundschuh, R; Olsen, R et al. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res 29:351-61
Schaffer, A A; Wolf, Y I; Ponting, C P et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000-11