This project is a continuing study of which similarities can be expected to occur purely by chance when two protein or DNA sequences are compared. A related, subsidiary question concerns the definition of scoring systems that are optimal for distinguishing biologically meaningful patterns from chance similarities. Work this year includes:

a) A study of the distribution of optimal scores for local alignments that allow gaps. We showed empirically that the characteristic value of this distribution grows linearly with the logarithm of the search space size and does not require a log-log term. As a result, the two relevant statistical parameters can be determined from a random simulation at a single search space size. These parameters were estimated for a number of frequently used amino acid substitution matrices and a wide range of gap costs. We also showed that the statistics for the sum of the scores of the r best locally optimal segment pairs may be extended to alignments allowing gaps. These advances made possible a modification of the BLAST database search programs that allows gaps and reports accurate statistical significance.

b) A refinement of the statistical treatment of multiple, distinct, locally optimal subalignments from the comparison of two sequences. When two proteins share several distinct regions of similarity, it is appropriate to construct a combined assessment of their statistical significance. Earlier treatments allowed the relative order of corresponding regions within the sequences to be taken into account. The new treatment also permits constraints to be placed on the distances between the conserved regions.
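The statistics described in item a) can be illustrated with a minimal sketch. It assumes the standard extreme-value (Gumbel) form for optimal local alignment scores. The abstract does not name its two parameters; here they are written `lam` and `K` by assumption, and the sequence lengths `m` and `n` stand for the search space. The method-of-moments fit below is a simple stand-in, not the estimation procedure used in the project, and the demo scores are synthetic draws rather than scores from real alignments.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def characteristic_value(lam, K, m, n):
    """Score u at which the expected number of chance hits is 1.

    u = ln(K*m*n) / lam, so u grows linearly with the log of the
    search space size m*n, with no log-log term.
    """
    return math.log(K * m * n) / lam


def expect_value(score, lam, K, m, n):
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(score, lam, K, m, n):
    """Probability of at least one chance alignment scoring >= `score`."""
    return -math.expm1(-expect_value(score, lam, K, m, n))


def fit_gumbel_moments(scores, m, n):
    """Estimate (lam, K) from optimal scores at ONE search space size.

    Assumption: method of moments on a Gumbel distribution, using
    variance = pi^2 / (6 lam^2) and mean = u + gamma / lam.
    """
    count = len(scores)
    mean = sum(scores) / count
    var = sum((s - mean) ** 2 for s in scores) / (count - 1)
    lam = math.pi / math.sqrt(6.0 * var)
    u = mean - EULER_GAMMA / lam
    K = math.exp(lam * u) / (m * n)
    return lam, K


def sample_gumbel(rng, lam, u):
    """Draw one synthetic score from a Gumbel law (hypothetical stand-in
    for the optimal score of one random sequence comparison)."""
    x = max(rng.random(), 1e-300)  # avoid log(0)
    return u - math.log(-math.log(x)) / lam


if __name__ == "__main__":
    rng = random.Random(0)
    m = n = 300
    true_lam, true_K = 0.25, 0.03  # illustrative values only
    u = characteristic_value(true_lam, true_K, m, n)
    scores = [sample_gumbel(rng, true_lam, u) for _ in range(20000)]
    lam_hat, K_hat = fit_gumbel_moments(scores, m, n)
    print("estimated lam, K:", lam_hat, K_hat)
    # Extrapolate to a larger search space using the same two parameters.
    print("E-value of score 60 in a 300 x 1e6 search:",
          expect_value(60, lam_hat, K_hat, 300, 10**6))
```

Because the characteristic value depends on the search space only through `ln(K*m*n)`, parameters fitted at one size can be reused at any other size, which is the practical consequence of the absence of a log-log term.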

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000014-05
Application #: 2578619
Study Section: Special Emphasis Panel (CBB)
Project Start:
Project End:
Budget Start:
Budget End:
Support Year: 5
Fiscal Year: 1996
Total Cost:
Indirect Cost:

Name: National Library of Medicine
Department:
Type:
DUNS #:
City:
State:
Country: United States
Zip Code:

Stojmirovic, Aleksandar; Gertz, E Michael; Altschul, Stephen F et al. (2008) The effectiveness of position- and composition-specific gap costs for protein similarity searches. Bioinformatics 24:i15-23
Yu, Yi-Kuo; Gertz, E Michael; Agarwala, Richa et al. (2006) Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res 34:5966-73
Yu, Yi-Kuo; Altschul, Stephen F (2005) The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics 21:902-11
Yu, Yi-Kuo; Wootton, John C; Altschul, Stephen F (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci U S A 100:15688-93
Altschul, S F; Bundschuh, R; Olsen, R et al. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res 29:351-61
Schaffer, A A; Wolf, Y I; Ponting, C P et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000-11