This project is a continuing study of what similarities can be expected to occur purely by chance when two protein or DNA sequences are compared. A subsidiary, related question concerns the definition of scoring systems that are optimal for distinguishing biologically meaningful patterns from chance similarities. Work this year includes:

a) The definition of a new method for scoring gaps within protein alignments, and the empirical study of the statistics of optimal alignment scores using this scoring system. Because a single mutational event can delete or insert multiple residues, affine gap costs for sequence alignment charge a penalty for the existence of a gap, and a further length-dependent penalty. From structural or multiple alignments of distantly related proteins, it has been observed that conserved residues frequently fall into ungapped blocks separated by relatively non-conserved regions. To take advantage of this structure, a simple generalization of affine gap costs was proposed which allows non-conserved regions to be effectively ignored. The distribution of scores from local alignments using these generalized gap costs was shown empirically to follow an extreme value distribution. In many cases generalized affine gap costs yield superior alignments from the standpoints of both statistical significance and alignment accuracy. Guidelines for selecting generalized affine gap costs were developed.

b) The development of statistics for local alignments seeded by a pattern. The recently developed PHI-BLAST program constructs optimal local alignments seeded by a pattern specified by a researcher. The random distribution of the scores of these local alignments was studied both analytically and empirically. The statistics developed were incorporated into the PHI-BLAST program, allowing it in many instances to detect significant similarity between homologous proteins that were not recognizably related using traditional single-pass database search methods.
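
As a rough illustration of the gap-scoring idea in item a), the sketch below contrasts an ordinary affine gap cost with one possible three-parameter generalization. The abstract does not give the formula, so the form used here is an assumption: a gap may leave k1 residues of one sequence and k2 of the other unaligned, and is charged an existence penalty, a per-residue penalty on the length difference, and a (typically small) penalty per pair of mutually unaligned residues. The function and parameter names (`gap_open`, `gap_extend`, `pair_skip`) and the numeric costs are hypothetical. The last function shows the standard extreme value form for converting a local alignment score into a P-value; `lam` and `K` are statistical parameters that must be estimated for the scoring system in use.

```python
import math


def affine_gap_cost(length, gap_open, gap_extend):
    """Affine cost: a charge for the gap's existence plus a per-residue charge."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def generalized_gap_cost(k1, k2, gap_open, gap_extend, pair_skip):
    """Assumed generalized affine cost for a region that leaves k1 residues
    of one sequence and k2 residues of the other unaligned.

    min(k1, k2) residue pairs are skipped at pair_skip each, and the
    remaining |k1 - k2| residues are charged gap_extend each.
    """
    if k1 == 0 and k2 == 0:
        return 0
    skipped_pairs = min(k1, k2)
    net_length = abs(k1 - k2)
    return gap_open + gap_extend * net_length + pair_skip * skipped_pairs


def evd_pvalue(score, m, n, lam, K):
    """P-value of a local alignment score under an extreme value distribution,
    for sequences of lengths m and n: P = 1 - exp(-K*m*n*exp(-lam*score))."""
    expected = K * m * n * math.exp(-lam * score)
    return -math.expm1(-expected)


# Illustrative (hypothetical) costs: existence 11, extension 1, skipped pair 1.
one_sided = generalized_gap_cost(5, 0, 11, 1, 1)    # same as an affine gap of length 5
two_sided = generalized_gap_cost(10, 8, 11, 1, 1)   # 8 skipped pairs, net length 2
```

When one of k1 or k2 is zero, the assumed generalized cost reduces to the ordinary affine cost, so an alignment scored this way can only differ from an affine-scored one where a non-conserved region is skipped in both sequences at once.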

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Intramural Research (Z01)
Study Section
Special Emphasis Panel (CBB)
National Library of Medicine
United States
Stojmirovic, Aleksandar; Gertz, E Michael; Altschul, Stephen F et al. (2008) The effectiveness of position- and composition-specific gap costs for protein similarity searches. Bioinformatics 24:i15-23
Yu, Yi-Kuo; Gertz, E Michael; Agarwala, Richa et al. (2006) Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res 34:5966-73
Yu, Yi-Kuo; Altschul, Stephen F (2005) The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics 21:902-11
Yu, Yi-Kuo; Wootton, John C; Altschul, Stephen F (2003) The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci U S A 100:15688-93
Altschul, S F; Bundschuh, R; Olsen, R et al. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res 29:351-61
Schaffer, A A; Wolf, Y I; Ponting, C P et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000-11