This project constitutes an effort to increase the speed of the BLAST algorithms while retaining comparable sensitivity to weak similarities. The approach involves two changes to the basic algorithm. 1) BLAST often finds several alignments involving a single database sequence which, when considered together, are significant. Missing any one of these alignments may compromise the combined result. The new """"""""two-pass"""""""" method uses a higher T parameter for an initial scan of the database. For each database sequence, the discovery of a single alignment with sufficient score triggers a second scan of that sequence using a lower T. This decreases average search time, but retains the ability to find significant results involving several weak alignments. 2) In order to find significant alignments, BLAST first seeks short (typically three-letter) word pairs whose aligned score is at least T. Each such """"""""hit"""""""" is then extended to test whether it is contained in a high-scoring alignment. For the default T value, this extension step consumes the great majority of the processing time used by the old BLAST program. The """"""""two-hit"""""""" method requires the location of two non-overlapping word pairs on the same diagonal, and within a distance A of one another, before either pair is expanded. In order to achieve comparable sensitivity, the threshold parameter T must be lowered, yielding more hits than previously. However, only a small fraction of these hits are expanded; the result is a decrease in the average amount of computation required. This work was presented in March in invited talks at the Second Sandia National Laboratories Workshop on Computational Molecular Biology, in Albuquerque, NM, and at the International Symposium on Theoretical and Computational Genome Research, in Heidelberg, Germany.

Agency
National Institute of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM000072-02
Application #
6162806
Study Section
Special Emphasis Panel (CBB)
Project Start
Project End
Budget Start
Budget End
Support Year
2
Fiscal Year
1997
Total Cost
Indirect Cost
Name
National Library of Medicine
Department
Type
DUNS #
City
State
Country
United States
Zip Code
Altschul, Stephen F; Gertz, E Michael; Agarwala, Richa et al. (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res 37:815-24
Gertz, E Michael; Yu, Yi-Kuo; Agarwala, Richa et al. (2006) Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 4:41
Altschul, Stephen F; Wootton, John C; Gertz, E Michael et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272:5101-9
Schaffer, A A; Aravind, L; Madden, T L et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29:2994-3005
Schaffer, A A; Wolf, Y I; Ponting, C P et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000-11