The BLAST family of protein and DNA database search programs constitutes one of the key services offered by the NCBI. These programs are currently run on NCBI servers about 70,000 times during a typical weekday. This project represents an ongoing effort to improve and extend the functionality of these programs. Efforts this year have focused on the improvement of the PSI-BLAST program. PSI-BLAST searches a database of protein sequences using a position-specific score matrix (PSSM) as query. The PSSMs used are generally constructed on the fly, through multiple iterations of database searching, initiated with a standard protein sequence. PSI-BLAST has been widely used to annotate proteins inferred from new DNA sequences, and to generate sets of PSSMs representing large classes of proteins. In order to improve the sensitivity of the PSI-BLAST program to distant sequence relationships, we developed in previous years a system to evaluate the program's performance. For a set of about 100 query sequences, experts in the group compiled an exhaustive list of related proteins in yeast. Each query can then be compared to a comprehensive protein sequence database through an arbitrary number of PSI-BLAST iterations, and the resulting PSSM compared to the complete set of yeast sequences. This procedure generates a list of yeast sequences ordered by E-value, from which a plot of false positives versus true positives may be obtained. We continued to use our evaluation system to test potential improvements to PSI-BLAST in detecting distant relationships, and to compare PSI-BLAST to other related programs.
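The evaluation procedure described above can be illustrated with a minimal sketch. It assumes the search output has already been reduced to (sequence identifier, E-value) pairs and that the expert-curated list of related yeast proteins is available as a set; the function name, the identifiers, and the E-values below are hypothetical and are not taken from the project's actual tooling.

```python
from typing import Iterable, List, Set, Tuple


def fp_vs_tp_curve(
    hits: Iterable[Tuple[str, float]],
    true_positives: Set[str],
) -> List[Tuple[int, int]]:
    """Walk the hits from lowest to highest E-value and record the
    cumulative (false positive, true positive) counts after each hit."""
    ranked = sorted(hits, key=lambda hit: hit[1])  # most significant first
    curve = []
    fp = tp = 0
    for seq_id, _evalue in ranked:
        if seq_id in true_positives:
            tp += 1
        else:
            fp += 1
        curve.append((fp, tp))
    return curve


# Hypothetical data: four yeast hits, three of which are on the curated list.
example_hits = [
    ("YAL001C", 1e-30),
    ("YBR002W", 0.5),
    ("YCR003W", 1e-5),
    ("YDL004W", 3.0),
]
curated = {"YAL001C", "YCR003W", "YDL004W"}
curve = fp_vs_tp_curve(example_hits, curated)
```

Plotting the second coordinate against the first gives the false positive versus true positive curve for one query; a program that ranks related sequences ahead of unrelated ones produces a curve that rises steeply before it moves right.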
Several avenues were pursued this year: 1) We tested the relative sensitivity of the BLOSUM and OPTIMA scoring systems, and found BLOSUM to be superior; 2) We investigated which parameters relating to the heuristic nature of the BLAST algorithm had the most bearing on PSI-BLAST accuracy, as well as the tradeoff between speed and sensitivity implicit in adjusting these parameters; 3) We compared PSI-BLAST to the related program SAM, and found PSI-BLAST to be both faster and much more accurate in detecting distant relationships; 4) We investigated further the effects of "window-based" composition calculations, and determined that a larger test set will be required to study this procedure; 5) We began implementation of a "hybrid" local alignment scoring system that should permit the introduction of position-specific gap costs. On a separate front, we tested the relative sensitivity of using contiguous and non-contiguous "hits" for BLAST in both the DNA and protein contexts. Non-contiguous hits appear to yield no advantage for protein BLAST searches, but may provide a substantial improvement for DNA BLAST searches.
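The distinction between contiguous and non-contiguous hits can be sketched as follows. This is an illustration of the general idea only, assuming a hit is an exact match at the positions marked "1" in a word template; the templates, sequences, and function name are hypothetical, and the sketch is not the BLAST implementation or the configuration tested in this project.

```python
from typing import Dict, List, Tuple


def seed_hits(query: str, subject: str, template: str) -> List[Tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where the two sequences agree
    at every position marked '1' in the template; '0' positions are ignored."""
    care = [i for i, flag in enumerate(template) if flag == "1"]
    span = len(template)

    # Index every subject window by the letters at the template's '1' positions.
    index: Dict[str, List[int]] = {}
    for j in range(len(subject) - span + 1):
        key = "".join(subject[j + i] for i in care)
        index.setdefault(key, []).append(j)

    # Look up each query window in the index.
    hits = []
    for q in range(len(query) - span + 1):
        key = "".join(query[q + i] for i in care)
        for j in index.get(key, []):
            hits.append((q, j))
    return hits


# Two hypothetical DNA sequences that differ only at position 3.
query_seq = "ACGTACGT"
subject_seq = "ACGAACGT"

contiguous = seed_hits(query_seq, subject_seq, "1111")       # four adjacent matches
non_contiguous = seed_hits(query_seq, subject_seq, "11011")  # four matches, one gap
```

Both templates require four matching letters. In this toy case the non-contiguous template produces a hit at (1, 1) that spans the mismatch, which the contiguous template cannot do; the contiguous template instead finds the exact word ACGT at (0, 4) and (4, 4).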

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM000072-07
Application #
6681350
Study Section
(CBB)
Project Start
Project End
Budget Start
Budget End
Support Year
7
Fiscal Year
2002
Total Cost
Indirect Cost
Name
National Library of Medicine
Department
Type
DUNS #
City
State
Country
United States
Zip Code
Altschul, Stephen F; Gertz, E Michael; Agarwala, Richa et al. (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res 37:815-24
Gertz, E Michael; Yu, Yi-Kuo; Agarwala, Richa et al. (2006) Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 4:41
Altschul, Stephen F; Wootton, John C; Gertz, E Michael et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272:5101-9
Schaffer, A A; Aravind, L; Madden, T L et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29:2994-3005
Schaffer, A A; Wolf, Y I; Ponting, C P et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000-11