The BLAST family of protein and DNA database search programs constitutes one of the key services offered by the NCBI. These programs are currently run on NCBI servers about 70,000 times during a typical weekday. This project represents an ongoing effort to improve and extend the functionality of these programs. Efforts this year have focused on improving the PSI-BLAST program. PSI-BLAST searches a database of protein sequences using a position-specific score matrix (PSSM) as the query. The PSSMs used are generally constructed on the fly, through multiple iterations of database searching initiated with a standard protein sequence. PSI-BLAST has been widely used to annotate proteins inferred from new DNA sequences and to generate sets of PSSMs representing large classes of proteins. To improve the sensitivity of PSI-BLAST to distant sequence relationships, we developed a system to evaluate the program's performance. For a set of about 100 query sequences, experts in the group compiled an exhaustive list of related proteins in yeast. Each query can then be compared to a comprehensive protein sequence database through an arbitrary number of PSI-BLAST iterations, and the resulting PSSM compared to the complete set of yeast sequences. This procedure generates a list of yeast sequences ordered by E-value, from which a plot of false positives versus true positives may be obtained. We used our evaluation system to improve the average sensitivity of PSI-BLAST to distant relationships.
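The evaluation step described above, turning an E-value-ordered hit list into false-positive versus true-positive counts, can be sketched roughly as follows. This is only an illustrative sketch, not the group's actual evaluation code: the function name `tp_fp_curve`, the sequence identifiers, and the E-values are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def tp_fp_curve(
    hits: Iterable[Tuple[str, float]],
    true_relatives: Set[str],
) -> List[Tuple[int, int]]:
    """Return cumulative (false_positives, true_positives) points.

    `hits` holds (sequence_id, e_value) pairs; they are walked from the
    lowest (most significant) E-value to the highest.
    """
    ranked = sorted(hits, key=lambda hit: hit[1])
    points = [(0, 0)]
    tp = 0
    fp = 0
    for seq_id, _e_value in ranked:
        if seq_id in true_relatives:
            tp += 1  # hit is on the curated list of related proteins
        else:
            fp += 1  # hit is not a known relative of the query
        points.append((fp, tp))
    return points


# Hypothetical data: four yeast hits, two of them curated true relatives.
hits = [("seqA", 1e-40), ("seqB", 3e-5), ("seqC", 0.02), ("seqD", 1.5)]
true_relatives = {"seqA", "seqC"}
curve = tp_fp_curve(hits, true_relatives)
```

Plotting the second coordinate against the first gives the kind of curve the text refers to; a curve that rises steeply before moving right indicates that true relatives outrank unrelated sequences.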
The changes adopted include:
1) Filtering the database sequences, rather than the query, for segments of restricted amino acid composition;
2) Calculating E-values based upon the composition of the database sequence hit rather than upon a standard protein amino acid composition;
3) Letting gaps in a given alignment column render the projected amino acid frequencies for that column closer to background frequencies;
4) Decreasing the pseudocount constant from 10 to 7;
5) Increasing the percent difference from other sequences required for inclusion in the multiple alignment from 2% to 5%.
Most of these changes have been incorporated into the version of PSI-BLAST now available over the public NCBI web page, and the remaining changes will be made available at the time of publication. The new program is much less likely to return false positives with spuriously low E-values.
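To illustrate what lowering the pseudocount constant in change 4 does, the sketch below assumes the blending form given in the published description of PSI-BLAST, Q_i = (alpha * f_i + beta * g_i) / (alpha + beta), where f_i are observed column frequencies, g_i are pseudocount frequencies, alpha weights the observed data, and beta is the pseudocount constant. This report does not state the formula, so the form, the function name `blend_frequencies`, the two-letter toy alphabet, and the value of alpha are all assumptions made for illustration.

```python
from typing import List, Sequence


def blend_frequencies(
    observed: Sequence[float],
    pseudo: Sequence[float],
    alpha: float,
    beta: float,
) -> List[float]:
    """Blend observed and pseudocount frequencies for one alignment column.

    Assumed form: Q_i = (alpha * f_i + beta * g_i) / (alpha + beta).
    A smaller beta gives the observed frequencies more weight.
    """
    total = alpha + beta
    return [(alpha * f + beta * g) / total for f, g in zip(observed, pseudo)]


# Toy column over a hypothetical two-letter alphabet.
observed = [1.0, 0.0]  # the column contains only the first residue type
pseudo = [0.5, 0.5]    # assumed pseudocount frequencies
alpha = 7.0            # assumed weight on the observed data

q_beta_10 = blend_frequencies(observed, pseudo, alpha, beta=10.0)
q_beta_7 = blend_frequencies(observed, pseudo, alpha, beta=7.0)
```

Under these assumptions, reducing beta from 10 to 7 moves the estimate for the observed residue from 12/17 to 3/4, so the resulting PSSM column follows the aligned data more closely.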
Altschul, Stephen F; Gertz, E Michael; Agarwala, Richa et al. (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res 37:815-24
Gertz, E Michael; Yu, Yi-Kuo; Agarwala, Richa et al. (2006) Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 4:41
Altschul, Stephen F; Wootton, John C; Gertz, E Michael et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272:5101-9
Schaffer, A A; Aravind, L; Madden, T L et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29:2994-3005
Schaffer, A A; Wolf, Y I; Ponting, C P et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000-11