The BLAST family of protein and DNA database search programs constitutes one of the key services offered by the NCBI. These programs are currently run on NCBI servers about 70,000 times during a typical weekday. This project represents an ongoing effort to improve and extend the functionality of these programs. Efforts this year have focused on the development of the IMPALA program.

PSI-BLAST searches a database of protein sequences using a position-specific score matrix (PSSM) as query. The PSSMs used are generally constructed on the fly, through multiple iterations of database searching, initiated with a standard protein sequence. PSI-BLAST has been widely used to annotate proteins inferred from new DNA sequences, and to generate sets of PSSMs representing large classes of proteins. This has created the need for an inverse program that searches a database of PSI-BLAST-generated PSSMs using a standard protein sequence as query. The new IMPALA program answers this need.

Because databases of PSSMs will typically be orders of magnitude smaller than standard protein databases, a program such as IMPALA can afford to spend much more time on each pairwise comparison than the corresponding BLAST program. Accordingly, IMPALA implements the Smith-Waterman algorithm, adapted to sequence-PSSM comparison. One novel feature of IMPALA is its assessment of the statistical significance of the alignments it produces. For each alignment reported, a new pairwise lambda scale parameter [see PNAS 87:2264-8] is calculated for ungapped alignments. This parameter is used to rescale the PSSM to one with the same lambda used in precomputed gapped-alignment simulations. This approach leads to a substantial reduction in the number of false positive hits at any chosen level of statistical significance. It is in the process of being added to the BLAST and PSI-BLAST programs, and should improve the sensitivity of those programs.
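The two ideas above, Smith-Waterman alignment of a sequence against a PSSM and rescaling of the PSSM by a pairwise ungapped lambda, can be illustrated with a minimal Python sketch. It is not the IMPALA implementation. The function names, the linear gap penalty, the default score for residues missing from a column, and the assumption that every query residue is equally likely to pair with every PSSM column are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

DEFAULT_SCORE = -4  # assumed score for a residue absent from a PSSM column


def smith_waterman_pssm(query, pssm, gap=4):
    """Best local alignment score of `query` against `pssm`.

    `pssm` is a list of columns; each column maps a residue letter to a score.
    A linear gap penalty is used for brevity (an assumption of this sketch).
    """
    prev = [0] * (len(pssm) + 1)
    best = 0
    for residue in query:
        cur = [0] * (len(pssm) + 1)
        for j, column in enumerate(pssm, start=1):
            match = prev[j - 1] + column.get(residue, DEFAULT_SCORE)
            # Local alignment: a cell never drops below zero.
            cur[j] = max(0, match, prev[j] - gap, cur[j - 1] - gap)
            best = max(best, cur[j])
        prev = cur
    return best


def score_distribution(query, pssm):
    """Score probabilities, assuming each (residue, column) pair is equally likely."""
    counts = Counter(col.get(res, DEFAULT_SCORE) for res in query for col in pssm)
    total = len(query) * len(pssm)
    return {score: n / total for score, n in counts.items()}


def ungapped_lambda(score_probs, lo=1e-6, hi=10.0, iters=100):
    """Positive root of sum_s p(s) * exp(lambda * s) = 1, found by bisection.

    Requires a negative expected score and at least one positive score.
    """
    def f(lam):
        return sum(p * math.exp(lam * s) for s, p in score_probs.items()) - 1.0

    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:   # past the root: shrink from above
            hi = mid
        else:            # before the root: shrink from below
            lo = mid
    return (lo + hi) / 2


def rescale_pssm(pssm, lam, target_lam):
    """Rescale integer scores so the PSSM behaves as if its lambda were `target_lam`."""
    factor = lam / target_lam
    return [{res: round(s * factor) for res, s in col.items()} for col in pssm]
```

In this sketch, a caller would compute `ungapped_lambda(score_distribution(query, pssm))` for each query-PSSM pair, pass the result to `rescale_pssm` together with the lambda assumed by the precomputed gapped statistics, and then run `smith_waterman_pssm` on the rescaled matrix.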
- similarity search, database search, homology, BLAST, PSI-BLAST, PHI-BLAST, IMPALA
Altschul, Stephen F; Gertz, E Michael; Agarwala, Richa et al. (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res 37:815-24
Gertz, E Michael; Yu, Yi-Kuo; Agarwala, Richa et al. (2006) Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 4:41
Altschul, Stephen F; Wootton, John C; Gertz, E Michael et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272:5101-9
Schaffer, A A; Aravind, L; Madden, T L et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29:2994-3005
Schaffer, A A; Wolf, Y I; Ponting, C P et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000-11