Improvements and Extensions to the BLAST Algorithms

Altschul, Stephen

Abstract

The BLAST family of protein and DNA database search programs constitutes one of the key services offered by the NCBI. These programs are currently run on NCBI servers about 70,000 times during a typical weekday. This project represents an ongoing effort to improve and extend the functionality of these programs. Efforts this year have focused on improving the PSI-BLAST program. PSI-BLAST searches a database of protein sequences using a position-specific score matrix (PSSM) as query. The PSSMs used are generally constructed on the fly, through multiple iterations of database searching, initiated with a standard protein sequence. PSI-BLAST has been widely used to annotate proteins inferred from new DNA sequences, and to generate sets of PSSMs representing large classes of proteins. In order to improve the sensitivity of the PSI-BLAST program to distant sequence relationships, we developed a system to evaluate the program's performance. For a set of about 100 query sequences, experts in the group compiled an exhaustive list of related proteins in yeast. The queries can then be compared to a comprehensive protein sequence database through an arbitrary number of PSI-BLAST iterations, and the resulting PSSM compared to the complete yeast sequence. This procedure generates a list of yeast sequences ordered by E-value, from which a plot of false positives vs. true positives may be obtained. We used our evaluation system to improve the average sensitivity of PSI-BLAST to distant relationships.
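The evaluation step described above, turning an E-value-ordered hit list into false-positive vs. true-positive counts, might be sketched as follows. This is an illustrative sketch only, not the group's actual evaluation code; the function name `fp_vs_tp_curve`, the sequence identifiers, and the E-values are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def fp_vs_tp_curve(
    hits: Iterable[Tuple[str, float]],
    true_relatives: Set[str],
) -> List[Tuple[int, int]]:
    """Return cumulative (false positives, true positives) counts.

    `hits` holds (sequence id, E-value) pairs; `true_relatives` is the
    expert-curated set of ids known to be related to the query.
    One point is produced per hit, in order of increasing E-value.
    """
    ranked = sorted(hits, key=lambda hit: hit[1])  # best (lowest) E-value first
    curve = []
    false_pos = true_pos = 0
    for seq_id, _evalue in ranked:
        if seq_id in true_relatives:
            true_pos += 1
        else:
            false_pos += 1
        curve.append((false_pos, true_pos))
    return curve


# Hypothetical hit list and curated relatives for a single query.
hits = [("yeast_1", 1e-30), ("yeast_2", 0.5), ("yeast_3", 1e-5), ("yeast_4", 3.0)]
true_relatives = {"yeast_1", "yeast_3", "yeast_4"}
curve = fp_vs_tp_curve(hits, true_relatives)
print(curve)
```

Plotting the second coordinate against the first gives the kind of curve the abstract refers to; a curve that rises further before moving right indicates better separation of true relatives from spurious hits.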
The changes adopted include:
1) Filtering the database sequences, rather than the query, for segments of restricted amino acid composition;
2) Calculating E-values based upon the composition of the database sequence hit rather than upon a standard protein amino acid composition;
3) Letting gaps in a given alignment column render the projected amino acid frequencies for that column closer to background frequencies;
4) Decreasing the pseudocount constant from 10 to 7;
5) Increasing the percent difference from other sequences required for inclusion in the multiple alignment from 2% to 5%.
Most of these changes have been incorporated into the version of PSI-BLAST now available over the public NCBI web page, and the remaining changes will be made available at the time of publication. The new program is much less likely to return false positives with spuriously low E-values.
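To illustrate what lowering the pseudocount constant does, the sketch below assumes the frequency-mixing form described in the original PSI-BLAST publication, in which the estimated frequency for a residue in a column is (alpha * observed + beta * pseudo) / (alpha + beta), with beta the pseudocount constant. The abstract itself does not state this formula, and the function name and all numeric inputs other than the constants 10 and 7 are hypothetical.

```python
def mix_frequency(observed: float, pseudo: float, alpha: float, beta: float) -> float:
    """Blend an observed column frequency with a pseudocount frequency.

    alpha weights the observed data (assumed here to grow with the number
    of independent sequences in the column); beta is the pseudocount constant.
    """
    return (alpha * observed + beta * pseudo) / (alpha + beta)


# Hypothetical column: the residue is observed at frequency 0.6, while the
# prior-derived pseudocount frequency is 0.1, with alpha = 5.
old_estimate = mix_frequency(0.6, 0.1, alpha=5.0, beta=10.0)  # constant = 10
new_estimate = mix_frequency(0.6, 0.1, alpha=5.0, beta=7.0)   # constant = 7
print(old_estimate, new_estimate)
```

Under this assumption, the smaller constant gives the observed alignment data more weight relative to the prior, so the estimate moves closer to the observed frequency.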

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000072-05
Application #: 6432754
Study Section: (CBB)

Project Start
Project End
Budget Start
Budget End
Support Year: 5
Fiscal Year: 2000
Total Cost
Indirect Cost

Institution

Name: National Library of Medicine
Department
Type
DUNS #

City
State
Country: United States
Zip Code

Publications

Altschul, Stephen F; Gertz, E Michael; Agarwala, Richa et al. (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res 37:815-24

Gertz, E Michael; Yu, Yi-Kuo; Agarwala, Richa et al. (2006) Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 4:41

Altschul, Stephen F; Wootton, John C; Gertz, E Michael et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272:5101-9

Schaffer, A A; Aravind, L; Madden, T L et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29:2994-3005

Schaffer, A A; Wolf, Y I; Ponting, C P et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000-11
