Improvements and Extensions to the BLAST Algorithms

Altschul, Stephen

Abstract

The BLAST family of protein and DNA database search programs constitutes one of the key services offered by the NCBI. These programs are currently run on NCBI servers about 70,000 times during a typical weekday. This project represents an ongoing effort to improve and extend the functionality of these programs. Efforts this year have focused on the development of the IMPALA program. PSI-BLAST searches a database of protein sequences using a position-specific score matrix (PSSM) as query. The PSSMs used are generally constructed on the fly, through multiple iterations of database searching, initiated with a standard protein sequence. PSI-BLAST has been widely used to annotate proteins inferred from new DNA sequences, and to generate sets of PSSMs representing large classes of proteins. This has created the need for an inverse program that will search a database of PSI-BLAST-generated PSSMs using a standard protein sequence as query. The new IMPALA program answers this need. Because databases of PSSMs will typically be orders of magnitude smaller than standard protein databases, a program such as IMPALA can afford to run much more slowly on each pairwise comparison than the corresponding BLAST program. Accordingly, IMPALA implements the Smith-Waterman algorithm, adapted to sequence-PSSM comparison. One novel feature of IMPALA is its assessment of the statistical significance of the alignments produced. For each alignment reported, a new pairwise lambda scale parameter [see PNAS 87:2264-8] is calculated for ungapped alignments. This parameter is used to rescale the PSSM to one with the same lambda used in precomputed gapped-alignment simulations. This approach leads to a substantial reduction in the number of false positive hits at any chosen level of statistical significance. It is in the process of being added to the BLAST and PSI-BLAST programs, and should improve the sensitivity of those programs.
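The adaptation of Smith-Waterman to sequence-PSSM comparison can be illustrated with a minimal sketch. It is not IMPALA's implementation: the function name `smith_waterman_pssm`, the toy matrix `TOY_PSSM`, the linear gap penalty, and all score values are assumptions made for illustration. The only change from the sequence-sequence algorithm is that the substitution score is looked up by PSSM position and query residue rather than by a pair of residues.

```python
# Hypothetical toy PSSM: one dict of residue scores per position.
# Position 1 favors A, position 2 favors C, position 3 favors D.
TOY_PSSM = [
    {"A": 5, "C": -3, "D": -3, "E": -3},
    {"A": -3, "C": 5, "D": -3, "E": -3},
    {"A": -3, "C": -3, "D": 5, "E": -3},
]


def smith_waterman_pssm(query, pssm, gap=4):
    """Return the best local alignment score of `query` against `pssm`.

    Uses a linear gap penalty for brevity (an assumption; real programs
    typically use affine gap costs).
    """
    m = len(query)
    prev = [0] * (m + 1)  # DP row for the previous PSSM position
    best = 0
    for column in pssm:
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            # Score depends on the PSSM position and the query residue.
            diag = prev[j - 1] + column[query[j - 1]]
            up = prev[j] - gap        # PSSM position aligned to a gap
            left = curr[j - 1] - gap  # query residue aligned to a gap
            curr[j] = max(0, diag, up, left)  # 0 restarts the local alignment
            best = max(best, curr[j])
        prev = curr
    return best
```

With these toy values, `smith_waterman_pssm("EACDE", TOY_PSSM)` scores the exact match `ACD`, while `smith_waterman_pssm("ACED", TOY_PSSM)` has to pay one gap penalty to skip the `E`.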
- similarity search, database search, homology, BLAST, PSI-BLAST, PHI-BLAST, IMPALA
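The lambda rescaling described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the IMPALA code: it solves the standard ungapped equation sum over scores of P(s) * exp(lambda * s) = 1 from the cited PNAS paper by bisection, taking P(s) as the residue frequencies averaged uniformly over PSSM columns (an assumption about how the score distribution is formed). The names `ungapped_lambda` and `rescale_pssm`, and the choice to round rescaled scores to integers, are hypothetical.

```python
import math


def ungapped_lambda(pssm, freqs, tol=1e-9):
    """Solve sum_s P(s) * exp(lam * s) = 1 for the positive root lam.

    `pssm` is a list of {residue: score} dicts; `freqs` maps residues to
    background probabilities (for example, the query's composition).
    """
    score_prob = {}
    for column in pssm:
        for residue, p in freqs.items():
            if p > 0:
                s = column[residue]
                score_prob[s] = score_prob.get(s, 0.0) + p / len(pssm)
    expected = sum(s * p for s, p in score_prob.items())
    if expected >= 0 or max(score_prob) <= 0:
        raise ValueError("need a negative expected score and a positive score")

    def f(lam):
        return sum(p * math.exp(lam * s) for s, p in score_prob.items()) - 1.0

    lo, hi = 0.0, 1.0
    while f(hi) <= 0:  # grow the bracket until the root lies below hi
        hi *= 2
    while hi - lo > tol:  # f is negative between 0 and the root
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def rescale_pssm(pssm, lam, target_lambda):
    """Scale scores so the matrix's ungapped lambda becomes ~target_lambda."""
    factor = lam / target_lambda
    return [{r: round(s * factor) for r, s in col.items()} for col in pssm]
```

For a one-column matrix with scores +1 and -1 at probabilities 0.25 and 0.75, the equation has the closed-form root ln(0.75/0.25) = ln 3, which gives a quick check of the solver. Rescaling to a target of half that value doubles every score.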

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000072-04
Application #: 6290488
Study Section: Special Emphasis Panel (CBB)

Support Year: 4
Fiscal Year: 1999

Institution

Name: National Library of Medicine
Country: United States

Publications

Altschul, Stephen F; Gertz, E Michael; Agarwala, Richa et al. (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res 37:815-24

Gertz, E Michael; Yu, Yi-Kuo; Agarwala, Richa et al. (2006) Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 4:41

Altschul, Stephen F; Wootton, John C; Gertz, E Michael et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272:5101-9

Schaffer, A A; Aravind, L; Madden, T L et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29:2994-3005

Schaffer, A A; Wolf, Y I; Ponting, C P et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000-11
