The BLAST family of protein and DNA database search programs constitutes one of the key services offered by the NCBI. These programs are currently run on NCBI servers about 200,000 times during a typical weekday. This project represents an ongoing effort to improve and extend the functionality of these programs. Improvements this year have centered on the blastp and tblastn programs.

The blastp program was modified to allow it to use compositionally adjusted scoring matrices as an alternative to the compositional scaling that has been available for five years. This permits the substitution matrix used to score alignments to be adjusted so that it is consistent with the compositions of the sequences being compared. A study we published this year shows that compositional matrix adjustment is recommended only under certain conditions, so it may be invoked either universally or conditionally. A further study has shown that using neither compositional scaling nor compositional adjustment yields very unreliable statistics, so compositional scaling has been adopted as the default behavior for blastp. After further experience, we may change the default behavior to conditional compositional matrix adjustment.

The program tblastn has been modified so that its substitution matrix may be modified by either compositional scaling or conditional compositional matrix adjustment. Because each database entry is a DNA sequence that is conceptually translated in six frames, at least five of which are usually incorrect, matrix modification requires the definition of a sequence window from which to calculate sequence composition. We have constructed a way to define such a window that yields good empirical results. Our studies have shown that either type of substitution matrix modification yields statistics that are much more accurate than those of the baseline program, with only a minor attendant decrease in retrieval accuracy.
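As a rough illustration of the six-frame conceptual translation and windowed composition described above, the following Python sketch translates a DNA sequence in all six reading frames and computes amino acid frequencies inside a window around an aligned region. It is not the BLAST implementation: the function names, the fixed `flank` parameter, and the simple "extend the hit by a fixed number of residues" window rule are all assumptions made for illustration, and the window definition actually used by tblastn is not given in this summary.

```python
from collections import Counter

# Standard genetic code, laid out in the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame label (+1..+3 forward, -1..-3 reverse) to a protein string."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def window_composition(protein, start, end, flank=50):
    """Amino acid frequencies in protein[start:end] extended by `flank`.

    Hypothetical window rule (an assumption, not the tblastn rule):
    extend the aligned region by a fixed number of residues on each side,
    clip to the sequence ends, and ignore stop codons ('*').
    """
    lo = max(0, start - flank)
    hi = min(len(protein), end + flank)
    window = protein[lo:hi].replace("*", "")
    counts = Counter(window)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


if __name__ == "__main__":
    frames = six_frame_translations("ATGGCCTAA")
    for label in sorted(frames):
        print(label, frames[label])
    print(window_composition(frames[1], 0, 2, flank=0))
```

The point of restricting composition to a window is that most of a translated frame is not real coding sequence, so frequencies taken over the whole translation would be dominated by residues unrelated to the alignment being scored.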

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM000072-11
Application #
7316249
Study Section
(CBB)
Support Year
11
Fiscal Year
2006
Name
National Library of Medicine
Country
United States
Altschul, Stephen F; Gertz, E Michael; Agarwala, Richa et al. (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res 37:815-24
Gertz, E Michael; Yu, Yi-Kuo; Agarwala, Richa et al. (2006) Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol 4:41
Altschul, Stephen F; Wootton, John C; Gertz, E Michael et al. (2005) Protein database searches using compositionally adjusted substitution matrices. FEBS J 272:5101-9
Schaffer, A A; Aravind, L; Madden, T L et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29:2994-3005
Schaffer, A A; Wolf, Y I; Ponting, C P et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000-11