Work this year focused on improving PSI-BLAST by developing new methods for estimating the effective number of independent observations represented in an alignment column and for calculating the number of pseudocounts to employ when constructing PSI-BLAST substitution scores. In brief, PSI-BLAST estimates the probabilities of amino acids occurring at an alignment position by combining N "effective" observed amino acid counts with n data-dependent pseudocounts. Because of sequence correlations, the number N of independent observations represented by an alignment column is not simply the number of sequences aligned. We have described a logically improved method for estimating N and have found that its implementation yields improved PSI-BLAST retrieval accuracy. We have also developed a method, inspired by the minimum description length (MDL) principle, for adjusting the number of pseudocounts n as a function of column composition. As suggested by both theory and experiment, n should be larger for more variable positions. Both improvements are now implemented in PSI-BLAST and used by default.
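The way N observed counts and n pseudocounts combine can be illustrated with a minimal Python sketch. It assumes a simple weighted mixture, (N * f + n * g) / (N + n), where f holds the observed column frequencies and g the pseudocount frequencies. The function names, the three-letter toy alphabet, and the fixed values of g are hypothetical and chosen only for illustration. In PSI-BLAST itself the pseudocounts are data-dependent, and neither the estimation of N nor the MDL-inspired choice of n is reproduced here.

```python
from math import log2


def estimate_column_probs(observed_freqs, pseudo_freqs, n_eff, n_pseudo):
    """Mix observed frequencies (weight n_eff) with pseudocount
    frequencies (weight n_pseudo) for one alignment column.

    Illustrative assumption: a plain weighted average. Both inputs are
    dicts mapping residue -> frequency, each summing to 1.
    """
    total = n_eff + n_pseudo
    if total <= 0:
        raise ValueError("n_eff + n_pseudo must be positive")
    return {
        residue: (n_eff * observed_freqs.get(residue, 0.0)
                  + n_pseudo * pseudo_freqs[residue]) / total
        for residue in pseudo_freqs
    }


def log_odds_scores(probs, background):
    """Convert estimated probabilities to log-odds scores in bits."""
    return {r: log2(probs[r] / background[r]) for r in probs}


if __name__ == "__main__":
    # Toy three-letter alphabet; all numbers are made up for illustration.
    observed = {"A": 0.8, "G": 0.2, "S": 0.0}
    pseudo = {"A": 0.5, "G": 0.3, "S": 0.2}

    few = estimate_column_probs(observed, pseudo, n_eff=4, n_pseudo=1)
    many = estimate_column_probs(observed, pseudo, n_eff=4, n_pseudo=6)
    print(few)
    print(many)
    print(log_odds_scores(few, pseudo))
```

With a larger n, the estimate moves away from the observed frequencies and toward the pseudocount frequencies. This is the behavior the abstract describes as appropriate for more variable positions.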

Support Year: 14
Fiscal Year: 2009
Total Cost: $66,342
Name: National Library of Medicine
Shah, Nidhi; Altschul, Stephen F; Pop, Mihai (2018) Outlier detection in BLAST hits. Algorithms Mol Biol 13:7
Altschul, Stephen; Demchak, Barry; Durbin, Richard et al. (2013) The anatomy of successful computational biology software. Nat Biotechnol 31:894-7
Boratyn, Grzegorz M; Schaffer, Alejandro A; Agarwala, Richa et al. (2012) Domain enhanced lookup time accelerated BLAST. Biol Direct 7:12
Altschul, Stephen F; Gertz, E Michael; Agarwala, Richa et al. (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res 37:815-24