The central focus of the project this year was to investigate ways of improving the retrieval accuracy of DELTA-BLAST through "model surgery" and through asymmetric but uniform gap costs. Traditionally, PSI-BLAST constructs its position-specific score matrix (PSSM) using the query sequence as a template, with each amino acid serving as a placeholder for a column of the matrix. However, if the query sequence includes an atypical insertion or deletion, the resulting PSSM is handicapped, because it must imply a corresponding deletion or insertion when aligned to most related sequences. The recently developed DELTA-BLAST first aligns a query sequence to a database of PSSMs, which opens the possibility of allowing the constructed PSSM to take its length from any of the aligned PSSMs rather than from the query. Furthermore, insertions and deletions with respect to PSSMs constructed using such model surgery can be treated asymmetrically, for example by penalizing insertions less than deletions. We have achieved statistically significant improvements using this approach, and are investigating whether it would be fruitful to extend the Conserved Domain Database (CDD) to include family-specific insertion and deletion costs.
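
The idea of asymmetric gap costs described above can be illustrated with a minimal sketch. It is not the DELTA-BLAST implementation: it assumes a global alignment with linear per-position gap costs, whereas BLAST uses local alignment with affine costs. The function name `align_score`, the toy PSSM, the default mismatch score of -4, and the cost values are all hypothetical and chosen only for illustration.

```python
def align_score(pssm, seq, insert_cost, delete_cost, unknown=-4):
    """Global alignment score of seq against a PSSM with linear gap costs.

    pssm: list of dicts (residue -> score), one dict per model column.
    insert_cost: penalty per sequence residue not matched to any column.
    delete_cost: penalty per model column skipped by the sequence.
    unknown: assumed score for residues absent from a column's dict.
    """
    n, m = len(pssm), len(seq)
    # best[i][j] = best score aligning the first i columns to the first j residues
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - delete_cost
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - insert_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = best[i - 1][j - 1] + pssm[i - 1].get(seq[j - 1], unknown)
            deletion = best[i - 1][j] - delete_cost    # column i has no residue
            insertion = best[i][j - 1] - insert_cost   # residue j has no column
            best[i][j] = max(match, deletion, insertion)
    return best[n][m]


# Toy three-column model that strongly prefers A, C, D in order.
toy_pssm = [{"A": 5}, {"C": 5}, {"D": 5}]

# Insertions penalized less than deletions (illustrative values only).
score_exact = align_score(toy_pssm, "ACD", insert_cost=2, delete_cost=6)
score_insertion = align_score(toy_pssm, "AGCD", insert_cost=2, delete_cost=6)
score_deletion = align_score(toy_pssm, "AD", insert_cost=2, delete_cost=6)
```

With these illustrative costs, a sequence carrying one extra residue loses less score than a sequence missing one model column, which is the asymmetry the text describes. Family-specific costs would correspond to choosing `insert_cost` and `delete_cost` separately for each model.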
Altschul, Stephen F; Gertz, E Michael; Agarwala, Richa et al. (2009) PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res 37:815-24