NCBI currently uses the local alignment tool "rps-BLAST" to search the CDD. Local alignment tools are inherently inappropriate for CDD retrieval, because complete domains are (by definition) the units conserved in evolution. Retrieval should therefore compare complete domains to protein subsequences, which is "semi-global" alignment. Accordingly, we developed a semi-global alignment algorithm and a novel statistical approximation for its p-values. The method discovers whole protein domains within a query protein sequence, thereby giving clues to the function of novel protein sequences. Dr. Sergey Sheetlin implemented our method as a dynamic-programming algorithm in a program called "GLOBAL". Simulations show that our approximation is much better than the p-value approximations in other tools like HMMer and rps-BLAST, making GLOBAL a promising candidate for an iterative search tool in protein sequence databases. Dr. Kann analyzed the retrieval efficacy of several competing methods, including HMMer, an implementation of hidden Markov models (HMMs), and showed that HMMer (in global mode) performs about the same as GLOBAL, and that GLOBAL performs better than rps-BLAST. GLOBAL is in fact a degenerate HMM. While retaining HMM retrieval efficacies, GLOBAL is simple enough to be accelerated by the same heuristics used in local alignment methods like BLAST. Accordingly, Dr. Carroll accelerated GLOBAL using the BLAST word-heuristic, speeding it up by about an order of magnitude and making its speed competitive with other domain-retrieval tools. He has incorporated the resulting code in the NCBI CoreTools. He is currently examining various iterative retrieval strategies, to develop an iterative tool for retrieval from a protein database (analogous to PSI-BLAST).
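To make the term "semi-global" concrete, the following is a minimal dynamic-programming sketch in which the whole domain must be aligned end to end, while the alignment may start and end anywhere in the query. It is not the GLOBAL implementation: the function name `semi_global_align`, the use of a plain residue string in place of a CDD domain profile, and the match, mismatch, and linear gap scores are all assumptions chosen for illustration, and no p-value statistics are computed.

```python
def semi_global_align(domain, query, match=2, mismatch=-1, gap=-2):
    """Score the best alignment of ALL of `domain` to SOME substring of `query`.

    Illustrative only: scores and the linear gap penalty are assumed values.
    Returns (score, start, end) so that query[start:end] is the aligned region.
    """
    m, n = len(domain), len(query)
    # Row 0: no domain residues consumed yet, so starting anywhere in the
    # query is free (score 0). This is what makes the query side "local".
    prev = [0] * (n + 1)
    start_prev = list(range(n + 1))  # query start position for each cell
    for i in range(1, m + 1):
        # Column 0: domain residues aligned against nothing must pay gaps,
        # because the domain side of the alignment is global.
        curr = [prev[0] + gap] + [0] * n
        start_curr = [0] * (n + 1)
        for j in range(1, n + 1):
            s = match if domain[i - 1] == query[j - 1] else mismatch
            diag = prev[j - 1] + s   # align domain[i-1] with query[j-1]
            up = prev[j] + gap       # domain residue opposite a gap
            left = curr[j - 1] + gap # query residue opposite a gap
            best = max(diag, up, left)
            curr[j] = best
            if best == diag:
                start_curr[j] = start_prev[j - 1]
            elif best == up:
                start_curr[j] = start_prev[j]
            else:
                start_curr[j] = start_curr[j - 1]
        prev, start_prev = curr, start_curr
    # Last row: the whole domain is consumed; ending anywhere in the query
    # is free, so take the best-scoring end position.
    end = max(range(n + 1), key=lambda j: prev[j])
    return prev[end], start_prev[end], end


if __name__ == "__main__":
    score, start, end = semi_global_align("ACDE", "WWWACDEWWW")
    print(score, start, end)  # 8 3 7
```

In contrast, a local alignment would also allow unaligned ends on the domain side, so it could report a high-scoring fragment of a domain rather than the complete domain.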
|Carroll, Hyrum D; Williams, Alex C; Davis, Anthony G et al. (2015) Improving Retrieval Efficacy of Homology Searches Using the False Discovery Rate. IEEE/ACM Trans Comput Biol Bioinform 12:531-7|
|Frith, Martin C; Park, Yonil; Sheetlin, Sergey L et al. (2008) The whole alignment and nothing but the alignment: the problem of spurious alignment flanks. Nucleic Acids Res 36:5863-71|