The Conserved Domain Database (CDD) at NCBI currently performs its updates with a local alignment tool (rps-BLAST). An update takes a new putative protein sequence and matches it to a position-specific scoring matrix (PSSM) in the CDD. Chimeric sequences sometimes corrupt the database, because they match PSSMs in the CDD well locally without matching over their full length. It stands to reason that a global alignment method could detect the chimeras, because the non-matching chimeric segment lowers the global score even though it does not lower the local score. Currently, much human effort goes into curating the database and culling chimeras. The lack of a global p-value was the main obstacle to using global alignment in the CDD update. We developed a global alignment algorithm that is now used in the curation of the CDD, and we are also developing a statistical p-value for global alignments. Dr. Sergey Sheetlin is programming the method for testing by Dr. Maricel Kann in the NCBI structure group. Dr. Kann's preliminary tests show that global alignment is more sensitive than local alignment to certain details of a sequence, and that it can sometimes place a sequence in the correct subfamily more accurately than local alignment.
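The contrast between local and global scoring of a chimera can be illustrated with a small sketch. It is a toy example under stated assumptions, not rps-BLAST and not the algorithm developed in this project: the PSSM is built from a made-up consensus string with fixed match and mismatch scores, gaps carry a linear penalty, and the "global" score is taken to be global in the PSSM and free at the sequence ends. The names `make_pssm`, `local_score`, `semi_global_score`, the example sequences, and all score values are hypothetical.

```python
GAP = -4  # assumed linear gap penalty (hypothetical value)
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def make_pssm(consensus, match=5, mismatch=-3):
    """Build a toy PSSM: one column (residue -> score) per consensus residue."""
    return [{aa: (match if aa == c else mismatch) for aa in ALPHABET}
            for c in consensus]


def local_score(pssm, seq):
    """Best local (Smith-Waterman-style) score of seq against the PSSM."""
    prev = [0] * (len(seq) + 1)
    best = 0
    for column in pssm:
        cur = [0] * (len(seq) + 1)
        for j in range(1, len(seq) + 1):
            cur[j] = max(0,                             # start a new local alignment
                         prev[j - 1] + column[seq[j - 1]],  # align residue to column
                         prev[j] + GAP,                 # column aligned to a gap
                         cur[j - 1] + GAP)              # residue inserted
            best = max(best, cur[j])
        prev = cur
    return best


def semi_global_score(pssm, seq):
    """Best score when every PSSM column must be used; sequence ends are free."""
    prev = [0] * (len(seq) + 1)  # row 0: alignment may start anywhere in seq
    for column in pssm:
        cur = [prev[0] + GAP] + [0] * len(seq)
        for j in range(1, len(seq) + 1):
            cur[j] = max(prev[j - 1] + column[seq[j - 1]],  # align residue to column
                         prev[j] + GAP,                 # column aligned to a gap
                         cur[j - 1] + GAP)              # residue inserted
        prev = cur
    return max(prev)  # alignment may end anywhere in seq


consensus = "MKVLAAGIVGLTRE"                  # made-up 14-residue domain
pssm = make_pssm(consensus)
full = "QQ" + consensus + "PP"                # contains the complete domain
chimera = "QQ" + consensus[:7] + "WWWWWWWWW"  # half a domain fused to unrelated sequence

for name, seq in (("full", full), ("chimera", chimera)):
    print(name, local_score(pssm, seq), semi_global_score(pssm, seq))
```

In this toy setting the full-length sequence scores 70 under both methods. The chimera still scores 35 locally, half of the maximum, but only 14 when the whole PSSM must be aligned, because the seven unmatched columns each incur a penalty.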
Kann, Maricel G; Sheetlin, Sergey L; Park, Yonil et al. (2007) The identification of complete domains within protein sequences using accurate E-values for semi-global alignment. Nucleic Acids Res 35:4678-85