The Conserved Domain Database (CDD) at NCBI currently uses a local alignment tool (RPS-BLAST) to perform its updates. An update consists of taking a new putative protein sequence and matching it to a position-specific scoring matrix (PSSM) in the CDD. Chimeric sequences sometimes corrupt the database, because they match PSSMs in the CDD well locally without matching over their full length. It stands to reason that a global alignment method would be able to detect the chimeras: the non-matching portion of a chimeric sequence would lower its global score, even though it does not lower its local score. Currently, much human effort goes into curating the database and culling chimeras. The lack of a global p-value was the main obstacle to using global alignment in the CDD update, but in fact I have known of a method for calculating it for some time. Sergey Sheetlin is programming the method, for testing by Maricel Kann in the NCBI structure group. Preliminary tests show that global alignment is much more sensitive to the details of a sequence than local alignment, and appears much better able to place a sequence in the correct subfamily.
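
The contrast between local and global scores for a chimera can be illustrated with a small sketch. It is not RPS-BLAST and not the method under development in this project: the PSSM, the match and mismatch scores, the linear gap penalty, and the names `make_toy_pssm` and `align_score` are all hypothetical, chosen only to make the effect visible. It also computes raw scores only, with no p-value.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def make_toy_pssm(consensus, match=5, mismatch=-3):
    """Build a toy PSSM: one column per consensus residue (hypothetical scores)."""
    return [
        {aa: (match if aa == c else mismatch) for aa in ALPHABET}
        for c in consensus
    ]


def align_score(seq, pssm, gap=-4, local=False):
    """Score seq against pssm by dynamic programming with a linear gap penalty.

    local=False: global (Needleman-Wunsch style), every residue and column counts.
    local=True:  local (Smith-Waterman style), the best-scoring sub-alignment only.
    """
    cols = len(pssm) + 1
    # First row: aligning an empty sequence prefix against j PSSM columns.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(seq) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            score = max(
                prev[j - 1] + pssm[j - 1][seq[i - 1]],  # residue aligned to column
                prev[j] + gap,                          # residue left unaligned
                curr[j - 1] + gap,                      # column left unaligned
            )
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
            curr[j] = score
            best = max(best, score)
        prev = curr
    return best if local else prev[-1]


# Toy domain model and two query sequences (all made up for illustration).
pssm = make_toy_pssm("MKVLAGDE")
full_match = "MKVLAGDE"
chimera = "MKVLAGDE" + "WWWWPPPPHHHH"  # domain followed by an unrelated segment

for name, seq in [("full match", full_match), ("chimera", chimera)]:
    print(name,
          "local:", align_score(seq, pssm, local=True),
          "global:", align_score(seq, pssm, local=False))
```

Under these toy scores the chimera obtains the same local score as the genuine full-length match, while its global score drops sharply because every residue of the unrelated segment is penalized.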

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM091804-01
Application #: 6988473
Study Section: (CBB)
Support Year: 1
Fiscal Year: 2004

Name: National Library of Medicine
Country: United States

Kann, Maricel G; Sheetlin, Sergey L; Park, Yonil et al. (2007) The identification of complete domains within protein sequences using accurate E-values for semi-global alignment. Nucleic Acids Res 35:4678-85