The CDD (Conserved Domain Database) at NCBI currently uses a local alignment tool (RPS-BLAST) to perform its updates. An update consists of taking a new putative protein sequence and matching it to a PSSM (position-specific scoring matrix) in the CDD. Chimeric sequences sometimes corrupt the database, because they match PSSMs in the CDD well locally without matching over their full length. It stands to reason that a global alignment method would be able to detect the chimeras: the non-matching portion of a chimeric sequence lowers the global score even though it does not lower the local score. Currently, much human effort is directed at curating the database and culling chimeras. The lack of a global p-value was the main obstacle to using global alignment in the CDD update. We developed a global alignment algorithm that is now used in the curation of the CDD. We are also developing a statistical p-value for global alignments. Dr Sergey Sheetlin is programming the method, for testing by Dr Maricel Kann in the NCBI structure group. Dr Kann's preliminary tests show that global alignment is more sensitive to certain details of a sequence and can sometimes place a sequence in the correct subfamily more accurately than local alignment.
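The contrast between local and global scores for a chimera can be illustrated with a minimal sketch. It is not the project's method: it aligns two plain sequences instead of a sequence against a PSSM, uses a linear gap penalty, and the function name `align_scores`, the scoring parameters, and the toy sequences are all assumptions chosen for illustration.

```python
def align_scores(query, target, match=2, mismatch=-1, gap=-2):
    """Return (local_score, global_score) for two sequences.

    Local score: Smith-Waterman (best-scoring pair of substrings).
    Global score: Needleman-Wunsch (both sequences aligned end to end).
    Scoring values are illustrative assumptions, not CDD parameters.
    """
    n, m = len(query), len(target)
    global_prev = [j * gap for j in range(m + 1)]  # row 0 of the global table
    local_prev = [0] * (m + 1)                     # row 0 of the local table
    best_local = 0
    for i in range(1, n + 1):
        global_cur = [i * gap] + [0] * m
        local_cur = [0] * (m + 1)
        for j in range(1, m + 1):
            s = match if query[i - 1] == target[j - 1] else mismatch
            global_cur[j] = max(global_prev[j - 1] + s,
                                global_prev[j] + gap,
                                global_cur[j - 1] + gap)
            # The local recurrence may restart at 0, so flanking junk is free.
            local_cur[j] = max(0,
                               local_prev[j - 1] + s,
                               local_prev[j] + gap,
                               local_cur[j - 1] + gap)
            best_local = max(best_local, local_cur[j])
        global_prev, local_prev = global_cur, local_cur
    return best_local, global_prev[m]


# Hypothetical toy data: a "domain" and a chimera made of the domain
# followed by an unrelated segment of equal length.
domain = "MKTAYIAKQR"
chimera = domain + "WWWWWWWWWW"

print("full-length:", align_scores(domain, domain))
print("chimera:    ", align_scores(chimera, domain))
```

In this toy case the local score is the same for the full-length sequence and the chimera, while the global score drops for the chimera because every unmatched residue must be paid for with a gap. That is the signal the abstract proposes to exploit.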

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM091804-02
Application #
7148168
Study Section
(CBB)
Support Year
2
Fiscal Year
2005
Name
National Library of Medicine
Country
United States
Kann, Maricel G; Sheetlin, Sergey L; Park, Yonil et al. (2007) The identification of complete domains within protein sequences using accurate E-values for semi-global alignment. Nucleic Acids Res 35:4678-85