In the last few years, the rapid accumulation of genome sequences and protein structures has been paralleled by major advances in sequence database search methods. The powerful Position-Specific Iterated BLAST (PSI-BLAST) method developed at the NCBI formed the basis of our work on protein motif analysis. We developed a new mode of PSI-BLAST application, implemented as an automatic procedure, in which the database is searched exhaustively: PSI-BLAST iterations are run to convergence, and the search is then repeated with the newly identified protein family members as queries. Two other new procedures, IMPALA and RPS-BLAST, allow one to search a library of protein family profiles using an individual protein sequence as the query. The BLASTCLUST procedure was developed to cluster proteins flexibly by sequence similarity, taking BLAST search outputs as input. These methods were applied to perform a systematic survey of completely sequenced genomes and to produce a census of protein structural folds. A theoretical study on predicting the total number of protein folds and families was performed, yielding estimates of approximately 1000 folds and approximately 5000 families. The evolutionary history and phyletic distribution of several types of protein domains were analyzed in detail, including a variety of proteins involved in RNA metabolism and programmed cell death, the vast class of GTPases and related ATPases, P-loop kinases, and a variety of other protein classes.
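The exhaustive search mode described above can be illustrated with a minimal sketch. It assumes that each search returns a set of significant hits and that every new hit is queued as a later query until no new family members appear. The function `exhaustive_family_search`, the `search` callback, and the toy hit table are hypothetical names introduced for illustration only; they are not the actual procedure or any NCBI API, and a real `search` would wrap a PSI-BLAST run iterated to convergence.

```python
from collections import deque
from typing import Callable, Iterable, Set


def exhaustive_family_search(
    seed: str,
    search: Callable[[str], Iterable[str]],
    max_queries: int = 1000,
) -> Set[str]:
    """Collect a protein family by re-querying with every newly found member.

    `search` is a hypothetical stand-in for one converged PSI-BLAST run:
    it takes a sequence ID and returns the IDs of its significant hits.
    `max_queries` is an assumed safety cap, not part of the original method.
    """
    family = {seed}
    queue = deque([seed])
    queries_run = 0
    while queue and queries_run < max_queries:
        query = queue.popleft()
        queries_run += 1
        for hit in search(query):
            if hit not in family:
                # A new family member becomes a query in a later round.
                family.add(hit)
                queue.append(hit)
    return family


# Toy hit table standing in for real search results (illustrative data only).
TOY_HITS = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B"],
    "D": ["E"],
    "E": ["D"],
}


def toy_search(query: str) -> Iterable[str]:
    return TOY_HITS.get(query, [])


print(sorted(exhaustive_family_search("A", toy_search)))  # ['A', 'B', 'C']
```

In this toy case, C is never hit directly by the seed A and is reached only through B, which is the kind of distant member that a single search from the seed alone would miss.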