Finding Protein Sequence Motifs - methods And Applications

Koonin, Eugene

Abstract

The rapid accumulation of genome sequences and protein structures during the last decade has been paralleled by major advances in sequence database search methods. The powerful Position-Specific Iterating BLAST (PSI-BLAST) method developed at the NCBI forms the basis of our work on protein motif analysis. In addition, Hidden Markov Models (HMM), protein profile-against-profile comparison implemented in the HHSearch method, protein structure comparison methods, homology modeling of protein structure and genome context analysis were extensively and increasingly applied. Furthermore, custom pipelines for novel domain identification have been developed and applied. The research performed over the last year, has led to further progress in the study of the classification, evolution, and functions of several classes of proteins and domains. In particular, a systematic comparative genomic analysis of all archaeal membrane proteins that have been projected to the last archaeal common ancestor gene set led to the identification of several novel components of predicted secretion, membrane remodeling, and protein glycosylation systems. Among other findings, most crenarchaea have been shown to encode highly diverged orthologs of the membrane insertase YidC, which is nearly universal in bacteria, eukaryotes, and euryarchaea. We also identified a vast family of archaeal proteins, including the C-terminal domain of N-glycosylation protein AglD, as membrane flippases homologous to the flippase domain of bacterial multipeptide resistance factor MprF, a bifunctional lysylphosphatidylglycerol synthase and flippase. Additionally, several proteins were predicted to function as membrane transporters. The results of this work, combined with our previous analyses, reveal an unexpected diversity of putative archaeal membrane-associated functional systems that remain to be functionally characterized. A more general conclusion from this project is that the currently available collection of archaeal (and bacterial) genomes could be sufficient to identify (almost) all widespread functional modules and develop experimentally testable predictions of their functions In a separate project we investigated the domains that are involved in archaeal DNA replication and discovered a novel domain family that is essential for replication initiation. Archaea encode a eukaryotic-type primase comprising a catalytic subunit (PriS) and a noncatalytic subunit (PriL). In collaboration with the laboratory of Li Huang of the Chinese Academy of Sciences, we identified a primase noncatalytic subunit, denoted PriX, from the hyperthermophilic archaeon Sulfolobus solfataricus. Like PriL, PriX is essential for the survival of the organism. The crystallographic analysis complemented by sensitive sequence comparisons shows that PriX is a diverged homologue of the C-terminal domain of PriL but lacks the iron-sulfur cluster. Phylogenomic analysis provides clues on the origin and evolution of PriX. PriX, PriL and PriS form a stable heterotrimer (PriSLX). Both PriSX and PriSLX show far greater affinity for nucleotide substrates and are substantially more active in primer synthesis than the PriSL heterodimer. In addition, PriL, but not PriX, facilitates primer extension by PriS. We propose that the catalytic activity of PriS is modulated through concerted interactions with the two noncatalytic subunits in primer synthesis. In large scale bioinformatic study, we developed a major update of the Clusters of Orthologous Groups of proteins (COGs) database. Microbial genome sequencing projects produce numerous sequences of deduced proteins, only a small fraction of which have been or will ever be studied experimentally. This leaves sequence analysis as the only feasible way to annotate these proteins and assign to them tentative functions. The COG database that was first created in 1997, has been a popular tool for functional annotation. Its success was largely based on (i) its reliance on complete microbial genomes, which allowed reliable assignment of orthologs and paralogs for most genes; (ii) orthology-based approach, which used the function(s) of the characterized member(s) of the protein family (COG) to assign function(s) to the entire set of carefully identified orthologs and describe the range of potential functions when there were more than one; and (iii) careful manual curation of the annotation of the COGs, aimed at detailed prediction of the biological function(s) for each COG while avoiding annotation errors and overprediction. Here we present an update of the COGs, the first since 2003, and a comprehensive revision of the COG annotations and expansion of the genome coverage to include representative complete genomes from all bacterial and archaeal lineages down to the genus level. This re-analysis of the COGs shows that the original COG assignments had an error rate below 0.5% and allows an assessment of the progress in functional genomics in the past 12 years. During this time, functions of many previously uncharacterized COGs have been elucidated and tentative functional assignments of many COGs have been validated, either by targeted experiments or through the use of high-throughput methods. A particularly important development is the assignment of functions to several widespread, conserved proteins many of which turned out to participate in translation, in particular rRNA maturation and tRNA modification. The new version of the COGs is expected to become an important tool for microbial genomics. These studies further enhance the existing understanding of the evolutionary plasticity and modularity of protein domain architectures and in addition, provide numerous experimentally testable predictions of biologically important protein functions.

Funding Agency

Agency: National Institute of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM000061-22
Application #: 9160908
Study Section

Project Start
Project End
Budget Start
Budget End
Support Year: 22
Fiscal Year: 2015
Total Cost
Indirect Cost

Institution

Name: National Library of Medicine
Department
Type
DUNS #

City
State
Country
Zip Code

Related projects

Publications

Krupovic, Mart; Cvirkaite-Krupovic, Virginija; Iranzo, Jaime et al. (2018) Viruses of archaea: Structural, functional, environmental and evolutionary genomics. Virus Res 244:181-193

Yutin, Natalya; Makarova, Kira S; Gussow, Ayal B et al. (2018) Discovery of an expansive bacteriophage family that includes the most abundant viruses from the human gut. Nat Microbiol 3:38-46

He, Fei; Bhoobalan-Chitty, Yuvaraj; Van, Lan B et al. (2018) Anti-CRISPR proteins encoded by archaeal lytic viruses inhibit subtype I-D immunity. Nat Microbiol 3:461-469

Shmakov, Sergey A; Makarova, Kira S; Wolf, Yuri I et al. (2018) Systematic prediction of genes functionally linked to CRISPR-Cas systems by gene neighborhood analysis. Proc Natl Acad Sci U S A 115:E5307-E5316

Pushkarev, Alina; Inoue, Keiichi; Larom, Shirley et al. (2018) A distinct abundant group of microbial rhodopsins discovered using functional metagenomics. Nature 558:595-599

Yutin, Natalya; Bäckström, Disa; Ettema, Thijs J G et al. (2018) Vast diversity of prokaryotic virus genomes encoding double jelly-roll major capsid proteins uncovered by genomic and metagenomic sequence analysis. Virol J 15:67

Ferrer, Manuel; Sorokin, Dimitry Y; Wolf, Yuri I et al. (2018) Proteomic Analysis of Methanonatronarchaeum thermophilum AMET1, a Representative of a Putative New Class of Euryarchaeota, ""Methanonatronarchaeia"". Genes (Basel) 9:

Koonin, Eugene V; Makarova, Kira S (2018) Discovery of Oligonucleotide Signaling Mediated by CRISPR-Associated Polymerases Solves Two Puzzles but Leaves an Enigma. ACS Chem Biol 13:309-312

Galperin, Michael Y; Makarova, Kira S; Wolf, Yuri I et al. (2018) Phyletic Distribution and Lineage-Specific Domain Architectures of Archaeal Two-Component Signal Transduction Systems. J Bacteriol 200:

Koonin, Eugene V; Makarova, Kira S; Wolf, Yuri I (2017) Evolutionary Genomics of Defense Systems in Archaea and Bacteria. Annu Rev Microbiol 71:233-261

Showing the most recent 10 out of 117 publications

Comments

Be the first to comment on Eugene Koonin's grant

Recent in Grantomics:

Recently viewed grants:

Recently added grants: