Finding Protein Sequence Motifs - Methods and Applications

Koonin, Eugene

Abstract

The rapid accumulation of genome sequences and protein structures during the last decade has been paralleled by major advances in sequence database search methods. The powerful Position-Specific Iterated BLAST (PSI-BLAST) method developed at the NCBI forms the basis of our work on protein motif analysis. In addition, hidden Markov models (HMMs), protein profile-against-profile comparison implemented in the HHSearch method, protein structure comparison methods, homology modeling of protein structure and genome context analysis were extensively and increasingly applied. Furthermore, custom libraries of protein domain profiles as well as computational pipelines for novel domain identification have been developed and applied. During the year under review, we have continued and expanded our investigation of the protein domains that are involved in virus-host interactions in prokaryotes. In particular, we have revealed unexpected connections between type VI-B CRISPR-Cas systems, bacterial natural competence, the ubiquitin signaling network and DNA modification through a distinct family of membrane proteins. In addition to core Cas proteins, CRISPR-Cas loci often encode ancillary proteins that modulate the activity of the respective effectors in interference. Subtype VI-B1 CRISPR-Cas systems encode the Csx27 protein that down-regulates the activity of Cas13b when the type VI-B locus is expressed in Escherichia coli. We show that Csx27 belongs to an expansive family of proteins that contain four predicted transmembrane helices and are typically encoded in predicted operons with components of the bacterial natural transformation machinery, multidomain proteins that consist of components of the ubiquitin signaling system, and proteins containing the ligand-binding WYL domain and a helix-turn-helix domain.
The Csx27 family proteins are predicted to form membrane channels for ssDNA that might comprise the core of a putative novel, Ub-regulated system for DNA uptake and, possibly, degradation. In addition to these associations, a distinct subfamily of the Csx27 family appears to be a part of a novel, membrane-associated system for DNA modification. In Bacteroidetes, subtype VI-B1 systems might degrade nascent transcripts of foreign DNA in conjunction with its uptake by the bacterial cell. These predictions suggest several experimental directions for the study of type VI CRISPR-Cas systems and distinct mechanisms of foreign DNA uptake and degradation in bacteria. Additionally, we have identified a highly derived class 1 CRISPR-Cas system in Haloarchaea that contains diverged Cas5 and Cas7 homologs but no CRISPR array. Screening of genomic and metagenomic databases for new variants of CRISPR-Cas systems increasingly results in the discovery of derived variants that do not seem to possess the interference capacity and are implicated in functions distinct from adaptive immunity. We describe an extremely derived putative class 1 CRISPR-Cas system that is present in many Halobacteria and consists of distant homologs of the Cas5 and Cas7 proteins along with an uncharacterized conserved protein and various nucleases. We hypothesize that, although this system lacks typical CRISPR effectors or a CRISPR array, it functions as an RNA-dependent defense mechanism that, unlike other derived CRISPR-Cas systems, utilizes alternative nucleases to cleave invader genomes. We further expanded our study of protein domains to investigate polymorphic toxins and other hypervariable protein domains that are involved in host-parasite interactions and interspecies conflicts in microbes. Numerous, diverse, highly variable defense and offense genetic systems are encoded in most bacterial genomes and are involved in various forms of conflict among competing microbes or their eukaryotic hosts.
We focused on the offense and self-versus-nonself discrimination systems encoded by archaeal genomes that so far have remained largely uncharacterized and unannotated. Specifically, we analyzed archaeal genomic loci encoding polymorphic and related toxin systems and ribosomally synthesized antimicrobial peptides. Using sensitive methods for sequence comparison and the guilt-by-association approach, we identified such systems in 141 archaeal genomes. These toxins can be classified into four major groups based on the structure of the components involved in the toxin delivery. The toxin domains are often shared between and within each system. We revisit halocin families and substantially expand the halocin C8 family, which was identified in diverse archaeal genomes and also certain bacteria. Finally, we employ features of protein sequences and genomic locus organization characteristic of archaeocins and polymorphic toxins to identify candidates for analogous but not necessarily homologous systems among uncharacterized protein families. This work confidently predicts that more than 1,600 archaeal proteins, currently annotated as hypothetical in public databases, are components of conflict and self-versus-nonself discrimination systems, and it is expected to stimulate experimental research to advance the understanding of poorly characterized major aspects of archaeal biology. In a more theoretical vein, we have explored the 'grammar' of protein domains encoded in genomes across the diversity of life. From an abstract, informational perspective, protein domains appear analogous to words in natural languages in which the rules of word association are dictated by linguistic rules, or grammar. Such rules exist for protein domains as well, because only a small fraction of all possible domain combinations is viable in evolution.
We employ a popular linguistic technique, n-gram analysis, to probe the proteome grammar, that is, the rules of association of domains that generate various domain architectures of proteins. Comparison of the complexity measures of protein languages in major branches of life shows that the relative entropy difference (information gain) between the observed domain architectures and random domain combinations is highly conserved in evolution and is close to being a universal constant, at 1.2 bits. Substantial deviations from this constant are observed in only two major groups of organisms: a subset of Archaea that appears to be cells simplified to the limit, and animals that display extreme complexity. We also identify the n-grams that represent signatures of the major branches of cellular life. The results of this analysis bolster the analogy between genomes and natural language and show that a quasi-universal grammar underlies the evolution of domain architectures in all divisions of cellular life. The nearly universal value of information gain by the domain architectures could reflect the minimum complexity of signal processing that is required to maintain a functioning cell. The research performed over the last year has led to further progress in the study of the classification, evolution, and functions of several classes of proteins and domains, particularly those involved in host-parasite interactions and other forms of biological conflicts, as well as the theory of protein domain architecture evolution. These findings have potential implications for human health and for developments in biotechnology.
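
The n-gram analysis of domain architectures mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify how the information gain was computed, so the code below is only one plausible reading: it treats each protein as an ordered list of domain names, extracts bigrams, and measures the relative entropy (in bits) between the observed bigram frequencies and the frequencies expected if adjacent domains were paired independently. The function names and the domain labels D1 to D4 are hypothetical placeholders, not data or code from the project.

```python
import math
from collections import Counter


def ngrams(architecture, n=2):
    """Return the contiguous n-grams of a domain architecture (a list of names)."""
    return [tuple(architecture[i:i + n]) for i in range(len(architecture) - n + 1)]


def information_gain_bits(architectures):
    """Relative entropy (bits) of observed bigrams vs. independent pairing.

    Assumption: "random domain combinations" is modeled here as the product of
    the marginal frequencies of the left and right domain of each bigram.
    The study itself may have used a different null model.
    """
    bigram_counts = Counter()
    left_counts = Counter()
    right_counts = Counter()
    for arch in architectures:
        for left, right in ngrams(arch, 2):
            bigram_counts[(left, right)] += 1
            left_counts[left] += 1
            right_counts[right] += 1

    total = sum(bigram_counts.values())
    if total == 0:
        # No multidomain proteins, so there are no bigrams to compare.
        return 0.0

    gain = 0.0
    for (left, right), count in bigram_counts.items():
        p_observed = count / total
        p_expected = (left_counts[left] / total) * (right_counts[right] / total)
        gain += p_observed * math.log2(p_observed / p_expected)
    return gain


if __name__ == "__main__":
    # Toy proteomes with placeholder domain names (hypothetical data).
    constrained = [["D1", "D2"], ["D1", "D2"], ["D3", "D4"], ["D3", "D4"]]
    unconstrained = [["D1", "D2"], ["D1", "D4"], ["D3", "D2"], ["D3", "D4"]]
    print(information_gain_bits(constrained))
    print(information_gain_bits(unconstrained))
```

In the first toy proteome each domain has exactly one permitted partner, so the gain is 1 bit; in the second every pairing occurs equally often, so the gain is 0. Under this reading, a gain above zero reflects a "grammar" that restricts which domains may follow one another.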

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM000061-26
Application #: 10007520
Support Year: 26
Fiscal Year: 2019

Institution

Name: National Library of Medicine

Publications

Krupovic, Mart; Cvirkaite-Krupovic, Virginija; Iranzo, Jaime et al. (2018) Viruses of archaea: Structural, functional, environmental and evolutionary genomics. Virus Res 244:181-193

Yutin, Natalya; Makarova, Kira S; Gussow, Ayal B et al. (2018) Discovery of an expansive bacteriophage family that includes the most abundant viruses from the human gut. Nat Microbiol 3:38-46

He, Fei; Bhoobalan-Chitty, Yuvaraj; Van, Lan B et al. (2018) Anti-CRISPR proteins encoded by archaeal lytic viruses inhibit subtype I-D immunity. Nat Microbiol 3:461-469

Shmakov, Sergey A; Makarova, Kira S; Wolf, Yuri I et al. (2018) Systematic prediction of genes functionally linked to CRISPR-Cas systems by gene neighborhood analysis. Proc Natl Acad Sci U S A 115:E5307-E5316

Pushkarev, Alina; Inoue, Keiichi; Larom, Shirley et al. (2018) A distinct abundant group of microbial rhodopsins discovered using functional metagenomics. Nature 558:595-599

Yutin, Natalya; Bäckström, Disa; Ettema, Thijs J G et al. (2018) Vast diversity of prokaryotic virus genomes encoding double jelly-roll major capsid proteins uncovered by genomic and metagenomic sequence analysis. Virol J 15:67

Ferrer, Manuel; Sorokin, Dimitry Y; Wolf, Yuri I et al. (2018) Proteomic Analysis of Methanonatronarchaeum thermophilum AMET1, a Representative of a Putative New Class of Euryarchaeota, "Methanonatronarchaeia". Genes (Basel) 9:

Koonin, Eugene V; Makarova, Kira S (2018) Discovery of Oligonucleotide Signaling Mediated by CRISPR-Associated Polymerases Solves Two Puzzles but Leaves an Enigma. ACS Chem Biol 13:309-312

Galperin, Michael Y; Makarova, Kira S; Wolf, Yuri I et al. (2018) Phyletic Distribution and Lineage-Specific Domain Architectures of Archaeal Two-Component Signal Transduction Systems. J Bacteriol 200:

Smargon, Aaron A; Cox, David B T; Pyzocha, Neena K et al. (2017) Cas13b Is a Type VI-B CRISPR-Associated RNA-Guided RNase Differentially Regulated by Accessory Proteins Csx27 and Csx28. Mol Cell 65:618-630.e7

Showing the most recent 10 out of 117 publications
