The rapid accumulation of genome sequences and protein structures during the last decade has been paralleled by major advances in sequence database search methods. The powerful Position-Specific Iterated BLAST (PSI-BLAST) method developed at the NCBI forms the basis of our work on protein motif analysis. In addition, Hidden Markov Models (HMMs), protein profile-against-profile comparison as implemented in the HHSearch method, protein structure comparison methods, homology modeling of protein structure, and genome context analysis were applied extensively and increasingly. Furthermore, custom libraries of protein domain profiles as well as computational pipelines for novel domain identification have been developed and applied. During the year under review, we have continued and expanded our investigation of the protein domains that are involved in virus-host interactions in prokaryotes. In particular, we have revealed unexpected connections between type VI-B CRISPR-Cas systems, bacterial natural competence, the ubiquitin signaling network, and DNA modification through a distinct family of membrane proteins. In addition to core Cas proteins, CRISPR-Cas loci often encode ancillary proteins that modulate the activity of the respective effectors in interference. Subtype VI-B1 CRISPR-Cas systems encode the Csx27 protein, which down-regulates the activity of Cas13b when the type VI-B locus is expressed in Escherichia coli. We show that Csx27 belongs to an expansive family of proteins that contain four predicted transmembrane helices and are typically encoded in predicted operons with components of the bacterial natural transformation machinery, multidomain proteins that consist of components of the ubiquitin signaling system, and proteins containing the ligand-binding WYL domain and a helix-turn-helix domain.
The Csx27 family proteins are predicted to form membrane channels for ssDNA that might comprise the core of a putative novel, ubiquitin (Ub)-regulated system for DNA uptake and, possibly, degradation. In addition to these associations, a distinct subfamily of the Csx27 family appears to be part of a novel, membrane-associated system for DNA modification. In Bacteroidetes, subtype VI-B1 systems might degrade nascent transcripts of foreign DNA in conjunction with its uptake by the bacterial cell. These predictions suggest several experimental directions for the study of type VI CRISPR-Cas systems and distinct mechanisms of foreign DNA uptake and degradation in bacteria. Additionally, we have identified a highly derived class 1 CRISPR-Cas system in Haloarchaea that contains diverged Cas5 and Cas7 homologs but no CRISPR array. Screening of genomic and metagenomic databases for new variants of CRISPR-Cas systems increasingly results in the discovery of derived variants that do not seem to possess the interference capacity and are implicated in functions distinct from adaptive immunity. We describe an extremely derived putative class 1 CRISPR-Cas system that is present in many Halobacteria and consists of distant homologs of the Cas5 and Cas7 proteins along with an uncharacterized conserved protein and various nucleases. We hypothesize that, although this system lacks typical CRISPR effectors and a CRISPR array, it functions as an RNA-dependent defense mechanism that, unlike other derived CRISPR-Cas systems, utilizes alternative nucleases to cleave invader genomes. We further expanded our study of protein domains to investigate polymorphic toxins and other hypervariable protein domains that are involved in host-parasite interactions and interspecies conflicts in microbes. Numerous, diverse, highly variable defense and offense genetic systems are encoded in most bacterial genomes and are involved in various forms of conflict among competing microbes or between microbes and their eukaryotic hosts.
We focused on the offense and self-versus-nonself discrimination systems encoded by archaeal genomes, which so far have remained largely uncharacterized and unannotated. Specifically, we analyzed archaeal genomic loci encoding polymorphic and related toxin systems and ribosomally synthesized antimicrobial peptides. Using sensitive methods for sequence comparison and the guilt-by-association approach, we identified such systems in 141 archaeal genomes. These toxin systems can be classified into four major groups based on the structure of the components involved in toxin delivery. The toxin domains are often shared within and between these systems. We revisit the halocin families and substantially expand the halocin C8 family, which was identified in diverse archaeal genomes and also in certain bacteria. Finally, we employ features of protein sequences and genomic locus organization characteristic of archaeocins and polymorphic toxins to identify candidates for analogous but not necessarily homologous systems among uncharacterized protein families. This work confidently predicts that more than 1,600 archaeal proteins, currently annotated as hypothetical in public databases, are components of conflict and self-versus-nonself discrimination systems. This work is expected to stimulate experimental research to advance the understanding of poorly characterized major aspects of archaeal biology. In a more theoretical vein, we have explored the 'grammar' of protein domains encoded in genomes across the diversity of life. From an abstract, informational perspective, protein domains appear analogous to words in natural languages, in which word association is dictated by linguistic rules, or grammar. Such rules exist for protein domains as well, because only a small fraction of all possible domain combinations is viable in evolution.
We employ a popular linguistic technique, n-gram analysis, to probe the proteome grammar, that is, the rules of association of domains that generate the various domain architectures of proteins. Comparison of the complexity measures of protein languages in major branches of life shows that the relative entropy difference (information gain) between the observed domain architectures and random domain combinations is highly conserved in evolution and is close to being a universal constant, at 1.2 bits. Substantial deviations from this constant are observed in only two major groups of organisms: a subset of Archaea that appear to be cells simplified to the limit, and animals, which display extreme complexity. We also identify the n-grams that represent signatures of the major branches of cellular life. The results of this analysis bolster the analogy between genomes and natural language and show that a quasi-universal grammar underlies the evolution of domain architectures in all divisions of cellular life. The nearly universal value of information gain by the domain architectures could reflect the minimum complexity of signal processing that is required to maintain a functioning cell. The research performed over the last year has led to further progress in the study of the classification, evolution, and functions of several classes of proteins and domains, particularly those involved in host-parasite interactions and other forms of biological conflicts, as well as in the theory of protein domain architecture evolution. These findings have potential implications for human health and for developments in biotechnology.
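To make the n-gram analysis and the information gain described above more concrete, the following is a minimal Python sketch, not the method actually used in this work. It assumes that each protein is represented as an ordered list of domain names, that n = 2 (bigrams), and that "random domain combinations" can be approximated by an independence null built from the marginal frequencies of the two bigram positions. The function names (`ngrams`, `bigram_information_gain`) and the toy domain labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def ngrams(architecture, n):
    """Return all contiguous n-grams (as tuples) of one domain architecture."""
    return [tuple(architecture[i:i + n])
            for i in range(len(architecture) - n + 1)]


def bigram_information_gain(architectures):
    """Relative entropy (in bits) of observed domain bigrams versus an
    independence null.

    Assumption: the null model is the product of the marginal frequencies
    of the first and second bigram positions. The exact null model and
    complexity measure of the original analysis are not specified here.
    """
    bigram_counts = Counter(
        bg for arch in architectures for bg in ngrams(arch, 2)
    )
    total = sum(bigram_counts.values())
    if total == 0:
        return 0.0  # no multidomain proteins, nothing to measure

    # Marginal counts of domains in the first and second bigram positions.
    first, second = Counter(), Counter()
    for (a, b), count in bigram_counts.items():
        first[a] += count
        second[b] += count

    gain = 0.0
    for (a, b), count in bigram_counts.items():
        p_observed = count / total
        p_random = (first[a] / total) * (second[b] / total)
        gain += p_observed * log2(p_observed / p_random)
    return gain


if __name__ == "__main__":
    # Toy proteome with hypothetical domain labels, for illustration only.
    toy_proteome = [
        ["D1", "D2"],
        ["D1", "D2"],
        ["D3", "D4"],
        ["D3", "D4"],
    ]
    print(bigram_information_gain(toy_proteome))
```

In this toy proteome, each domain occurs with exactly one partner, so the observed bigrams are twice as frequent as the independence null predicts and the gain is 1 bit. A proteome in which domains combine freely would give a value of 0.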