The rapidly growing database of completely and nearly completely sequenced genomes of bacteria, archaea, eukaryotes and viruses (several thousand genomes already available and many more in progress) creates both extensive new opportunities and major new challenges for genome research. During the year in review, we performed a variety of studies that took advantage of the genomic information to establish fundamental principles of genome evolution.
To a large extent, we have focused on cancer genome evolution. Cancer arises through the accumulation of somatic mutations over time. Understanding the order in which mutations occur during cancer progression can assist early and accurate diagnosis and improve clinical decision-making. We employed long short-term memory (LSTM) networks, a class of recurrent neural network, to learn the evolution of a tumor as an ordered sequence of mutations. We demonstrate the capacity of LSTMs to learn the complex dynamics of the mutational time series governing tumor progression, allowing accurate prediction of the mutational burden and of the occurrence of mutations in the sequence. Using the probabilities learned by the LSTM, we simulate mutational data and show that the simulation results are statistically indistinguishable from the empirical data. We identify passenger mutations that are significantly associated with established cancer drivers in the sequence and demonstrate that the genes carrying these mutations are substantially enriched in interactions with the corresponding driver genes. Breaking the network into modules consisting of driver genes and their interactors, we show that these interactions are associated with poor patient prognosis and thus likely confer a growth advantage during tumor progression. The application of LSTMs therefore enables the prediction of numerous additional conditional drivers and reveals hitherto unknown aspects of cancer evolution.
In another cancer genomics project, we explored proteomic and genomic signatures of repeat instability in cancer and adjacent normal tissues. Repetitive sequences are hotspots of evolution at multiple levels. However, due to difficulties involved in their assembly and analysis, the role of repeats in tumor evolution is poorly understood. We developed a rigorous motif-based methodology to quantify variations in the repeat content, beyond microsatellites, in proteomes and genomes directly from proteomic and genomic raw data. This method was applied to a wide range of tumors and normal tissues. We identify high similarity between repeat instability patterns in tumors and their patient-matched adjacent normal tissues. Nonetheless, tumor-specific signatures both in protein expression and in the genome strongly correlate with cancer progression and robustly predict the tumorigenic state. In a patient, the hierarchy of genomic repeat instability signatures accurately reconstructs tumor evolution, with primary tumors differentiated from metastases. We observe an inverse relationship between repeat instability and point mutation load within and across patients independent of other somatic aberrations. Thus, repeat instability is a distinct, transient, and compensatory adaptive mechanism in tumor evolution and a potential signal for early detection.
Additionally, we have continued intensive research into evolutionary genomics of viruses and antivirus defense systems. In particular, we carried out a detailed investigation of CRISPR-Cas systems encoded in mobile genetic elements and involved in counter-defence and other functions. The principal function of CRISPR-Cas systems in archaea and bacteria is defence against mobile genetic elements (MGEs), including viruses, plasmids and transposons. However, the relationships between CRISPR-Cas and MGEs are far more complex.
Several classes of MGE contributed to the origin and evolution of CRISPR-Cas, and, conversely, CRISPR-Cas systems and their components were recruited by various MGEs for functions that remain largely uncharacterized. We investigated and substantially expanded the range of CRISPR-Cas components carried by MGEs. Three groups of Tn7-like transposable elements encode 'minimal' type I CRISPR-Cas derivatives capable of target recognition but not cleavage, and another group encodes an inactivated type V variant. These partially inactivated CRISPR-Cas variants might mediate guide RNA-dependent integration of the respective transposons. Numerous plasmids and some prophages encode type IV systems, with similar predicted properties, that appear to contribute to competition among plasmids and between plasmids and viruses. Many prokaryotic viruses also carry CRISPR mini-arrays, some of which recognize other viruses and are implicated in inter-virus conflicts, and solitary repeat units, which could inhibit host CRISPR-Cas systems.
We have also developed a general theory of the origin of viruses from primordial replicators that recruited various cellular proteins for capsid formation. Viruses are ubiquitous parasites of cellular life and the most abundant biological entities on Earth. It is widely accepted that viruses are polyphyletic, but a consensus scenario for their ultimate origin is still lacking. Traditionally, three scenarios for the origin of viruses have been considered: descent from primordial, precellular genetic elements; reductive evolution from cellular ancestors; and escape of genes from cellular hosts, achieving partial replicative autonomy and becoming parasitic genetic elements. These classical scenarios give different timelines for the origin(s) of viruses and do not explain the provenance of the two key functional modules that are responsible, respectively, for viral genome replication and virion morphogenesis.
We developed a 'chimeric' scenario under which different types of primordial, selfish replicons gave rise to viruses by recruiting host proteins for virion formation. We also propose that new groups of viruses have repeatedly emerged at all stages of the evolution of life, often through the displacement of ancestral structural and genome replication genes. Taken together, these studies advance the existing understanding of the general principles and specific aspects of genome evolution in diverse life forms, in particular, viruses and mobile elements, as well as cancer genome evolution.