Evolutionary patterns and processes are imprinted in the structure of nucleic acid molecules. In this project, an approach that embeds structure and function directly into phylogenetic analysis is used to search for these patterns and processes in the structure of non-protein coding RNA (ncRNA). Structural features are treated as ordered multi-state phylogenetic characters, and the transformation from one character state to another is 'polarized' by invoking an evolutionary tendency towards molecular order. In an effort at synthesis, the approach is here used to integrate the information found in RNA structure with that in the phylogenomics of protein architecture, exploring the evolution of the protein biosynthetic machinery in light of other cellular processes. The project provides global evolutionary views of ncRNA at the structural and substructural levels and explores the origins of modern biochemistry. Major objectives include the following: (1) Global analysis of ncRNA structure: A molecular morphospace that describes the architecture and biophysics of RNA at a global level is used to compare different classes of ncRNA molecules (from small RNA to large RNA ensembles) and produce pan-molecular phylogenies of structure. These phylogenies generate timelines of molecular diversification and tools for evolutionary classification. (2) Global analysis of ncRNA substructures: Phylogenetic trees of molecular substructures in molecules and molecular repertoires are generated together with evolutionary heat maps that map ancestries onto two- and three-dimensional molecular representations. The approach establishes evolutionary origins for the structures of the ncRNAs examined and relative timelines of substructural diversification in molecules, molecular ensembles, and the evolved ncRNA world.
(3) Models of structural evolution for ncRNAs: Matrices of character transformation costs and models of evolution defined by trees of substructures are used to reveal processes driving the evolution of RNA structure. (4) Synthesis of knowledge derived from RNA structure and protein architecture: Phylogenetic information embedded in trees of RNA molecules and substructures is integrated with phylogenomic information in trees of protein folds, fold superfamilies, and domain combinations. Timelines of protein and ncRNA evolution are linked to timelines of general biological processes. (5) Evolution of the protein biosynthetic machinery: The generation of trees of molecules and substructures is coupled to phylogenetic constraint analyses and interaction maps that compare the evolution of functional substructures in different molecules, revealing diversification patterns and structural co-variation in components of the translation machinery. Synthesis is an essential component of scientific inquiry. This project is novel and integrates research related to the evolution of the modern protein and RNA worlds using molecular survey, history reconstruction, and computational analysis. It also provides a unique and phylogenetically deep view of biology. Linking macromolecular structure and function is key to understanding cellular functions and how these evolve. This knowledge provides manifold benefits. RNA molecules are important catalysts and play crucial roles in cellular processes. They are, for example, used in RNA interference and antisense targeting applications that are economically important. However, the expanding repertoire of known functional RNA demands efforts to catalogue and classify RNA structures comparable to those in proteins. Finally, translational research is needed to interpret the fruits of functional and structural genomics.
Evolutionary bioinformatics here offers unprecedented opportunities to address fundamental issues in the history of our natural world. Insights from the project are shared with the community through scientific meetings and short workshops. In addition, the research benefits society by providing unique educational opportunities, including research participation by undergraduate students belonging to groups underrepresented in science and the involvement of middle school teachers in the development of educational tools and other activities.
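As an illustration of what ordered multi-state characters and matrices of character transformation costs can look like in practice, the following minimal sketch scores one ordered character on a small rooted tree with the standard Sankoff parsimony algorithm. It is not the project's software. The function names, the toy tree, the leaf states, and the choice to model polarization by fixing the state at the root are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Tuple, Union

# A rooted binary tree: a leaf is a name, an internal node is a (left, right) pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def ordered_cost_matrix(n_states: int) -> List[List[int]]:
    """Cost matrix for an ordered character: moving from state i to j costs |i - j| steps."""
    return [[abs(i - j) for j in range(n_states)] for i in range(n_states)]


def sankoff(tree: Tree, leaf_states: Dict[str, int], cost: List[List[int]]) -> List[float]:
    """Return, for each possible state at this node, the minimum cost of the subtree below it."""
    n = len(cost)
    if isinstance(tree, str):
        # A leaf is compatible only with its observed state.
        return [0 if s == leaf_states[tree] else float("inf") for s in range(n)]
    left, right = tree
    lv = sankoff(left, leaf_states, cost)
    rv = sankoff(right, leaf_states, cost)
    return [
        min(cost[s][t] + lv[t] for t in range(n))
        + min(cost[s][t] + rv[t] for t in range(n))
        for s in range(n)
    ]


def tree_length(tree: Tree, leaf_states: Dict[str, int], cost: List[List[int]],
                root_state: Optional[int] = None) -> float:
    """Parsimony length of one character; optionally fix the root (ancestral) state."""
    root_costs = sankoff(tree, leaf_states, cost)
    return min(root_costs) if root_state is None else root_costs[root_state]


# Hypothetical example: four molecules, one structural character with states 0..3.
toy_tree: Tree = (("A", "B"), ("C", "D"))
toy_states = {"A": 0, "B": 1, "C": 3, "D": 3}
costs = ordered_cost_matrix(4)

print(tree_length(toy_tree, toy_states, costs))                # root state left free
print(tree_length(toy_tree, toy_states, costs, root_state=0))  # state 0 assumed ancestral
print(tree_length(toy_tree, toy_states, costs, root_state=3))  # state 3 assumed ancestral
```

In this toy case, assuming state 0 is ancestral costs one more step than assuming state 3, which shows how a polarization hypothesis can change the length assigned to the same tree.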
BACKGROUND: The origin and evolution of modern biochemistry are shrouded in mystery. The widely embraced ‘RNA world’ model of the origin of life, based on the premise that ‘genetics’ (nucleic acids and replication) preceded ‘metabolism’ (proteins and catalysis), is troublesome on many grounds and violates the principle of continuity treasured by evolutionary biologists. For example, the RNA world has not been persistent. It relinquished crucial catalytic and replication abilities to proteins (e.g., enzymes of central metabolism, synthetases, polymerases) and its remnants cannot function without them (e.g., ribosomes, RNase P, tRNA). Persistence is a necessary consequence of canalization and coordinated evolution (coevolution) that manifest in molecular structure, from highly dynamic conformers to folds that are almost immutable. These properties should have constrained the early evolution of the macromolecular machinery of life. STRATEGY AND INTELLECTUAL MERIT: In order to study the rise of modern biochemistry and patterns of canalization, coevolution and diversification in the world of molecules, we embed structure and function directly into phylogenetic analysis, studying thousands of non-protein coding RNA and millions of proteins. Structural features are treated as ordered multi-state phylogenetic characters. Transformation from one character state to another is ‘polarized’ by invoking an evolutionary tendency towards conformational order and increased abundance. Phylogenomic analyses provide an unprecedented historical account of the gradual discovery of RNA structures, primordial proteins, cofactors, and molecular functions, and establish crucial chronologies of protein-RNA interactions that link RNA and protein history. We study global historical patterns but also focus on the structural and functional annotations and history of the most ancient proteins and ribonucleoprotein ensembles, such as the RNase P complex and the ribosome.
The existence of a molecular clock of protein structures places protein evolution on geological timescales. FINDINGS: Our timelines show that the complexity of modern biochemistry developed gradually on early Earth as new molecules and structures populated the emerging cellular systems. We show how primordial functions are linked to folded structures and how their interaction with cofactors expanded the functional repertoire. We also reveal that protocell membranes played a crucial role in early protein evolution and show that translation started relatively late with RNA and thioester cofactor-mediated aminoacylation. Our findings allow elaboration of an evolutionary model of early biochemistry that is firmly grounded in phylogenomic information and biochemical, biophysical and structural knowledge. The model describes how primordial alpha-helical bundles stabilized membranes, how these were decorated by layered arrangements of beta-sheets and alpha-helices, and how these arrangements became globular. Ancient forms of aminoacyl-tRNA synthetase (aaRS) catalytic domains and ancient nonribosomal protein synthetase (NRPS) modules gave rise to primordial protein synthesis and the ability to generate a code for specificity in their active sites. These structures diversified, producing cofactor-binding molecular switches and barrel structures. Accretion of domains and molecules gave rise to modern aaRSs, NRPS, and ribosomal ensembles, first organized around novel emerging cofactors (tRNA and carrier proteins) and then around more complex cofactor structures (rRNA). The model explains how the generation of protein structures acted as a scaffold for nucleic acids and resulted in the crystallization of modern translation. Remarkably, our analysis also revealed coevolution of ribosomal protein (r-protein) and rRNA structural components of the ribosome.
Their relative ages were indexed with functional and molecular contact information and were mapped (colored) onto three-dimensional models of the ribosome (see figure). The outcomes of our studies were unexpected: (1) Subunit RNA and proteins coevolved tightly, starting with interactions between the oldest proteins (S12 and S17) and the oldest rRNA helix in the small subunit (the ribosomal ratchet responsible for ribosomal dynamics) and ending with the rise of a modern multisubunit ribosome; (2) A major transition in evolution ~3.1 billion years (Gy) ago brought independently evolving ribosomal subunits together by unfolding inter-subunit (bridge) contacts and interactions with full cloverleaf tRNA structures; (3) During this transition, a fully fledged peptidyl transferase center (PTC) responsible for protein synthesis appeared by duplication of local helical structures, supporting an appealing model of PTC origin; and (4) A second evolutionary transition occurred almost concurrently with the ‘great oxidation event’ of our planet (~2.4 Gy ago) and involved the discovery of the L7/L12 protein complex that stimulates the GTPase activity of elongation factor G. This second transition must have notably enhanced ribosomal efficiency. BROADER IMPACTS: Synthesis is an essential component of scientific inquiry. This project has uniquely integrated the evolution of the modern protein and RNA worlds. Molecular survey, history reconstruction, and computational analysis provided unprecedented, phylogenetically deep views of biology. The existence of coevolutionary patterns in the ribosome and evolutionary timelines of domains and molecular functions falsify the RNA world hypothesis. Instead, phylogenomic analyses support the gradual coevolution of proteins and nucleic acids.
Research advances have been extended to the community through scientific meetings, workshops, undergraduate minority participation, and involvement of middle school teachers in educational tool development and other activities.