Ribonucleic acid (RNA) molecules play important roles in many biological processes, including gene expression and regulation. An RNA molecule is a linear polymer that folds back on itself to form a three-dimensional functional structure. While experimental determination of precise RNA structures is a time-consuming and costly process, useful information about the molecule can be gained from knowing its secondary structure, a collection of hydrogen-bonded base pairs. Structural elements in RNA secondary structures can be separated into two large categories: stem-loops and pseudoknots. Development of mathematical models and prediction algorithms for simple stem-loop structures began in the 1980s. However, the tremendous demand on computer memory and time by pseudoknot prediction remains a computing challenge even today. The recently developed grid computing technology can offer a possible solution to this challenge. In this project, the investigators shall address some mathematical problems associated with the grid computing approach to RNA secondary structure prediction. To partition a large RNA molecule into smaller segments assigned to different computers on the grid, a good cutting strategy is necessary. The investigators propose to develop probabilistic models to study the inversion distribution in RNA sequences and to combine the results with the general theory of excursions to maximize the prediction accuracy using an optimal RNA segment length. The mathematical results will be integrated into a toolset for computational RNA structure analysis to test their applicability. The toolset will be used to investigate the possible association of pseudoknot types with functions using data in public domains and applied to the prediction of secondary structures adopted by nodavirus genomes. The prediction results will be compared with the secondary structures experimentally determined by mutational studies.

Studies on RNA secondary structures and functions contribute significantly to the understanding and control of RNA viruses in plant and animal diseases. Computational methods recently used in these RNA studies have been shown to greatly reduce time and cost. Solving computationally intensive problems is now feasible using grid computing technology that simulates high-performance computing on a superstructure of networked computers. By combining rigorous mathematical methods, current computing technology, and careful experimental verification, this project develops an interactive investigative approach in an interdisciplinary research endeavor. The validation of computational prediction results by wet-lab experiments will encourage experimental scientists to make use of computing technology with confidence to assist their scientific pursuits. Such a collaborative effort will have significant impacts on education and training of new scientists by promoting the concept of interdisciplinary research designs, by enhancing diversity in the student population from different geographical and economic areas, and by encouraging the development of long-term collaborative research and educational partnerships between minority-serving institutions and research universities. The work of these investigators will result in a research product for viral RNA genome analyses with mathematical concepts useful for studying many other patterns in genetic sequences. This project represents a direct contribution of mathematics through the advancement of computing technologies to the elucidation of diverse biological functions of RNA.

Project Report

An RNA (ribonucleic acid) molecule is a single-stranded linear polymer made up of four types of nucleotide bases: Adenine (A), Cytosine (C), Guanine (G), and Uracil (U). Among the four nucleotide bases, C and G form complementary base pairs by hydrogen bonding, as do A and U; in RNA (but not DNA), G can also base pair with U. RNA serves many important cellular functions, including gene expression, protein synthesis, and innate immune responses. Many viral genomes consist entirely of RNA, which must be translated into proteins and amplified by RNA replication. Examples of RNA viruses include the HIV and H1N1 viruses that caused the global AIDS and flu epidemics in recent decades, as well as the agriculturally important nodaviruses that infect fish and insects worldwide. The linear strand of RNA tends to fold back on itself to form a 3-dimensional (3D) functional structure, mostly by pairing complementary bases. As the 3D structure of an RNA molecule is often the key to its function, RNA structure has always been a subject of interest. However, because of the instability of RNA molecules, experimental determination of their precise 3D structures is a time-consuming and costly process. Fortunately, useful information about the molecule can still be gained by just knowing its secondary structure, which is defined as the collection of hydrogen-bonded base pairs in the molecule. Development of mathematical models and computational prediction algorithms for RNA secondary structure began as early as the 1980s. These prediction algorithms are generally based on a search, over the complete RNA molecule, for optimal and suboptimal structures with minimal free energy, the free energy here being the amount of energy required to completely unpair all of the hydrogen-bonded base pairs that hold the structure together (e.g., by denaturing it with heat).
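The pairing rules and the definition of secondary structure given above can be illustrated with a minimal Python sketch. It is not part of the project's software: the function names are hypothetical, and the dot-bracket string used to write down a structure is a common convention assumed here, not something the report specifies. A single bracket type cannot represent the crossing pairs of a pseudoknot, so the sketch covers stem-loop structures only.

```python
# Pairing rules stated in the text: C-G, A-U, and (in RNA only) G-U.
ALLOWED_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("C", "G"), ("G", "C"),
    ("G", "U"), ("U", "G"),
}


def can_pair(a, b):
    """Return True if bases a and b may form a hydrogen-bonded pair."""
    return (a.upper(), b.upper()) in ALLOWED_PAIRS


def pairs_from_dot_bracket(structure):
    """Convert a dot-bracket string into a sorted list of 0-based (i, j) pairs.

    Assumed convention: '(' and ')' mark paired positions, '.' marks
    an unpaired position.
    """
    stack, pairs = [], []
    for j, ch in enumerate(structure):
        if ch == "(":
            stack.append(j)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % j)
            pairs.append((stack.pop(), j))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return sorted(pairs)


def is_valid_secondary_structure(seq, pairs):
    """Check that every pair obeys the pairing rules and no base pairs twice."""
    used = set()
    for i, j in pairs:
        if i in used or j in used or not can_pair(seq[i], seq[j]):
            return False
        used.update((i, j))
    return True
```

For example, the sequence GGGAAAUCC with structure (((...))) is a small stem-loop whose innermost pair is a G-U pair; `pairs_from_dot_bracket` turns the structure into index pairs, and `is_valid_secondary_structure` accepts them for that sequence.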
The elements that make up the secondary structure of RNA can be separated into two large categories: stem-loops and pseudoknots. While prediction of stem-loops is relatively straightforward, the tremendous demand on computer memory and time by pseudoknot prediction for large RNA molecules remains a computational challenge. Our project aims at devising a general approach that will efficiently predict secondary structures in RNA. Observing that both the stem-loop and pseudoknot structures must contain at least one inversion, i.e., a string of nucleotides followed closely by its inverse complementary sequence, our approach is to first strategically cut a long RNA sequence into smaller chunks based on statistical properties of inversion distributions, then predict the secondary structure of each chunk individually, on multiple processors running in parallel, using any existing prediction algorithm, and finally assemble the prediction results to give the structure of the original sequence. We tested the approach on RNA molecules whose secondary structures have been experimentally established previously. In addition to the expected improvement in efficiency attributed to the use of parallel computing, our results indicate that the average prediction accuracy is also enhanced. These findings suggest that local structures, formed by pairings among nucleotides in close proximity and based on local rather than global minimal free energies, may better correlate with the real molecular structure of long RNA sequences. While the hypothesis has yet to be supported by more experimental evidence, our approach, if proved correct, will open the door to a new generation of structure prediction methods based on sequence segmentation. We have integrated this method into a publicly accessible software toolset for RNA Secondary Structure Analysis (RNASSA), and applied these programs to analyze the Nodamura Virus.
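The cut, predict, and assemble steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the RNASSA implementation: the function names and the parameters `stem_len`, `max_gap`, and `target_len` are hypothetical, inversions are matched with A-U and C-G complements only (G-U pairs are ignored for simplicity), and the cut rule shown (move a cut to the right until it no longer splits an inversion) is a simple stand-in for the project's strategy based on statistical properties of inversion distributions.

```python
# Complement table; G-U pairing is deliberately ignored in this sketch.
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def reverse_complement(s):
    """Return the inverse complementary sequence of s."""
    return "".join(COMPLEMENT[b] for b in reversed(s))


def find_inversions(seq, stem_len=4, max_gap=8):
    """Return (start, end) spans, end exclusive, where a stem_len-mer is
    followed within max_gap bases by its inverse complement."""
    spans = []
    n = len(seq)
    for i in range(n - 2 * stem_len + 1):
        target = reverse_complement(seq[i:i + stem_len])
        for gap in range(max_gap + 1):
            j = i + stem_len + gap
            if j + stem_len > n:
                break
            if seq[j:j + stem_len] == target:
                spans.append((i, j + stem_len))
    return spans


def choose_cuts(seq_len, spans, target_len):
    """Place cuts roughly every target_len bases, shifting each cut to the
    right until it no longer falls strictly inside an inversion span."""
    cuts = []
    pos = target_len
    while pos < seq_len:
        while pos < seq_len and any(s < pos < e for s, e in spans):
            pos += 1
        if pos < seq_len:
            cuts.append(pos)
        pos += target_len
    return cuts


def split_at(seq, cuts):
    """Return (offset, chunk) tuples for the segments defined by cuts."""
    bounds = [0] + list(cuts) + [len(seq)]
    return [(a, seq[a:b]) for a, b in zip(bounds, bounds[1:])]


def assemble(chunk_results):
    """Merge per-chunk base pairs into coordinates of the full sequence.

    chunk_results is a list of (offset, pairs), where pairs are 0-based
    (i, j) positions inside the chunk starting at offset.
    """
    return sorted((off + i, off + j)
                  for off, pairs in chunk_results
                  for i, j in pairs)
```

In a full pipeline, each chunk returned by `split_at` would be handed to an existing secondary structure predictor on its own processor, and the per-chunk base pairs would then be passed to `assemble`. Because the chunks are predicted independently, base pairs that span a cut cannot be recovered, which is why the choice of cut positions matters.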
A predicted stem-loop structure close to one terminal of the RNA2 of the virus genome was experimentally verified by mutational studies in the molecular virology lab, and found to be an essential signal for RNA replication. Through this project, we have developed an interactive investigative approach that combines rigorous mathematical methods, current computing technology, and careful experimental verification in our research endeavors. The collaborative project between the University of Texas at El Paso and the University of Delaware has provided training and support to over 25 graduate and undergraduate research students during the past five years. Among these students, over 50% are women or underrepresented minorities. Working across multiple disciplines, institutions, and cultures has generated a unique experience for everyone involved in the project.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Application #
0800272
Program Officer
Mary Ann Horn
Project Start
Project End
Budget Start
2008-06-01
Budget End
2013-05-31
Support Year
Fiscal Year
2008
Total Cost
$415,632
Indirect Cost
Name
University of Texas at El Paso
Department
Type
DUNS #
City
El Paso
State
TX
Country
United States
Zip Code
79968