An approach based on combinatorics and probability is proposed to tackle a number of problems in computational biology. This is part of an exploratory research for an EAGER grant that aims at extracting simple and elegant mathematical abstractions rather than focusing on the ?nitty-gritty? details of the biological problem. The rationale is the following: (1) Properties of combinatorial objects lead directly to algorithms for solving the problems that generate them and (2) combinatorial and probabilistic methods provide many analytical tools that can be used for determining the worst-case and expected performance of these algorithms. To establish the validity of the approach, the PIs allow some breadth by considering three exemplar problems. This, however, is consistent with their long term goal to develop mathematical models that lead to viable computational algorithms and that can explain biological behavior at an aggregate level. RNA interaction: An siRNA-based (small interfering RNA) treatment that may ultimately counteract HIV is now not far fetched. An siRNA is a special example of an RNA molecule that interacts with another RNA (e.g. that of HIV), in this case to knockout the HIV gene. Because of its role in gene regulation mechanisms, RNA interaction has a potential to become a new class of drugs. The PIs will conduct a research based on RNA-RNA interaction graphs to predict RNA complexes resulting from the interaction of two RNAs. While this prediction problem is NP-complete, the PIs have proposed novel and efficient approximation algorithms that predict known and unusual RNA complexes in E. Coli. The PIs will improve the running time/approximation capability of the algorithms, apply the algorithms to a wide range of RNA complexes, and study the extension of the algorithms to handle multiple RNAs (not just two) using a combination of RNA-RNA interaction graphs and random search. Protein interaction sites: Similarly, protein interaction is crucial for determining the function of protein complexes. Interaction graphs, however, do not provide a suitable model here because protein interaction is more complex. Instead, predicting the interaction sites of a protein becomes a central task. The PIs will implement a combinatorial approach based on folding the protein on a torus (closed helix) and geometrically grouping amino acids with certain properties (e.g. hydrophobic) to obtain clusters. The clusters represent potential interaction sites. One motivation for this approach is that hydrophobic helices tend to stay away from the solvent and, hence, to interact. Together with a mathematical model of random tori (that will also be developed), this new approach will potentially eliminate the need to predict the 3D folding of the protein (highly intractable problem) and overcomes the simplicity of methods that, otherwise, are entirely sequences based (ignore the geometry). Low complexity sequences LCS: While structure resulting from folding and/or interaction is an important aspect, the lack of structure in proteins raises an important question about the function of their sequences, especially when those sequences are preserved. The cell wall genes of fungi contain an abundance of LCS that are mostly structure-free. Understanding the function of LCS will help guide any effective medical treatment, for instance against uterine infections, that must target the fungus through its cell wall interface. The PIs believe that LCS evolve using a mechanism similar to DNA replication error, resulting in sequences with large deviations in length (a power law distribution). The PIs propose a probabilistic model of evolution that explicitly accounts for lengths and exhibits a similar distribution. This model will help explain the type of evolution that LCS undergo and will shed light on their function in the cell wall. The model will also provide an alternative to alignment-based methods which, despite many existing efforts, usually fail in the presence of LCS. The intellectual merit of the proposal lies in providing essential foundation for a number of combinatorial/probabilistic problems that require a strong knowledge of various fields of mathematics and algorithms, and provide radical ways to capture different aspects of Biology. Therefore, while the research has roots in combinatorial mathematics and probability, it has simultaneously a broader impact on biological sciences. Despite Biology being the driving force, the formulations are general enough and extend the research beyond its biological significance: RNA interaction introduces an interesting geometric graph problem that avoids intersection of edges. Protein interaction sites lead to an elegant problem on regular graphs. Evolution of LCS is captured by a general random walk that is applicable for many systems that exhibits random elongation and shortening over time, e.g. words in a sentence (linguistics). The proposal has also a broader impact on the Hunter College QuBi (Quantitative Biology) initiative where it could provide substantial projects material for the newly developed QuBi courses.

Project Report

Multiple RNA Interaction The role of interaction between two or more RNA molecules has been increasingly recognized in regulatory mechanisms, including gene expression, methylation, and splicing. Pairwise interaction has been noted for regulating gene expression, e.g. when one RNA binds to the ribosome binding site of another mRNA, thus blocking its translation to protein. Typical scenarios of multiple RNA interaction involve the interaction of multiple small nucleolar RNAs (snoRNAs) with ribosomal RNAs (rRNAs) in guiding the methylation of the rRNAs, and multiple small nuclear RNAs (snRNA) with mRNAs in the splicing of introns. While the prediction of structures resulting from pairwise interactions is now somewhat understood, structure prediction in the context of multiple RNAs (more than just two) is still primitive. For instance, most of the existing approaches concatenate the RNAs into a single long RNA, which is then folded. On the one hand, this presents a challenge to existing folding algorithms, which are far less reliable when the RNA is too long. On the other hand, most folding algorithms prevent the formation of pseudoknots due to their increased computational complexity. While pseudoknots are rare in folded structures, they translate into kissing loops when spanning multiple RNAs, which are quite frequent in interacting RNA structures. Even though algorithms for the concatenation model with kissing loops exist, advances in pairwise interaction algorithms suggest that the latter are more adequate. Therefore, a promising approach is to adapt existing pairwise interaction algorithms to the case of multiple RNAs. However, even with the existence of a computational framework for a pairwise RNA-RNA interaction, it is reasonable to believe that interactions involving multiple RNAs are generally more complex to be treated pairwise. In addition, given a pool of RNAs, and without prior biological information, it is not trivial to predict which RNAs actually interact. This generally leads to a computational hurdle: when RNAs are treated pairwise, an immediate consequence is the greedy nature of the algorithm. The best interacting pair of RNAs will dominate the solution. This in turn will ``lock'' the interaction pattern of the whole ensemble into a sub-optimal state; thus preventing the correct structure from presenting itself as a solution. We established a mathematical formulation based on combinatorial optimization that overcomes the issues outlined above. The model derives from the premise that multiple RNA interaction may be the result of competing pairwise interactions that settle; and yet a simple pairwise handling of the RNAs will not capture that ``competition''. And while the resulting problem is NP-hard, it admits an approximation algorithm with provable bounds on optimality. The important feature is that we capitalize on existing pairwise interaction algorithms while avoiding the ``locking'' hurdle mentioned earlier. Protein Secondary Structure There is a long history of techniques that predict the formation of structure in proteins (whether it being secondary or tertiary), which are successful to some extent. However, we believe there are insightful connections to be made with abstract mathematical models of structure formation (not necessarily biological) that could shed some light on the evolutionary process that structural biology has undergone. This in turn will help make even better structural predictions. Perhaps it has always been amazing, even to the pioneers, how simple mechanisms of evolution can produce Life as we know it today. Recently, attempts have been made to articulate the same amazement with abstract models, e.g. for the sub-optimal nature of the genetic code, and for the role of sex. We had a notable success by independently joining this recent trend in the context of protein structure. Our abstract model is based on infinite binary words and a generalization of the physics of percolation in one dimension. In this model, structures emerge by ``playing against randomness'' as unusual clusters of neighboring 1s. Yet, preliminary results suggest that the model mimics structural biology in various aspects. In fact, despite the abstract nature of the model and its apparent disconnect from Biology, it derives from two simple premises: Sequences of biological origin, like proteins, are the result of a long evolutionary process. When treated as words, these real biological sequences will reveal patterns that are not likely to be seen in randomly generated words. Interactions among amino acids in a protein exhibit some locality. This locality is especially obvious in secondary structures where ``neighboring'' amino acids form helices and strands. The notions of "unusual" pattern and neighborhood are central to our model and are used to define the clusters of neighboring 1s. When protein sequences are binarized, our model achieves up to 60% correct prediction of secondary structure using the protein sequence information alone (no alignments or homology). In addition, this accuracy can be boosted up to 70% if we combine our method with machine learning to learn the binary patterns of clusters and how they map to actual secondary structure.

Project Start
Project End
Budget Start
2010-09-01
Budget End
2014-08-31
Support Year
Fiscal Year
2010
Total Cost
$200,000
Indirect Cost
Name
CUNY Hunter College
Department
Type
DUNS #
City
New York
State
NY
Country
United States
Zip Code
10065