The principles of digital signal processing have been applied to genomics and proteomics by a number of researchers in the past few years. Digital filtering and hidden Markov model (HMM) techniques have been used to identify protein-coding genes in DNA. More recently, attention has turned to noncoding genes, and it is now recognized that many types of noncoding RNA (ncRNA) play a major role in living organisms. Computational identification of such RNA has therefore become very important.

These RNAs are transcribed from DNA but do not code for proteins. Instead, they fold into secondary structures and perform their biological function by virtue of these structures. This is what makes the computational identification of ncRNAs so challenging: it is the secondary folding structure that must be identified, rather than the primary sequence alone. Such identification cannot be done with conventional HMMs, because they correspond to regular grammars, which cannot represent the long-range pairwise correlations between bases that define most secondary structures. The theory of context-sensitive HMMs (csHMMs) was recently developed towards this goal, and there is strong evidence that such HMMs have great potential to identify very complicated secondary structures found in living organisms. A detailed exploration of this idea is therefore extremely timely, and it is the main goal of the proposed research.

For simple RNA structures such as stem-loops and tRNA cloverleaf structures, csHMM-based algorithms have recently been developed. The proposed work will develop algorithms for solving the alignment, scoring, and training problems of csHMMs for more complex correlations. For these algorithms to be useful in biology, they will also be tested extensively on a large variety of documented sequences. Fast algorithms will be developed for finding the optimal state sequence of an observed symbol sequence and for training profile-csHMMs. Recent results show that many ncRNAs play important roles in diverse gene regulatory networks. To build a more realistic gene regulatory network, it is crucial to incorporate ncRNA genes into the network. This is another important aspect of the proposed research.
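The kind of correlation that a secondary structure imposes can be illustrated with a minimal sketch. The Python function below is not a csHMM and is not one of the algorithms proposed here; it only counts how many consecutive base pairs the two ends of an RNA string can form, which is the defining property of a simple stem-loop. The name `stem_length`, the `min_loop` parameter, and the decision to allow G-U wobble pairs are assumptions made for illustration.

```python
# Illustrative sketch only: a hypothetical helper, not a csHMM algorithm.
# Assumption: Watson-Crick pairs plus the G-U wobble pair are allowed in a stem.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def stem_length(seq: str, min_loop: int = 3) -> int:
    """Count consecutive base pairs formed between the two ends of seq.

    Pairing proceeds inward from both ends and stops at the first mismatch,
    or when fewer than min_loop bases would remain between the paired bases.
    """
    i, j = 0, len(seq) - 1
    paired = 0
    # j - i - 1 is the number of bases enclosed by positions i and j.
    while j - i - 1 >= min_loop and (seq[i], seq[j]) in PAIRS:
        paired += 1
        i += 1
        j -= 1
    return paired


if __name__ == "__main__":
    print(stem_length("GGGAAAUCCC"))  # three G-C pairs around the loop
    print(stem_length("AAAAAAAAAA"))  # no complementary ends
```

Whether position j is acceptable depends on the base seen at a distant position i, so the test cannot be made by looking only at the current and previous symbols. This is the dependence that a conventional HMM cannot express and that csHMMs are intended to model.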

The research is exploratory and unconventional, and lies at the crossroads of cutting-edge signal processing theory and modern bioinformatics. Its intellectual merit comes from the fact that a deep understanding of the theory of context-sensitive hidden Markov models will be developed and applied to a practically interesting problem in molecular biology. The impact will be both in theoretical signal processing and in biology, where noncoding genes have been shown to be of great interest in medicine and gene regulation.

As for broader impact, the proposed research is expected to lead to publications in international scholarly journals. Examples include Bioinformatics, BMC Bioinformatics, Proceedings of the National Academy of Sciences, Nature Biotechnology, and IEEE/ACM Transactions on Computational Biology and Bioinformatics. Many conference presentations will also emerge, as will tutorial articles at various levels of depth and difficulty. Some of the work will be incorporated into the graduate curriculum at the California Institute of Technology (Caltech).

Project Start:
Project End:
Budget Start: 2006-09-01
Budget End: 2008-02-29
Support Year:
Fiscal Year: 2006
Total Cost: $89,527
Indirect Cost:
Name: California Institute of Technology
Department:
Type:
DUNS #:
City: Pasadena
State: CA
Country: United States
Zip Code: 91125