The PI aims to develop new tools for the systematic analysis of patterns in random sequences with an arbitrary correlation structure. The project considers theoretical and computational aspects associated with possibly non-regular patterns in non-Markovian sequences. Previous work by the PI has shown the existence and uniqueness of optimal Markovian structures to keep track of the number of matches with a given pattern in a random sequence. At a theoretical level, the project aims to identify ergodic Markovian structures in embeddings that result in non-ergodic behavior due to small perturbations, and to use these ergodic structures to characterize the asymptotic distribution of the number of matches with a pattern in a non-Markovian sequence. It also compares the entropy of the optimal Markovian embedding of a non-Markovian sequence with that of the original sequence. At a computational level, the PI aims to apply generating function techniques in conjunction with Markovian embeddings to characterize the joint asymptotic distribution of the number of matches with several patterns. The PI also aims to quantify the error in approximating the number of matches with a pattern in a sequence of a moderate size with that of a Binomial distribution. A secondary goal of the project is to develop a method for the symbolic specification of branching processes that will represent the solutions of non-linear o.d.e.'s probabilistically and to explore new connections between these and the analysis of patterns in non-Markovian sequences.

The analysis of patterns in random sequences lies at the core of several emerging fields such as computational biology, security systems, speech recognition and text mining. Random sequences of characters are used to model diverse phenomena ranging from written text to genomic sequences to audit files. In these, exceptional patterns, i.e., over- or under-represented words, may provide a great deal of insight or knowledge. For instance, a widely used heuristic is that overrepresented patterns in DNA may be key for gene expression whereas underrepresented patterns may interfere with this process. As another example, hackers leave traces of their intrusions into secured databases in audit files and unusual patterns may be used to warn of potential security breaches. However, one cannot assess how truly exceptional a pattern is without a bona fide statistical model of the text in which it is immersed. The prevalent models cannot accommodate the long-range correlations present in most types of text. For example, the characters occurring in written text or audit files follow syntax rules and, in RNA sequences, palindromic structures induced by base-pairing convey genome-wide correlations. Unfortunately, the available techniques to assess how exceptional a pattern is, cannot systematically handle the more realistic models. Furthermore, the PI has recently shown that the widely used paradigm that the number of matches with a pattern in a long text is approximately Gaussian distributed does not necessarily apply when long-range correlations are present. Due to these considerations the PI aims to develop new qualitative and quantitative tools to address systematically the occurrence of highly complex patterns under more realistic statistical models of text.

Project Report

This project addressed various aspects related to non-Markovianity i.e. sequences that evolve with strong correlations on previous values. Such sequences lie at the core of many applied fields, including computational biology, computer security and even speech recognition. Non-Markovian sequences are very challenging to analyze and have virtually no principles in comparison to the vast theory behind Markovian sequences. The main findings of this project are a first attempt to identify general principles and computational methods to harness non-Markovianity usually via embedding techniques. The project provided a first advance toward a 10-year old conjecture on the asymptotic Normality of hidden patterns in non-Markovian sequences generated by holomorphic dynamical sources [4]. Hidden patterns are important in computational biology to characterize the secondary structure of certain RNA sequences. It also identified a broad class of non-Markovian models where pattern problems are amenable to Markovian embeddings [7], a technique that uses sojourn-times in auxiliary Markov chains to characterize the distribution of matches with regular patterns in random sequences. Because of the later, the project also focused on methods to approximate the distribution of sojourn-times in Markov chains over Polish state spaces, particularly in the regime where the well-known Gaussian and compound Poisson approximations are expected to perform poorly [6]. The coupling ideas underlying the later led to the unexpected finding that Doeblin’s ergodicity coefficient is sub-multiplicative, which in turn led to a first principles proof of Doeblin’s characterization of the weak ergodicity of non-homogeneous finite Markov chains [3,6]. It was also shown that associated with any additive functional of a non-Markovian sequence there is an essentially unique and "optimal" Markovian embedding (i.e. with the "least" number of states) to represent the distribution of the functional in terms of another additive functional but of a first-order homogeneous Markov chain [7]. The findings of the project are likely to have an impact in areas such as RNA-biology, genomics and metagenomics. In fact, the above findings led to approximation of p-values associated with the dopamine motif in Caenorhabditis elegans [6]. The project also identified a non-Markovian model that has the potential to describe non-coding regions in hyperthermophiles with an elevated GC-content, and under which the asymptotic frequency of matches with a certain regular pattern is non-Gaussian [7]. More synergistic activities of the project led to evidence for the neutral network hypothesis in RNA (which states that functional genomic sequences may easily mutate to sequences with additional biological functions) [2], to predictions of coverage of a sample from a discrete distribution [1], and estimators of the dissimilarity of two discrete distributions i.e. the probability that a draw from one distribution is not observed in a random sample from the other [5]. The later revealed meaningful similarities and differences between various microbial environments sampled using V35 16S rRNA data from Human Microbiome Project, and may find applications in SELEX experiments—a laboratory technique in which an initial pool of synthesized random RNA sequences is repeatedly screened to yield a pool containing only sequences with given biological functions—where the dissimilarity of the pools is now related to the probability of finding binding site solutions in one pool that are absent in the other. The project trained and partially supported three graduate students, one of which obtained a M.S. and another a Ph.D. in Applied Mathematics. It also aided these students to participate in a Summer School in Applied Probability in Canada, and in international conferences in Austria and Poland. The project allowed the PI to travel and collaborate on the project with experts in France, and to participate of various national and international conferences, including the 21-st and 22-nd International Meeting on Probabilistic, Combinatorial and Asymptotic Methods for the Analysis of Algorithms in Austria and Poland, respectively, the 2012 International Workshop on Applied Probability in Israel, and the 2012 RNA Workshop in Spain. Publications. [1] Prediction of unseeing proportions in urn models with restricted sampling. M.E. Lladser. Proceedings of the 2009 Analysis of Algorithms and Combinatorics workshop, pp. 85-91 (2009) [2] Natural and artificial RNAs occupy the same restricted region of sequence space. R. Kennedy, M.E. Lladser, Z. Wu, C. Zhang, M. Yarus, H. De Sterck, R. Knight. RNA Journal, 16(2): 280–289 (2010) [3] Occupancy distributions in Markov chains via Doeblin's ergodicity coefficient. S. Chestnut, M.E. Lladser. Discrete Mathematics and Theoretical Computer Science Proceedings, AM:79-92 (2010) [4] Toward the asymptotic count of bi-modular hidden patterns under probabilistic dynamical sources: a case study. L. Lhothe, M.E. Lladser. Discrete Mathematics and Theoretical Computer Science Proceedings, AQ: 425-452 (2012) [5] Estimation of Distribution Overlap of Urn Models. J. Hampton, M.E. Lladser. PloS ONE, 7(11): e42368 (2012) [6] Approximation of sojourn-times via maximal couplings: motif frequency distributions. M.E. Lladser, S. Chestnut. J. Math. Biol. DOI 10.1007/s00285-013-0690-6 (2013) [7] Markovian embeddings of non-Markovian sequences: statistics of additive functionals. M.E. Lladser (submitted)

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Application #
0805950
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2008-07-01
Budget End
2013-06-30
Support Year
Fiscal Year
2008
Total Cost
$180,000
Indirect Cost
Name
University of Colorado at Boulder
Department
Type
DUNS #
City
Boulder
State
CO
Country
United States
Zip Code
80309