In this project the investigator studies the computation of distributions of patterns and statistics in sequences through auxiliary Markov chains. In this method, Markovian structure in the original sequence is exploited to associate an auxiliary Markov chain with the sequence in such a manner that an event of interest in the original sequence occurs if and only if the auxiliary Markov chain lies in a class of states that corresponds to the event. Once the auxiliary chain is set up, probabilities for the event may be computed by tracking movements through the chain and then extracting the desired probabilities. The goals of this work are threefold: (1) to compute distributions of complex patterns that have not been addressed to date; (2) to apply the probabilistic tools so developed to statistical testing and data analysis; and (3) to quantify uncertainty in statistics of labeled and segmented data modeled by probabilistic graphical models. These goals are integrated, since probabilistic approaches to computing distributions of patterns and statistics provide the mathematical tools necessary for the statistical applications of goals (2) and (3), and in turn those applications drive the need for computing distributions in increasingly complex situations. Whereas satisfying the first two goals will provide an important contribution to the literature, the major contribution of the research is represented by goal (3). The computation of sampling distributions of statistics of hidden state sequences provides a method of quantifying uncertainty in labeled and segmented data, an area that has not been adequately addressed. In cases where one is interested in inference on statistics of labeled data, a typical approach is to determine the most likely sequence of states given the observations, and then obtain the value of the statistic of interest from that state sequence.
However, whereas the most likely state sequence is optimal if one is interested in the best set of labels, it may not be optimal for inference on statistics of the labels. This work provides a novel approach to computing the exact sampling distribution of statistics of labeled data, providing a means for more accurate inference. Sensitivity of computed distributions to estimated parameters and applications to change points will also be considered.
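As an illustration of the auxiliary-chain idea described above, the following minimal sketch computes the probability that a run of at least k consecutive 1s occurs somewhere in the first n terms of a two-state, first-order Markov sequence. It is a toy instance chosen for illustration, not the project's own implementation: the function name `run_probability`, its parameters, and the choice of pattern are all assumptions. The auxiliary state is the length of the current trailing run of 1s, capped at k, and the capped state is absorbing, so the event of interest has occurred if and only if the auxiliary chain is in that state.

```python
def run_probability(n, k, init, trans):
    """Probability that a run of at least k consecutive 1s occurs in X_1..X_n.

    Hypothetical helper for illustration. The sequence is a two-state (0/1)
    first-order Markov chain with init[s] = P(X_1 = s) and
    trans[a][b] = P(X_{t+1} = b | X_t = a). Requires n >= 1 and k >= 1.
    """
    # Auxiliary state j = length of the current trailing run of 1s.
    # State k is absorbing: the pattern has already occurred.
    dist = [0.0] * (k + 1)
    dist[0] += init[0]   # first symbol is 0: trailing run has length 0
    dist[1] += init[1]   # first symbol is 1: trailing run has length 1

    # Track movement through the auxiliary chain, one symbol at a time.
    for _ in range(n - 1):
        new = [0.0] * (k + 1)
        for j, p in enumerate(dist):
            if p == 0.0:
                continue
            if j == k:
                new[k] += p              # absorbed: stay absorbed
                continue
            last = 0 if j == 0 else 1    # run length determines last symbol
            new[0] += p * trans[last][0]      # a 0 resets the run
            new[j + 1] += p * trans[last][1]  # a 1 extends the run
        dist = new

    # Extract the probability of the class of states matching the event.
    return dist[k]
```

For a fair, independent coin (`init = [0.5, 0.5]`, every row of `trans` equal to `[0.5, 0.5]`), `run_probability(3, 2, init, trans)` gives 0.375, matching the three sequences 110, 011, and 111 out of eight. The full vector `dist` also gives the joint distribution of the trailing run length and whether the pattern has occurred, which is the kind of quantity a richer auxiliary chain would track for more complex patterns.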
The need for distributional properties associated with patterns and statistics in sequences, whether realizations of data emanating from a model or hidden sequences used to label and segment observed data, arises in many practical fields of study with massive data sets, such as bioinformatics, time series analysis, information theory, economics, data mining, and quality control. In this research, computational tools are developed for computing such distributions. Results for distributions of patterns and statistics may be applied to many practical problems, such as detecting genes, promoters, or other functionally significant patterns in DNA sequences, and determining probabilities related to classification of observations in health-related studies, change points that indicate new regimes in economic data, patterns that indicate an intrusion, or patterns associated with surveillance work. The theory may be used to compute distributions of patterns in underlying sequences that are corrupted by noise or missing observations, and also distributions of statistics that are intractable by combinatorial or other means. Thus this research facilitates new scientific studies that rely on results for patterns or statistics that have not been computed to date.