This project targets the development and application of tools for efficient computation of distributions associated with patterns and statistics in random sequences, both realizations of observation sequences as well as hidden state sequences conditional on observed data. Due to the massive size of data sets that are available, computational methods that are not only accurate but also efficient are important. The proposed research seeks to contribute in this area. Minimal deterministic finite automata, probability generating functions, the sum-product algorithm and matrix-vector updates are the primary tools used to form algorithms for efficiently computing distributions of patterns and statistics. The goals of this research are threefold: (i) to further develop efficient methods for quantifying uncertainty in statistics of hidden state sequences of probabilistic graphical models; (ii) to efficiently compute exact probability distributions of complex patterns that have not previously been computed; (iii) to apply the probabilistic tools of (i) and (ii) to statistical tests and data analysis. The quantification of uncertainty in statistics of hidden states is frequently dealt with by determining the state sequence that is optimal for a particular criterion, with the statistic then evaluated from the optimal state sequence in a deterministic fashion. However, that approach does not account for uncertainty in the states. An alternate approach is to sample from the conditional distribution of states given the observed data, and then approximate the distribution of the statistic empirically. However, many samples are needed so that the approximation is accurate, leading to problems with scalability. We give a way to compute exact distributions in an efficient manner. The goals of the project are integrated, in that distributions of patterns and statistics are needed for statistical inference in applications, and in turn those applications drive the need for computing distributions in increasingly complex situations. Objectives of the work include computing a model-based distribution of prediction error rates for protein-protein interactions, computing exact distributions of coverage of spaced seeds for homology searches in DNA sequences, and computing the exact distribution of the one-dimensional scan statistic for multi-state higher-order Markovian trials.

The need for distributional properties associated with patterns and statistics in random sequences arises in many practical fields of study with massive data sets, such as bioinformatics, time series, information theory, economics, data mining, and quality control. In this research computational tools are developed for computing such distributions. Results for distributions of patterns and statistics may be applied to many practical problems, such as detecting genes, promoters, or other functionally significant patterns in DNA sequences, determining probabilities related to classifications of observations in health-related studies, change points that indicate new regimes in economic data, patterns that indicate an intrusion in a computer system, or patterns associated with surveillance work. The theory will be used to compute distributions of statistics that are intractable by combinatorial or other means, and to provide exact probabilities in an efficient manner in situations that are typically handled by simulation of many data sets. Thus this research facilitates new scientific studies that rely on results for distributions associated with patterns or statistics that have not been computed to date.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Type
Standard Grant (Standard)
Application #
1107084
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2011-08-01
Budget End
2014-07-31
Support Year
Fiscal Year
2011
Total Cost
$100,000
Indirect Cost
Name
North Carolina State University Raleigh
Department
Type
DUNS #
City
Raleigh
State
NC
Country
United States
Zip Code
27695