Statistical methods based on pattern counting have been used in many computational biology problems, including: a) identification of transcription factor binding sites (TFBS) or cis-regulatory modules, b) comparison of genomic sequences and evolutionary studies, and c) comparison of metagenomic communities. Many statistics have been developed to achieve these objectives. However, studies of the properties of these statistics, e.g. power, have lagged behind. In addition, pattern counting methods should be very useful for analyzing sequence data from next generation sequencing (NGS) technologies, e.g. ABI/SOLiD and Roche 454 pyrosequencing, since these statistics do not require sequence assembly, a challenging problem in NGS. However, the available pattern counting statistics cannot be readily applied to sequence fragment data because of the additional randomness introduced during NGS, so new statistics have to be developed and studied. We recently studied the power of detecting enriched patterns in one molecular sequence and of detecting relationships between two sequences using pattern counting. Based on the results of these studies, we will pursue the following aims.
In Aim 1, we will study statistics for detecting enriched patterns. 1a) Extend the power study of detecting enriched patterns to more realistic background sequences when cis-regulatory modules are present, and to regulatory sequences from multiple organisms. 1b) Design and study new statistics for detecting enriched patterns based on ChIP-seq data from multiple organisms.
In Aim 2, we will develop alignment-free statistics to study the relationships between organisms. 2a) Extend our recent work on alignment-free sequence comparison statistics to more general evolutionary models, and design new statistics for detecting horizontal gene transfers. 2b) Design and study new alignment-free statistics for genome comparison based on short sequence reads from NGS data.
The proposed projects will generate a suite of computer algorithms for power analysis of detecting enriched patterns and for alignment-free genome comparison based on whole genome data or sequence fragment data from NGS. The algorithms will be disseminated through the web, and R code will be deposited in the R library. The results of this study will be important for detecting motifs and cis-regulatory modules in genomic sequences and for evolutionary studies.
The statistical power of pattern counting methods for detecting enriched patterns in one sequence and for alignment-free sequence comparison is not well understood. New statistics, efficient algorithms, and user-friendly software will be developed for detecting enriched patterns and for comparing genomes based on next generation sequencing (NGS) data. These tools will be used to analyze several NGS data sets.