Statistical methods based on pattern counting have been used in many computational biology problems, including: a) identification of transcription factor binding sites (TFBS) or cis-regulatory modules, b) comparison of genomic sequences and evolutionary studies, and c) comparison of metagenomic communities. Many statistics have been developed to achieve these objectives. However, studies of the properties of these statistics, e.g. power, have lagged behind. In addition, pattern counting based methods should be very useful for analyzing sequence data from next-generation sequencing (NGS) technologies, e.g. ABI/SOLiD and Roche 454 pyrosequencing, since these statistics do not require sequence assembly, a challenging problem in NGS. However, the available pattern counting statistics cannot be readily applied to sequence fragment data because of the additional randomness introduced during NGS, so new statistics have to be developed and studied. We recently studied the power of detecting enriched patterns in one molecular sequence and of detecting relationships between two sequences using pattern counting. Building on the results of these studies, we will pursue the following aims.
In Aim 1, we will study statistics for detecting enriched patterns. 1a) Extend the power study of detecting enriched patterns to more realistic background sequences when cis-regulatory modules are present and to regulatory sequences from multiple organisms. 1b) Design and study new statistics for detecting enriched patterns based on ChIP-Seq data from multiple organisms.
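To make the idea of an "enriched pattern" concrete, the following minimal Python sketch counts occurrences of a word in a sequence and standardizes the count against an i.i.d. background whose base frequencies are estimated from the same sequence. It is an illustration only, not one of the statistics proposed here: the function names `count_overlapping` and `enrichment_z` are hypothetical, and the variance uses a simple binomial approximation that ignores the dependence between overlapping occurrences.

```python
from math import sqrt


def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of `word` in `seq`, allowing overlaps."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def enrichment_z(seq: str, word: str) -> float:
    """Standardized count of `word` under an i.i.d. background.

    Assumption (for illustration only): base frequencies are estimated
    from `seq` itself, and the variance is the binomial n*p*(1-p), which
    ignores correlations between overlapping word occurrences.
    """
    n = len(seq) - len(word) + 1          # number of candidate positions
    if n <= 0:
        return 0.0
    freqs = {b: seq.count(b) / len(seq) for b in set(seq)}
    p = 1.0
    for b in word:                        # probability of the word at one position
        p *= freqs.get(b, 0.0)
    expected = n * p
    variance = n * p * (1.0 - p)
    if variance == 0.0:
        return 0.0
    return (count_overlapping(seq, word) - expected) / sqrt(variance)
```

A large positive value suggests that the word occurs more often than the background model predicts. More realistic backgrounds, such as Markov models, would change both the expected count and the variance.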
In Aim 2, we will develop alignment-free statistics to study the relationships between organisms. 2a) Extend our recent work on alignment-free sequence comparison statistics to more general evolutionary models and design new statistics for horizontal gene transfers. 2b) Design and study new alignment-free statistics for genome comparison based on short sequence reads from NGS data. The proposed projects will generate a suite of computer algorithms for power analysis of detecting enriched patterns and for alignment-free genome comparison based on whole genome data or sequence fragment data from NGS. The algorithms will be disseminated through the web, and R code will be deposited in the R library. The results of this study will be important for detecting motifs and cis-regulatory modules in genomic sequences and for evolutionary studies.
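As a simple illustration of alignment-free comparison, the sketch below computes the classic D2 statistic, the sum over all k-tuples of the product of their counts in two sequences. The text above does not specify which statistics will be developed, so D2 is used here only as a well-known starting point. The helper names `ktuple_counts`, `pooled_counts` and `d2` are hypothetical, and the read-based variant simply pools counts over reads without modeling sequencing randomness or reverse complements.

```python
from collections import Counter
from typing import Iterable


def ktuple_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-tuple (word of length k) in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def pooled_counts(reads: Iterable[str], k: int) -> Counter:
    """Pool k-tuple counts over unassembled reads (no assembly step)."""
    total = Counter()
    for read in reads:
        total.update(ktuple_counts(read, k))
    return total


def d2(counts_a: Counter, counts_b: Counter) -> int:
    """D2 statistic: sum over k-tuples w of X_w * Y_w."""
    # Counter returns 0 for k-tuples absent from counts_b.
    return sum(n * counts_b[w] for w, n in counts_a.items())
```

For example, `d2(ktuple_counts(genome_a, 6), pooled_counts(reads_b, 6))` would compare a whole genome with a set of reads. In practice, raw D2 is usually centered and normalized by the expected k-tuple counts under a background model before sequences are compared.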

Public Health Relevance

The statistical power of pattern counting methods for detecting enriched patterns in one sequence and for alignment-free sequence comparison is not well understood. New statistics, efficient algorithms and user-friendly software will be developed for detecting enriched patterns and comparing genomes based on next-generation sequencing (NGS) data. These tools will be used to analyze several NGS data sets.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Exploratory/Developmental Grants (R21)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Bonazzi, Vivien
University of Southern California
Schools of Arts and Sciences
Los Angeles
United States
Song, Kai; Ren, Jie; Reinert, Gesine et al. (2014) New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform 15:343-53
Wang, Ying; Liu, Lin; Chen, Lina et al. (2014) Comparison of metatranscriptomic samples based on k-tuple frequencies. PLoS One 9:e84348
Song, Kai; Ren, Jie; Zhai, Zhiyuan et al. (2013) Alignment-free sequence comparison based on next-generation sequencing reads. J Comput Biol 20:64-79
Ren, Jie; Song, Kai; Sun, Fengzhu et al. (2013) Multiple alignment-free sequence comparison. Bioinformatics 29:2690-8
Liu, Xuemei; Wan, Lin; Li, Jing et al. (2011) New powerful statistics for alignment-free sequence comparison under a pattern transfer model. J Theor Biol 284:106-16