Molecular Sequence Analysis Using Word Counts: Statistics Power and Applications

Sun, Fengzhu

Abstract

Pattern counting statistical methods have been used in many computational biology problems, including: a) identification of transcription factor binding sites (TFBS) or cis-regulatory modules, b) comparison of genomic sequences and evolutionary studies, and c) comparison of metagenomic communities. Many statistics have been developed to achieve these objectives. However, studies of the properties of these statistics, e.g. power, have lagged behind. In addition, pattern counting based methods should be very useful for the analysis of sequence data from next-generation sequencing (NGS) technologies, e.g. ABI/SOLiD and Roche 454 pyrosequencing, since these statistics do not need sequence assembly, a challenging problem in NGS. However, the available pattern counting statistics cannot be readily applied to sequence fragment data because of the additional randomness introduced during NGS, so new statistics have to be developed and studied. We recently studied the power of detecting enriched patterns in one molecular sequence and of detecting relationships between two sequences using pattern counting. Based on the results from these studies, we will achieve the following aims.
In Aim 1, we will study statistics for detecting enriched patterns. 1a) Extend the power study of detecting enriched patterns to more realistic background sequences when cis-regulatory modules are present and to regulatory sequences from multiple organisms. 1b) Design and study new statistics for detecting enriched patterns based on ChIP-Seq data from multiple organisms.
In Aim 2, we will develop alignment-free statistics to study the relationships between organisms. 2a) Extend our recent work on alignment-free sequence comparison statistics to more general evolutionary models and design new statistics for horizontal gene transfers. 2b) Design and study new alignment-free statistics for genome comparison based on short sequence reads from NGS data. The proposed projects will generate a suite of computer algorithms related to power analysis for detecting enriched pairs and alignment-free genome comparison based on whole genome data or sequence fragment data from NGS. The algorithms will be disseminated through the web and R-code will be deposited in the R-library. The results from this study will be important for detecting motifs and cis-regulatory modules in genomic sequences and for evolutionary studies.
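
As a rough illustration of what a pattern counting comparison can look like, the sketch below counts overlapping k-letter words and computes D2, a classical alignment-free statistic defined as the sum over all words of the product of their counts in the two sequences. The abstract does not name specific statistics, so the choice of D2, the function names (word_counts, read_word_counts, d2) and the simple pooling of counts across reads are assumptions made for illustration, not the project's method.

```python
from collections import Counter


def word_counts(seq, k):
    """Count overlapping k-letter words in a single sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def read_word_counts(reads, k):
    """Pool word counts over a collection of reads (no assembly needed).

    Assumption: reads are simply summed; this ignores the extra
    randomness from read sampling that the abstract highlights.
    """
    total = Counter()
    for read in reads:
        total.update(word_counts(read, k))
    return total


def d2(counts_a, counts_b):
    """D2 statistic: sum over words of count_a(word) * count_b(word)."""
    # Counter returns 0 for words missing from counts_b.
    return sum(n * counts_b[w] for w, n in counts_a.items())


genome_counts = word_counts("ACGTACG", 2)
reads_counts = read_word_counts(["ACG", "GTA"], 2)
similarity = d2(genome_counts, reads_counts)
```

Larger values of D2 indicate more shared words. In practice the raw value depends strongly on sequence length and background composition, which is one reason normalized variants and their power are studied.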

Public Health Relevance

The statistical power of pattern counting methods for detecting enriched patterns in one sequence and for alignment-free sequence comparison is not well understood. New statistics, efficient algorithms and user-friendly software will be developed for detecting enriched patterns and genome comparison based on next generation sequencing (NGS) data. These tools will be used to analyze several NGS data sets.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Exploratory/Developmental Grants (R21)
Project #: 5R21HG006199-02
Application #: 8305462
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Bonazzi, Vivien

Project Start: 2011-07-22
Project End: 2014-04-30
Budget Start: 2012-05-01
Budget End: 2014-04-30
Support Year: 2
Fiscal Year: 2012
Total Cost: $245,750
Indirect Cost: $95,750

Institution

Name: University of Southern California
Department: Biology
Type: Schools of Arts and Sciences
DUNS #: 072933393

City: Los Angeles
State: CA
Country: United States
Zip Code: 90089

Related projects


NIH 2012 R21 HG	Molecular Sequence Analysis Using Word Counts: Statistics Power and Applications Sun, Fengzhu / University of Southern California	$245,750
NIH 2011 R21 HG	Molecular Sequence Analysis Using Word Counts: Statistics Power and Applications Sun, Fengzhu / University of Southern California	$203,750

Publications

Song, Kai; Ren, Jie; Reinert, Gesine et al. (2014) New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing. Brief Bioinform 15:343-53

Wang, Ying; Liu, Lin; Chen, Lina et al. (2014) Comparison of metatranscriptomic samples based on k-tuple frequencies. PLoS One 9:e84348

Ren, Jie; Song, Kai; Sun, Fengzhu et al. (2013) Multiple alignment-free sequence comparison. Bioinformatics 29:2690-8

Song, Kai; Ren, Jie; Zhai, Zhiyuan et al. (2013) Alignment-free sequence comparison based on next-generation sequencing reads. J Comput Biol 20:64-79

Jiang, Bai; Song, Kai; Ren, Jie et al. (2012) Comparison of metagenomic samples using sequence signatures. BMC Genomics 13:730

Liu, Xuemei; Wan, Lin; Li, Jing et al. (2011) New powerful statistics for alignment-free sequence comparison under a pattern transfer model. J Theor Biol 284:106-16

Reinert, Gesine; Chew, David; Sun, Fengzhu et al. (2009) Alignment-free sequence comparison (I): statistics and power. J Comput Biol 16:1615-34
