Next-generation sequencing technologies can produce tens of millions of sequence reads in each instrument run and are quickly being applied in diverse types of experiments (e.g. RNA-Seq, miRNA-Seq, ChIP-Seq, BS-seq, CNV-Seq) to address biomedical questions by cost-effectively generating genome-wide datasets. While sequencing has been promoted as overcoming longstanding limitations of microarray-based studies, its data files are much larger than those from microarrays, and its diverse data types raise challenges similar to those of microarrays as well as novel statistical and computational ones. There is a pressing need for statistical and computational tools to address what leaders in the field have stated are the largest problems: data analysis and data integration. We propose to develop a comprehensive and coordinated set of statistical methods for high-throughput sequencing (HTS) that directly address many important data analysis problems in epigenomics. Specifically, we plan to address the following computational and statistical challenges facing researchers conducting HTS experiments: 1) develop sensitive statistical methods for the analysis of ChIP-seq data for both single- and paired-end-tag runs, focusing particularly on applications in genome-wide profiling of nucleosome positions; 2) develop statistical methods for the analysis of BS-seq data, producing base-level DNA methylation profiles; and 3) develop new statistical tools and methods for data integration in order to gain new biological insights about global transcription and regulation. We also plan to apply these approaches to a variety of high-throughput sequencing data sets to demonstrate the relevance and utility of our methods.
We plan to work with data on stimulated STAT1 and STAT3, and with data on the ETS transcription factor family and its cofactors. Through our collaborations we have already gathered significant data for these factors, including transcription factor binding, histone marks, DNase I hypersensitivity and gene expression.
We propose to develop a comprehensive and coordinated set of statistical methods for high-throughput sequencing (HTS) that directly address many important data analysis problems in epigenomics. In particular, we plan to integrate data from multiple sources, including expression, transcription factor binding, nucleosome positioning, histone marks and DNA methylation, to better understand the mechanisms that regulate the behavior of a cell. Much of our proposal involves not just the development of new statistical and computational methods, but also the design, implementation and delivery of software tools that support these ideas. The many useful applications of next-generation sequencing will ensure that our well-developed methods have a broad impact in molecular biology, specifically in transcription regulation, chromatin dynamics, development, and cancer.