Next-generation sequencing technologies can produce tens of millions of sequence reads in each instrument run, and are quickly being applied in diverse types of experiments (e.g., RNA-seq, miRNA-seq, ChIP-seq, BS-seq, CNV-seq) to address biomedical questions by cost-effectively generating genome-wide datasets. While sequencing has been promoted as overcoming longstanding limitations of microarray-based studies, its data files are much larger than those from microarrays, and its diverse data types raise familiar as well as novel statistical and computational challenges. There is a pressing need for statistical and computational tools to address what leaders in the field have stated are the largest problems: data analysis and data integration. We propose to develop a comprehensive and coordinated set of statistical methods for high throughput sequencing (HTS) that directly address many important data analysis problems in epigenomics. Specifically, we plan to address the following computational and statistical challenges facing researchers conducting HTS experiments: (1) develop sensitive statistical methods for the analysis of ChIP-seq data for both single- and paired-end-tag runs, focusing particularly on applications in genome-wide profiling of nucleosome positions; (2) develop statistical methods for the analysis of BS-seq data, producing base-level DNA methylation profiles; and (3) develop new statistical tools and methods for data integration in order to gain new biological insights about global transcription and regulation. We also plan to apply these approaches to a variety of high throughput sequencing data sets to demonstrate the relevance and utility of our methods.
We plan to work with stimulated STAT1 and STAT3 data, and with data from the ETS transcription factor family and its cofactors, for which we have already gathered significant data through our collaborations, including transcription factors, histone marks, DNase I hypersensitivity and gene expression.

Public Health Relevance

We propose to develop a comprehensive and coordinated set of statistical methods for high throughput sequencing (HTS) that directly address many important data analysis problems in epigenomics. In particular, we plan to integrate data from multiple sources, including expression, transcription factor binding, nucleosome positioning, histone marks and DNA methylation, to better understand the mechanisms that regulate the behavior of a cell. Much of our proposal involves not just the development of new statistical and computational methods, but also the design, implementation and delivery of software tools that support these ideas. The many useful applications of next-generation sequencing will ensure that our well-developed methods will have a broad impact in molecular biology, specifically in transcription regulation, chromatin dynamics, development, and cancer.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG005692-04
Application #: 8451427
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Pazin, Michael J
Project Start: 2010-06-01
Project End: 2015-02-28
Budget Start: 2013-03-01
Budget End: 2014-02-28
Support Year: 4
Fiscal Year: 2013
Total Cost: $321,060
Indirect Cost: $75,434
Name: Boston University
Department: Internal Medicine/Medicine
Type: Schools of Medicine
DUNS #: 604483045
City: Boston
State: MA
Country: United States
Zip Code: 02118
Byrd, Allyson L; Perez-Rogers, Joseph F; Manimaran, Solaiappan et al. (2014) Clinical PathoScope: rapid alignment and filtration for accurate pathogen identification in clinical samples using unassembled sequencing data. BMC Bioinformatics 15:262
Fujimoto, M; Bodily, Paul M; Okuda, Nozomu et al. (2014) Effects of error-correction of heterozygous next-generation sequencing data. BMC Bioinformatics 15 Suppl 7:S3
Piccolo, Stephen R; Withers, Michelle R; Francis, Owen E et al. (2013) Multiplatform single-sample estimates of transcriptional activation. Proc Natl Acad Sci U S A 110:17778-83
Woo, Sangsoon; Zhang, Xuekui; Sauteraud, Renan et al. (2013) PING 2.0: an R/Bioconductor package for nucleosome positioning using next-generation sequencing data. Bioinformatics 29:2049-50
Imholte, Greg C; Scott-Boyer, Marie-Pier; Labbe, Aurelie et al. (2013) iBMQ: a R/Bioconductor package for integrated Bayesian modeling of eQTL data. Bioinformatics 29:2797-8
Tennant, B R; Robertson, A G; Kramer, M et al. (2013) Identification and analysis of murine pancreatic islet enhancers. Diabetologia 56:542-52
Francis, Owen E; Bendall, Matthew; Manimaran, Solaiappan et al. (2013) Pathoscope: species identification and strain attribution with unassembled sequencing data. Genome Res 23:1721-9
Zhang, Xuekui; Robertson, Gordon; Woo, Sangsoon et al. (2012) Probabilistic inference for nucleosome positioning with MNase-based or sonicated short-read data. PLoS One 7:e32095
Lyon, Gholson J; Jiang, Tao; Van Wijk, Richard et al. (2011) Exome sequencing and unrelated findings in the context of complex disease research: ethical and clinical implications. Discov Med 12:41-55
Rope, Alan F; Wang, Kai; Evjenth, Rune et al. (2011) Using VAAST to identify an X-linked disorder resulting in lethality in male infants due to N-terminal acetyltransferase deficiency. Am J Hum Genet 89:28-43

Showing the most recent 10 out of 11 publications