Next-generation sequencing technologies can produce tens of millions of sequence reads per instrument run, and are quickly being applied in diverse types of experiments (e.g. RNA-Seq, miRNA-Seq, ChIP-Seq, BS-Seq, CNV-Seq) to address biomedical questions by cost-effectively generating genome-wide datasets. While sequencing has been promoted as overcoming longstanding limitations of microarray-based studies, its data files are much larger than those for microarrays, and its diverse data types raise similar as well as novel statistical and computational challenges. There is a pressing need for statistical and computational tools to address what leaders in the field have stated are the largest problems: data analysis and data integration. We propose to develop a comprehensive and coordinated set of statistical methods for high-throughput sequencing (HTS) that directly address many important data analysis problems in epigenomics. Specifically, we plan to address the following computational and statistical challenges facing researchers conducting HTS experiments: 1) develop sensitive statistical methods for the analysis of ChIP-Seq data for both single- and paired-end-tag runs, focusing particularly on applications in genome-wide profiling of nucleosome positions; 2) develop statistical methods for the analysis of BS-Seq data, producing base-level DNA methylation profiles; 3) develop new statistical tools and methods for data integration in order to gain new biological insights about global transcription and regulation. We also plan to apply these approaches to a variety of high-throughput sequencing data sets to demonstrate the relevance and utility of our methods.
We plan to work with stimulated STAT1 and STAT3 data, and with data from the ETS transcription factor family and its cofactors, for which we have already gathered substantial data through our collaborations, including transcription factors, histone marks, DNase I hypersensitivity and gene expression.

Public Health Relevance

We propose to develop a comprehensive and coordinated set of statistical methods for high-throughput sequencing (HTS) that directly address many important data analysis problems in epigenomics. In particular, we plan to integrate data from multiple sources, including expression, transcription factor binding, nucleosome positioning, histone marks and DNA methylation, to better understand the mechanisms that regulate the behavior of a cell. Much of our proposal involves not just the development of new statistical and computational methods, but also the design, implementation and delivery of software tools that support these ideas. The many useful applications of next-generation sequencing will ensure that our well-developed methods have a broad impact in molecular biology, specifically in transcription regulation, chromatin dynamics, development, and cancer.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Pazin, Michael J
Boston University
Internal Medicine/Medicine
Schools of Medicine
United States
Bodily, Paul M; Fujimoto, M Stanley; Snell, Quinn et al. (2016) ScaffoldScaffolder: solving contig orientation via bidirected to directed graph reduction. Bioinformatics 32:17-24
Piccolo, Stephen R; Hoffman, Laura M; Conner, Thomas et al. (2016) Integrative analyses reveal signaling pathways underlying familial breast cancer susceptibility. Mol Syst Biol 12:860
Yazdani, Neema; Parker, Clarissa C; Shen, Ying et al. (2015) Hnrnph1 Is A Quantitative Trait Gene for Methamphetamine Sensitivity. PLoS Genet 11:e1005713
Piccolo, Stephen R; Andrulis, Irene L; Cohen, Adam L et al. (2015) Gene-expression patterns in peripheral blood classify familial breast cancer susceptibility. BMC Med Genomics 8:72
Whipple, Joseph M; Youssef, Osama A; Aruscavage, P Joseph et al. (2015) Genome-wide profiling of the C. elegans dsRNAome. RNA 21:786-800
Mortenson, Jeffrey B; Heppler, Lisa N; Banks, Courtney J et al. (2015) Histone deacetylase 6 (HDAC6) promotes the pro-survival activity of 14-3-3ζ via deacetylation of lysines within the 14-3-3ζ binding pocket. J Biol Chem 290:12487-96
Hong, Changjin; Manimaran, Solaiappan; Johnson, William Evan (2014) PathoQC: Computationally Efficient Read Preprocessing and Quality Control for High-Throughput Sequencing Data Sets. Cancer Inform 13:167-76
Fujimoto, M; Bodily, Paul M; Okuda, Nozomu et al. (2014) Effects of error-correction of heterozygous next-generation sequencing data. BMC Bioinformatics 15 Suppl 7:S3
Byrd, Allyson L; Perez-Rogers, Joseph F; Manimaran, Solaiappan et al. (2014) Clinical PathoScope: rapid alignment and filtration for accurate pathogen identification in clinical samples using unassembled sequencing data. BMC Bioinformatics 15:262
Francis, Owen E; Bendall, Matthew; Manimaran, Solaiappan et al. (2013) Pathoscope: species identification and strain attribution with unassembled sequencing data. Genome Res 23:1721-9