Next-generation sequencing (NGS) has become the most widely used high-throughput technology in biology. Today, NGS applications go far beyond genome sequencing and studies of the DNA sequence itself to include the measurement of quantitative and dynamic outcomes underlying genomic function in development and disease. These measurements, specifically RNA abundance, protein binding, DNA methylation, and microbiome composition, are at the core of studies undertaken by large consortia and individual labs alike. However, when measuring these quantitative outcomes, NGS data are subject to severe technological and biological biases, systematic errors, and unforeseen variability, which can greatly impact downstream analyses. Only when these issues can be readily identified and addressed will the technology maximally benefit science and medicine. Our group has extensive experience developing statistical methods that transform raw high-throughput data into the ultimate measurements relied upon by biologists and clinicians. Our gene expression array preprocessing methods are practically an industry standard, and our recent work on NGS applications is widely cited and used. Furthermore, Dr. Irizarry co-leads the Bioconductor project, one of the most widely used open-source projects for the development and dissemination of state-of-the-art statistical methodology. We propose to continue to leverage our experience with high-throughput technologies to develop indispensable analysis tools for NGS data in four critical, widely used applications that urgently require reliable statistical analysis. At the core of our methods is the common need, across these four applications, to overcome bias, systematic error, and unforeseen variability. To aid in the development and assessment of these tools, we propose experiments specifically designed to serve as benchmarks. These problems are well matched to our specific expertise, and we will address them with the following aims.
1) Develop statistical methods for RNA transcript estimation that are robust to sequencing artifacts.
2) Develop statistical methods that estimate heterogeneous cell composition in DNA methylation data.
3) Develop statistical methods for unbiased quantification in microbial community 16S rRNA gene sequencing studies.
4) Develop methods that account for protocol-induced bias in genome-wide enrichment scans (e.g., ChIP-seq and DNase I-seq).

Public Health Relevance

Just as the invention of the microscope led to important discoveries that greatly improved our quality of life, the ability to measure biological entities never before seen, and to understand their functional role in development and disease, can have a great impact on human health. But just as the microscope had to be focused, next-generation sequencing data need to be properly analyzed. We propose to leverage our extensive experience with high-throughput data to develop statistical solutions and software for the four widely used applications of NGS technology that most urgently need them.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Special Emphasis Panel (ZRG1-GGG-L (03))
Program Officer
Brooks, Lisa
Dana-Farber Cancer Institute
United States
Kumar, M Senthil; Slud, Eric V; Okrah, Kwame et al. (2018) Analysis and correction of compositional bias in sparse sequencing count data. BMC Genomics 19:799
Nazario-Toole, Ashley E; Robalino, Javier; Okrah, Kwame et al. (2018) The Splicing Factor RNA-Binding Fox Protein 1 Mediates the Cellular Immune Response in Drosophila melanogaster. J Immunol 201:1154-1164
Shukla, Chinmay J; McCorkindale, Alexandra L; Gerhardinger, Chiara et al. (2018) High-throughput identification of RNA nuclear enrichment sequences. EMBO J 37:
Wu, Gang; Ruben, Marc D; Schmidt, Robert E et al. (2018) Population-level rhythms in human skin with implications for circadian medicine. Proc Natl Acad Sci U S A 115:12313-12318
Hicks, Stephanie C; Townes, F William; Teng, Mingxiang et al. (2018) Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19:562-578
McIver, Lauren J; Abu-Ali, Galeb; Franzosa, Eric A et al. (2018) bioBakery: a meta'omic analysis environment. Bioinformatics 34:1235-1237
Takeda, David Y; Spisák, Sándor; Seo, Ji-Heui et al. (2018) A Somatically Acquired Enhancer of the Androgen Receptor Is a Noncoding Driver in Advanced Prostate Cancer. Cell 174:422-432.e13
Sinha, Rashmi; Abu-Ali, Galeb; Vogtmann, Emily et al. (2017) Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nat Biotechnol 35:1077-1086
Parker, Margaret M; Chase, Robert P; Lamb, Andrew et al. (2017) RNA sequencing identifies novel non-coding RNA and exon-specific effects associated with cigarette smoking. BMC Med Genomics 10:58
Patro, Rob; Duggal, Geet; Love, Michael I et al. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417-419

Showing the most recent 10 out of 53 publications