Next Generation Sequencing (NGS) has become the most widely used high-throughput technology in biology. Today, NGS applications go far beyond genome sequencing and studies of the DNA sequence itself to include the measurement of quantitative and dynamic outcomes underlying genomic function in development and disease. These measurements, specifically RNA abundance, protein binding, DNA methylation, and microbiome composition, are at the core of studies undertaken by large consortia and individual labs alike. However, when measuring these quantitative outcomes, NGS data are subject to severe technological and biological biases, systematic errors, and unforeseen variability, which can greatly impact downstream analyses. Only when these issues can be readily identified and addressed will the technology maximally benefit science and medicine. Our group has extensive experience developing statistical methods that transform raw high-throughput data into the final measurements relied upon by biologists and clinicians. Our gene expression array preprocessing methods are practically an industry standard, and our recent work on NGS applications is widely cited and used. Furthermore, Dr. Irizarry co-leads the Bioconductor project, one of the most widely used open-source projects for the development and dissemination of state-of-the-art statistical methodology. We propose to continue to leverage our experience with high-throughput technologies to develop indispensable analysis tools for NGS data in four critical, widely used applications that urgently require reliable statistical analysis tools. At the core of our methods is the common need, across these four applications, to overcome bias, systematic error, and unforeseen variability. To aid in the development and assessment of these tools, we propose experiments specifically designed to serve as benchmarks. These problems are well matched to our specific expertise, and we will address them with the following aims.
1) Develop statistical methods for RNA transcript estimation that are robust to sequencing artifacts.
2) Develop statistical methods that estimate heterogeneous cell composition in DNA methylation data.
3) Develop statistical methods for unbiased quantification in microbial community 16S rRNA gene sequencing studies.
4) Develop methods that account for protocol-induced bias in genome-wide enrichment scans (e.g., ChIP-seq and DNase I-seq).

Public Health Relevance

Just as the invention of the microscope led to important discoveries that greatly improved our quality of life, the ability to measure biological entities never before seen, and to understand their functional role in development and disease, can have a great impact on human health. But just as the microscope had to be focused, next generation sequencing data need to be properly analyzed. We propose to leverage our extensive experience with high-throughput data to develop statistical solutions and software for the four widely used applications of NGS technology that most urgently need them.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG005220-06
Application #: 9032509
Study Section: Special Emphasis Panel (ZRG1-GGG-L (03))
Program Officer: Brooks, Lisa
Project Start: 2010-08-11
Project End: 2019-02-28
Budget Start: 2016-03-01
Budget End: 2017-02-28
Support Year: 6
Fiscal Year: 2016
Total Cost: $600,000
Indirect Cost: $139,819
Name: Dana-Farber Cancer Institute
Department:
Type:
DUNS #: 076580745
City: Boston
State: MA
Country: United States
Zip Code: 02215
Takeda, David Y; Spisák, Sándor; Seo, Ji-Heui et al. (2018) A Somatically Acquired Enhancer of the Androgen Receptor Is a Noncoding Driver in Advanced Prostate Cancer. Cell 174:422-432.e13
Kumar, M Senthil; Slud, Eric V; Okrah, Kwame et al. (2018) Analysis and correction of compositional bias in sparse sequencing count data. BMC Genomics 19:799
Nazario-Toole, Ashley E; Robalino, Javier; Okrah, Kwame et al. (2018) The Splicing Factor RNA-Binding Fox Protein 1 Mediates the Cellular Immune Response in Drosophila melanogaster. J Immunol 201:1154-1164
Shukla, Chinmay J; McCorkindale, Alexandra L; Gerhardinger, Chiara et al. (2018) High-throughput identification of RNA nuclear enrichment sequences. EMBO J 37:
Wu, Gang; Ruben, Marc D; Schmidt, Robert E et al. (2018) Population-level rhythms in human skin with implications for circadian medicine. Proc Natl Acad Sci U S A 115:12313-12318
Hicks, Stephanie C; Townes, F William; Teng, Mingxiang et al. (2018) Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19:562-578
McIver, Lauren J; Abu-Ali, Galeb; Franzosa, Eric A et al. (2018) bioBakery: a meta'omic analysis environment. Bioinformatics 34:1235-1237
Patro, Rob; Duggal, Geet; Love, Michael I et al. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417-419
Nakayama, Robert T; Pulice, John L; Valencia, Alfredo M et al. (2017) SMARCB1 is required for widespread BAF complex-mediated activation of enhancers and bivalent promoters. Nat Genet 49:1613-1623
Teng, Mingxiang; Irizarry, Rafael A (2017) Accounting for GC-content bias reduces systematic errors and batch effects in ChIP-seq data. Genome Res 27:1930-1938

Showing the most recent 10 out of 53 publications