Overcoming bias and unwanted variability in next generation sequencing

Irizarry, Rafael

Abstract

Next Generation Sequencing (NGS) has become the most widely used high-throughput technology in biology. Today, NGS applications go far beyond genome sequencing and studies of DNA sequence itself to include the measurement of quantitative and dynamic outcomes underlying genomic function in development and disease. These measurements, specifically RNA abundance, protein binding, DNA methylation, and microbiome composition, are at the core of studies undertaken by large consortia and individual labs alike. However, when measuring these quantitative outcomes, NGS data are subject to severe technological and biological biases, systematic errors, and unforeseen variability that can greatly impact downstream analyses. Only when these issues can be readily identified and addressed will the technology maximally benefit science and medicine. Our group has extensive experience developing statistical methods that transform raw high-throughput data into the ultimate measurements relied upon by biologists and clinicians. Our gene expression array preprocessing methods are practically an industry standard, and our recent work on NGS applications is widely cited and used. Furthermore, Dr. Irizarry co-leads the Bioconductor project, one of the most widely used open-source projects for the development and dissemination of state-of-the-art statistical methodology. We propose to continue to leverage our experience with high-throughput technologies to develop indispensable analysis tools for NGS data in four critical, widely used applications urgently requiring reliable statistical analysis tools. At the core of our methods is the common need, across these four applications, to overcome bias, systematic error, and unforeseen variability. To aid in the development and assessment of these tools, we propose experiments specifically designed to serve as benchmarks. These problems are well matched to our specific expertise, and we will address them with the following aims.
1) Develop statistical methods for RNA transcript estimation that are robust to sequencing artifacts.
2) Develop statistical methods that estimate heterogeneous cell composition in DNA methylation data.
3) Develop statistical methods for unbiased quantification in microbial community 16S rRNA gene sequencing studies.
4) Develop methods that account for protocol-induced bias in genome-wide enrichment scans (e.g., ChIP-seq and DNase I-seq).

Public Health Relevance

Just as the invention of the microscope led to important discoveries that greatly improved our quality of life, the ability to measure biological entities never before seen and to understand their functional role in development and disease can have a great impact on human health. But just as the microscope had to be focused, next generation sequencing data need to be properly analyzed. Our proposal is to leverage our extensive experience with high-throughput data to develop statistical solutions and software for the four widely used applications of NGS technology that most urgently need them.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 2R01HG005220-05
Application #: 8818414
Study Section: Special Emphasis Panel (ZRG1-GGG-L (03))
Program Officer: Brooks, Lisa

Project Start: 2010-08-11
Project End: 2019-02-28
Budget Start: 2015-03-10
Budget End: 2016-02-29
Support Year: 5
Fiscal Year: 2015
Total Cost: $600,000
Indirect Cost: $121,464

Institution

Name: Dana-Farber Cancer Institute
DUNS #: 076580745

City: Boston
State: MA
Country: United States
Zip Code: 02215

Related projects


NIH 2018 R01 HG	Overcoming bias and unwanted variability in next generation sequencing Irizarry, Rafael Angel / Dana-Farber Cancer Institute
NIH 2017 R01 HG	Overcoming bias and unwanted variability in next generation sequencing Irizarry, Rafael Angel / Dana-Farber Cancer Institute	$540,000
NIH 2016 R01 HG	Overcoming bias and unwanted variability in next generation sequencing Irizarry, Rafael Angel / Dana-Farber Cancer Institute	$600,000
NIH 2015 R01 HG	Overcoming bias and unwanted variability in next generation sequencing Irizarry, Rafael Angel / Dana-Farber Cancer Institute	$600,000
NIH 2012 R01 HG	Analysis Tools and Software for Second Generation Sequencing Data Irizarry, Rafael Angel / Johns Hopkins University	$405,900
NIH 2012 R01 HG	Analysis Tools and Software for Second Generation Sequencing Data Irizarry, Rafael Angel / Dana-Farber Cancer Institute	$83,810
NIH 2011 R01 HG	Analysis Tools and Software for Second Generation Sequencing Data Irizarry, Rafael Angel / Johns Hopkins University	$405,900
NIH 2010 R01 HG	Analysis Tools and Software for Second Generation Sequencing Data Irizarry, Rafael Angel / Johns Hopkins University	$410,000

Publications

Takeda, David Y; Spisák, Sándor; Seo, Ji-Heui et al. (2018) A Somatically Acquired Enhancer of the Androgen Receptor Is a Noncoding Driver in Advanced Prostate Cancer. Cell 174:422-432.e13

Kumar, M Senthil; Slud, Eric V; Okrah, Kwame et al. (2018) Analysis and correction of compositional bias in sparse sequencing count data. BMC Genomics 19:799

Nazario-Toole, Ashley E; Robalino, Javier; Okrah, Kwame et al. (2018) The Splicing Factor RNA-Binding Fox Protein 1 Mediates the Cellular Immune Response in Drosophila melanogaster. J Immunol 201:1154-1164

Shukla, Chinmay J; McCorkindale, Alexandra L; Gerhardinger, Chiara et al. (2018) High-throughput identification of RNA nuclear enrichment sequences. EMBO J 37:

Wu, Gang; Ruben, Marc D; Schmidt, Robert E et al. (2018) Population-level rhythms in human skin with implications for circadian medicine. Proc Natl Acad Sci U S A 115:12313-12318

Hicks, Stephanie C; Townes, F William; Teng, Mingxiang et al. (2018) Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19:562-578

McIver, Lauren J; Abu-Ali, Galeb; Franzosa, Eric A et al. (2018) bioBakery: a meta'omic analysis environment. Bioinformatics 34:1235-1237

Sinha, Rashmi; Abu-Ali, Galeb; Vogtmann, Emily et al. (2017) Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nat Biotechnol 35:1077-1086

Parker, Margaret M; Chase, Robert P; Lamb, Andrew et al. (2017) RNA sequencing identifies novel non-coding RNA and exon-specific effects associated with cigarette smoking. BMC Med Genomics 10:58

Patro, Rob; Duggal, Geet; Love, Michael I et al. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417-419

Showing the most recent 10 out of 53 publications
