Analysis Tools and Software for Second Generation Sequencing Data

Irizarry, Rafael

Abstract

Second-generation sequencing (sec-gen) technology is poised to radically change how genomic data is obtained and used. Capable of sequencing millions of short strands of DNA in parallel, this technology can be used to assemble complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to sequence the genomes of approximately 1,200 people. Comparative analysis at the sequence level of a large number of samples across multiple populations may be achievable within the next five years. These datasets also present unprecedented challenges in statistical analysis and data management. For example, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single nucleotide level. At this resolution, small error rates in sequencing prove significant, especially for rare variants. Furthermore, sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequencing reads is of utmost importance. Properly relating this uncertainty to the true underlying variation in the genome, especially variation between and among populations, will be essential for projects that use sec-gen sequencing data to meet their scientific goals. Although genome sequencing is the application that has received the most attention, sec-gen technology is also being used to produce quantitative measurements related to applications previously associated with microarrays. Of these, chromatin immunoprecipitation followed by sequencing (ChIP-Seq) has been the most successful. Existing tools have been developed for analyzing one sample at a time. Methodology for drawing inference from multiple samples has not yet been developed.
The demand for such methods will increase rapidly as the technology becomes more economical and multiple samples become standard. Other applications for which statistical methodology is needed are RNA and microRNA transcription analysis. In all these sequencing applications, a number of critical steps are required to convert raw intensity measures into the sequence reads that will be used in downstream analysis. Ad hoc approaches, which assign weights to each base call, are unsuitable. Our goal is to create a sound and unified statistical and computational methodology for representing and managing uncertainty throughout the sec-gen sequencing data analysis pipeline, built on a robust, modular, and extensible software platform.

Public Health Relevance

Second-generation sequencing technology is poised to radically change how genomic data is obtained and used. These datasets also present unprecedented challenges in statistical analysis, and modeling and quantifying the uncertainty inherent in the generation of sequencing reads is of utmost importance. We will develop data analysis tools for widely used applications using statistical methods that account for this uncertainty.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG005220-01
Application #: 7765408
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Brooks, Lisa

Project Start: 2010-08-11
Project End: 2013-05-31
Budget Start: 2010-08-11
Budget End: 2011-05-31
Support Year: 1
Fiscal Year: 2010
Total Cost: $410,000
Indirect Cost

Institution

Name: Johns Hopkins University
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health
DUNS #: 001910777

City: Baltimore
State: MD
Country: United States
Zip Code: 21218

Related projects


NIH 2018 R01 HG	Overcoming bias and unwanted variability in next generation sequencing	Irizarry, Rafael Angel / Dana-Farber Cancer Institute
NIH 2017 R01 HG	Overcoming bias and unwanted variability in next generation sequencing	Irizarry, Rafael Angel / Dana-Farber Cancer Institute	$540,000
NIH 2016 R01 HG	Overcoming bias and unwanted variability in next generation sequencing	Irizarry, Rafael Angel / Dana-Farber Cancer Institute	$600,000
NIH 2015 R01 HG	Overcoming bias and unwanted variability in next generation sequencing	Irizarry, Rafael Angel / Dana-Farber Cancer Institute	$600,000
NIH 2012 R01 HG	Analysis Tools and Software for Second Generation Sequencing Data	Irizarry, Rafael Angel / Johns Hopkins University	$405,900
NIH 2012 R01 HG	Analysis Tools and Software for Second Generation Sequencing Data	Irizarry, Rafael Angel / Dana-Farber Cancer Institute	$83,810
NIH 2011 R01 HG	Analysis Tools and Software for Second Generation Sequencing Data	Irizarry, Rafael Angel / Johns Hopkins University	$405,900
NIH 2010 R01 HG	Analysis Tools and Software for Second Generation Sequencing Data	Irizarry, Rafael Angel / Johns Hopkins University	$410,000

Publications

Takeda, David Y; Spisák, Sándor; Seo, Ji-Heui et al. (2018) A Somatically Acquired Enhancer of the Androgen Receptor Is a Noncoding Driver in Advanced Prostate Cancer. Cell 174:422-432.e13

Kumar, M Senthil; Slud, Eric V; Okrah, Kwame et al. (2018) Analysis and correction of compositional bias in sparse sequencing count data. BMC Genomics 19:799

Nazario-Toole, Ashley E; Robalino, Javier; Okrah, Kwame et al. (2018) The Splicing Factor RNA-Binding Fox Protein 1 Mediates the Cellular Immune Response in Drosophila melanogaster. J Immunol 201:1154-1164

Shukla, Chinmay J; McCorkindale, Alexandra L; Gerhardinger, Chiara et al. (2018) High-throughput identification of RNA nuclear enrichment sequences. EMBO J 37:

Wu, Gang; Ruben, Marc D; Schmidt, Robert E et al. (2018) Population-level rhythms in human skin with implications for circadian medicine. Proc Natl Acad Sci U S A 115:12313-12318

Hicks, Stephanie C; Townes, F William; Teng, Mingxiang et al. (2018) Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics 19:562-578

McIver, Lauren J; Abu-Ali, Galeb; Franzosa, Eric A et al. (2018) bioBakery: a meta'omic analysis environment. Bioinformatics 34:1235-1237

Sinha, Rashmi; Abu-Ali, Galeb; Vogtmann, Emily et al. (2017) Assessment of variation in microbial community amplicon sequencing by the Microbiome Quality Control (MBQC) project consortium. Nat Biotechnol 35:1077-1086

Parker, Margaret M; Chase, Robert P; Lamb, Andrew et al. (2017) RNA sequencing identifies novel non-coding RNA and exon-specific effects associated with cigarette smoking. BMC Med Genomics 10:58

Patro, Rob; Duggal, Geet; Love, Michael I et al. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417-419

Showing the most recent 10 out of 53 publications
