Second-generation sequencing (sec-gen) technology is poised to radically change how genomic data is obtained and used. Capable of sequencing millions of short strands of DNA in parallel, this technology can be used to assemble complex genomes for a small fraction of the price and time of previous technologies. In fact, a recently formed international consortium, the 1000 Genomes Project, plans to sequence the genomes of approximately 1,200 people. Comparative analysis at the sequence level of a large number of samples across multiple populations may be achievable within the next five years. These datasets also present unprecedented challenges in statistical analysis and data management. For example, a central goal of the 1000 Genomes Project is to quantify across-sample variation at the single-nucleotide level. At this resolution, even small sequencing error rates become significant, especially for rare variants. Furthermore, sec-gen sequencing is a relatively new technology for which potential biases and sources of obscuring variation are not yet fully understood. Therefore, modeling and quantifying the uncertainty inherent in the generation of sequencing reads is of utmost importance. Properly relating this uncertainty to the true underlying variation in the genome, especially variation between and among populations, will be essential for projects that use sec-gen sequencing data to meet their scientific goals.

Although genome sequencing is the application that has received the most attention, sec-gen technology is also being used to produce quantitative measurements for applications previously associated with microarrays. Of these, chromatin immunoprecipitation followed by sequencing (ChIP-Seq) has been the most successful. Existing tools were developed for analyzing one sample at a time; methodology for drawing inference from multiple samples has not yet been developed. The demand for such methods will increase rapidly as the technology becomes more economical and multiple samples become standard. Other applications for which statistical methodology is needed are RNA and microRNA transcription analysis. In all these sequencing applications, a number of critical steps are required to convert raw intensity measures into the sequence reads that will be used in downstream analysis. Ad hoc approaches that assign weights to each base call are unsuitable. Our goal is to create a sound and unified statistical and computational methodology for representing and managing uncertainty throughout the sec-gen sequencing data analysis pipeline, built on a robust, modular, and extensible software platform.

Public Health Relevance

Second-generation sequencing technology is poised to radically change how genomic data is obtained and used. The resulting datasets present unprecedented challenges in statistical analysis, and modeling and quantifying the uncertainty inherent in the generation of sequencing reads is of utmost importance. We will develop data analysis tools for widely used applications using statistical methods that account for this uncertainty.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG005220-03
Application #: 8280415
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Brooks, Lisa
Project Start: 2010-08-11
Project End: 2014-05-31
Budget Start: 2012-06-01
Budget End: 2014-05-31
Support Year: 3
Fiscal Year: 2012
Total Cost: $405,900
Indirect Cost: $158,400

Name: Johns Hopkins University
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health
DUNS #: 001910777
City: Baltimore
State: MD
Country: United States
Zip Code: 21218

Alemu, Elfalem Y; Carl Jr, Joseph W; Corrada Bravo, Héctor et al. (2014) Determinants of expression variability. Nucleic Acids Res 42:3503-14
Chelaru, Florin; Smith, Llewellyn; Goldstein, Naomi et al. (2014) Epiviz: interactive visual analytics for functional genomics data. Nat Methods 11:938-40
Ye, Chengxi; Hsiao, Chiaowen; Corrada Bravo, Héctor (2014) BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution. Bioinformatics 30:1214-9
Pop, Mihai; Walker, Alan W; Paulson, Joseph et al. (2014) Diarrhea in young children from low-income countries leads to large-scale alterations in intestinal microbiota composition. Genome Biol 15:R76
Scharpf, Robert B; Mireles, Lynn; Yang, Qiong et al. (2014) Copy number polymorphisms near SLC2A9 are associated with serum uric acid concentrations. BMC Genet 15:81
Halper-Stromberg, Eitan; Steranka, Jared; Burns, Kathleen H et al. (2014) Visualization and probability-based scoring of structural variants within repetitive sequences. Bioinformatics 30:1514-21
Frazee, Alyssa C; Sabunciyan, Sarven; Hansen, Kasper D et al. (2014) Differential expression analysis of RNA-seq data at single-base resolution. Biostatistics 15:413-26
Hansen, Kasper D; Sabunciyan, Sarven; Langmead, Ben et al. (2014) Large-scale hypomethylated blocks associated with Epstein-Barr virus-induced B-cell immortalization. Genome Res 24:177-84
Halper-Stromberg, Eitan; Steranka, Jared; Giraldo-Castillo, Nicolas et al. (2013) Fine mapping of V(D)J recombinase mediated rearrangements in human lymphoid malignancies. BMC Genomics 14:565
Paulson, Joseph N; Stine, O Colin; Corrada Bravo, Héctor et al. (2013) Differential abundance analysis for microbial marker-gene surveys. Nat Methods 10:1200-2

Showing the most recent 10 out of 21 publications