This proposal develops scalable R/Bioconductor software infrastructure and data resources to integrate complex, heterogeneous, and large cancer genomic experiments. The falling cost of genomic assays facilitates collection of multiple data types (e.g., gene and transcript expression, structural variation, copy number, methylation, and microRNA data) from a set of clinical specimens. Furthermore, substantial resources are now available from large consortium activities like The Cancer Genome Atlas (TCGA). Existing analysis pipelines focus on the treatment of a specific data type, leaving a critical need for tools for integrative analysis of multiple genomic assays, whether the data are locally generated or publicly available. R/Bioconductor has historically provided standardized genomic data structures and annotations that have enjoyed widespread adoption in the cancer genomics research community. This proposal adapts R/Bioconductor to meet the increasing conceptual and computational complexity of multi-assay cancer genomic experiments. We begin by developing software containers for coordinated representation, manipulation, and transformation of heterogeneous derived data from multiple cancer genomic assays. These containers are then extended to manage very large primary data resources. To facilitate integration of local experimental results with major public cancer genomics experiment data sets and annotations, we re-package public resources and provide software and cloud-based facilities for easy and fast programmatic access from within R/Bioconductor. This greatly simplifies cancer genomic analysis tasks that otherwise require significant, error-prone individual efforts. Finally, we provide software infrastructure to enable high-throughput computation using parallel and iterative approaches.
The ability to manipulate multi-assay cancer genomic experiments and to understand individual experimental results in the context of public experiments and annotations, together with facilities for improved high-throughput computational performance in a well-established computing environment, greatly enhances opportunities for analysis and comprehension of large multi-assay cancer genomic experiments.

Public Health Relevance

Researchers collect diverse types of complex genetic information about factors that contribute to cancer. This proposal helps researchers manage and analyze this information using advanced computational and statistical approaches.

Agency
National Institutes of Health (NIH)
Institute
National Cancer Institute (NCI)
Type
Resource-Related Research Projects--Cooperative Agreements (U24)
Project #
1U24CA180996-01A1
Application #
8787350
Study Section
Special Emphasis Panel (ZCA1)
Program Officer
Chen, Huann-Sheng
Project Start
2014-09-01
Project End
2019-08-31
Budget Start
2014-09-01
Budget End
2015-08-31
Support Year
1
Fiscal Year
2014
Name
Fred Hutchinson Cancer Research Center
City
Seattle
State
WA
Country
United States
Zip Code
98109
Carlson, Marc R J; Pagès, Hervé; Arora, Sonali et al. (2016) Genomic Annotation Resources in R/Bioconductor. Methods Mol Biol 1418:67-90
Kannan, Lavanya; Ramos, Marcel; Re, Angela et al. (2016) Public data and open source tools for multi-assay genomic investigation of disease. Brief Bioinform 17:603-15
Spratt, Daniel E; Chan, Tiffany; Waldron, Levi et al. (2016) Racial/Ethnic Disparities in Genomic Sequencing. JAMA Oncol 2:1070-4
Waldron, Levi; Riester, Markus; Ramos, Marcel et al. (2016) The Doppelgänger Effect: Hidden Duplicates in Databases of Transcriptome Profiles. J Natl Cancer Inst 108: