Batch effects in molecular profiling data on cancers: detection, quantitation, interpretation, and correction

Akbani, Rehan; Mills, Gordon; Weinstein, John

Abstract

Technical batch effects pose a fundamental challenge to quality control and reproducibility of even single-laboratory research projects, but the possibilities for serious error are greatly magnified in complex, multi-institutional enterprises such as the cancer molecular profiling projects being undertaken by the NCI Center for Cancer Genomics (CCG). To aid in detection, quantitation, interpretation, and (when appropriate) correction for technical batch effects in such data, we have developed the MBatch computational tool and web portal. MBatch has become indispensable for quality-control "surveillance" of data in The Cancer Genome Atlas (TCGA) project, but detecting and quantitating batch effects (or trend effects or statistical outliers) are just the first steps in a process. The next steps involve detective work in collaboration with those who generated the data, drawing upon expertise in integrative analysis across data types, pathways, and systems-level biology. That detective work usually succeeds in diagnosing the cause of a batch effect as technical or biological. If technical, then computational correction can be done (judiciously). The primary aim of the proposed Genome Data Analysis Center (GDAC) is to translate that successful quality-control model from TCGA to other current and future large-scale molecular profiling projects sponsored by the CCG. We will be ready to do that on Day 1.
The second aim is to increase the power of MBatch to perform the basic quality-control functions. We will add a number of innovative new algorithms (Replicates-Based Normalization, Empirical Bayes++, and CorNet) and increase the repertoire of standard methods. We will also add major visualization resources including our interactive Next-Generation Clustered Heat Maps.
The third aim is to make the system sufficiently robust, user-friendly, interactive, carefully documented, and easy to install that bench biologists and clinical researchers can use it to explore CCG-generated data or their own. Toward those ends, we have established collaborations to implement MBatch in Galaxy and on the cloud. We bring a number of assets to the proposed GDAC, including (i) multidisciplinary expertise in bioinformatics, biostatistics, software engineering, biology, and clinical oncology; (ii) PIs with a combined 21 years of experience in high-throughput molecular profiling studies of clinical cancers (in a highly consortial context); (iii) international leadership in batch effects analysis; (iv) a highly professional software engineering team with a track record of producing high-end, highly visual bioinformatics packages and websites; (v) a team of 20 Analysts whose expertise can be called on; (vi) extensive computing resources, including one of the most powerful academically based machines in the world; (vii) strong institutional support; and (viii) close working relationships with first-class basic, translational, and clinical researchers throughout MD Anderson, one of the foremost cancer centers in the country. The bottom-line mission of the GDAC will be to aid the research community's effort to understand cancer and to prevent, detect, diagnose, and treat it more effectively for the benefit of patients and their families.

Public Health Relevance

The principal goals of the Genome Data Analysis Center proposed here are (i) to protect against "batch effect" quality-control problems in the data from large molecular profiling studies on cancer that are being undertaken by the National Cancer Institute's Center for Cancer Genomics; (ii) to provide the research community as a whole with user-friendly bioinformatic tools for doing so in those and other projects; and (iii) to participate actively in the Genomic Data Analysis Network, which is being established to bring together the best minds in cancer genomics for medical advances that benefit cancer patients and their families.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Cancer Institute (NCI)
Type: Resource-Related Research Projects--Cooperative Agreements (U24)
Project #: 5U24CA210949-04
Application #: 9789027
Study Section: Special Emphasis Panel (ZCA1)
Program Officer: Yang, Liming

Project Start: 2016-09-13
Project End: 2021-08-31
Budget Start: 2019-09-01
Budget End: 2020-08-31
Support Year: 4
Fiscal Year: 2019
Total Cost
Indirect Cost

Institution

Name: University of Texas MD Anderson Cancer Center
Department: Biostatistics & Other Math Sci
Type: Hospitals
DUNS #: 800772139

City: Houston
State: TX
Country: United States
Zip Code: 77030

Related projects


NIH 2020 U24 CA	Batch effects in molecular profiling data on cancers: detection, quantitation, interpretation, and correction Akbani, Rehan; Mills, Gordon B.; Weinstein, John N. / University of Texas MD Anderson Cancer Center
NIH 2019 U24 CA	Batch effects in molecular profiling data on cancers: detection, quantitation, interpretation, and correction Akbani, Rehan; Mills, Gordon B.; Weinstein, John N. / University of Texas MD Anderson Cancer Center
NIH 2018 U24 CA	Batch effects in molecular profiling data on cancers: detection, quantitation, interpretation, and correction Akbani, Rehan; Mills, Gordon B.; Weinstein, John N. / University of Texas MD Anderson Cancer Center
NIH 2017 U24 CA	Batch effects in molecular profiling data on cancers: detection, quantitation, interpretation, and correction Akbani, Rehan; Mills, Gordon B.; Weinstein, John N. / University of Texas MD Anderson Cancer Center
NIH 2016 U24 CA	Batch effects in molecular profiling data on cancers: detection, quantitation, interpretation, and correction Weinstein, John N.; Akbani, Rehan; Mills, Gordon B. / University of Texas MD Anderson Cancer Center	$418,519

Publications

Huang, Kuan-Lin; Mashl, R Jay; Wu, Yige et al. (2018) Pathogenic Germline Variants in 10,389 Adult Cancers. Cell 173:355-370.e14

Chiu, Hua-Sheng; Somvanshi, Sonal; Patel, Ektaben et al. (2018) Pan-Cancer Analysis of lncRNA Regulation Supports Their Targeting of Cancer Genes in Each Tumor Context. Cell Rep 23:297-312.e12

Ding, Li; Bailey, Matthew H; Porta-Pardo, Eduard et al. (2018) Perspective on Oncogenic Processes at the End of the Beginning of Cancer Genomics. Cell 173:305-320.e10

Seiler, Michael; Peng, Shouyong; Agrawal, Anant A et al. (2018) Somatic Mutational Landscape of Splicing Factor Genes and Their Functional Consequences across 33 Cancer Types. Cell Rep 23:282-296.e4

Liu, Yang; Sethi, Nilay S; Hinoue, Toshinori et al. (2018) Comparative Molecular Analysis of Gastrointestinal Adenocarcinomas. Cancer Cell 33:721-735.e8

Wang, Zehua; Yang, Bo; Zhang, Min et al. (2018) lncRNA Epigenetic Landscape Analysis Identifies EPIC1 as an Oncogenic lncRNA that Interacts with MYC and Promotes Cell-Cycle Progression in Cancer. Cancer Cell 33:706-720.e9

Taylor, Alison M; Shih, Juliann; Ha, Gavin et al. (2018) Genomic and Functional Approaches to Understanding Cancer Aneuploidy. Cancer Cell 33:676-689.e3

Saltz, Joel; Gupta, Rajarsi; Hou, Le et al. (2018) Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images. Cell Rep 23:181-193.e7

Malta, Tathiane M; Sokolov, Artem; Gentles, Andrew J et al. (2018) Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation. Cell 173:338-354.e15

Ellrott, Kyle; Bailey, Matthew H; Saksena, Gordon et al. (2018) Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Syst 6:271-281.e7

Showing the most recent 10 out of 50 publications
