Technical batch effects pose a fundamental challenge to quality control and reproducibility even in single-laboratory research projects, and the possibilities for serious error are greatly magnified in complex, multi-institutional enterprises such as the cancer molecular profiling projects being undertaken by the NCI Center for Cancer Genomics (CCG). To aid in the detection, quantitation, interpretation, and (when appropriate) correction of technical batch effects in such data, we have developed the MBatch computational tool and web portal. MBatch has become indispensable for quality-control "surveillance" of data in The Cancer Genome Atlas (TCGA) project, but detecting and quantitating batch effects (or trend effects or statistical outliers) are just the first steps in a process. The next steps involve detective work in collaboration with those who generated the data, drawing upon expertise in integrative analysis across data types, pathways, and systems-level biology. That detective work usually succeeds in diagnosing the cause of a batch effect as technical or biological. If the cause is technical, computational correction can be applied (judiciously). The primary aim of the proposed Genome Data Analysis Center (GDAC) is to translate that successful quality-control model from TCGA to other current and future large-scale molecular profiling projects sponsored by the CCG. We will be ready to do that on Day 1.
The second aim is to increase the power of MBatch to perform the basic quality-control functions. We will add a number of new algorithms (Replicates-Based Normalization, Empirical Bayes++, and CorNet) and expand the repertoire of standard methods. We will also add major visualization resources, including our interactive Next-Generation Clustered Heat Maps.
The third aim is to make the system sufficiently robust, user-friendly, interactive, carefully documented, and easy to install that bench biologists and clinical researchers can use it to explore CCG-generated data or their own. Toward those ends, we have established collaborations to implement MBatch in Galaxy and on the cloud. We bring a number of assets to the proposed GDAC, including (i) multidisciplinary expertise in bioinformatics, biostatistics, software engineering, biology, and clinical oncology; (ii) PIs with a combined 21 years of experience in high-throughput molecular profiling studies of clinical cancers (in a highly consortial context); (iii) international leadership in batch effects analysis; (iv) a highly professional software engineering team with a track record of producing high-end, highly visual bioinformatics packages and websites; (v) a team of 20 Analysts whose expertise can be called on; (vi) extensive computing resources, including one of the most powerful academically based machines in the world; (vii) strong institutional support; and (viii) close working relationships with first-class basic, translational, and clinical researchers throughout MD Anderson, one of the foremost cancer centers in the country. The bottom-line mission of the GDAC will be to aid the research community's effort to understand cancer and to prevent, detect, diagnose, and treat it more effectively for the benefit of patients and their families.

Public Health Relevance

The principal goals of the Genome Data Analysis Center proposed here are (i) to protect against "batch effect" quality-control problems in the data from large molecular profiling studies on cancer that are being undertaken by the National Cancer Institute's Center for Cancer Genomics; (ii) to provide the research community as a whole with user-friendly bioinformatic tools for doing so in those and other projects; and (iii) to participate actively in the Genomic Data Analysis Network, which is being established to bring together the best minds in cancer genomics for medical advances that benefit cancer patients and their families.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Resource-Related Research Projects--Cooperative Agreements (U24)
Study Section
Special Emphasis Panel (ZCA1)
Program Officer
Yang, Liming
University of Texas MD Anderson Cancer Center
Biostatistics & Other Math Sci
United States
Radovich, Milan; Pickering, Curtis R; Felau, Ina et al. (2018) The Integrated Genomic Landscape of Thymic Epithelial Tumors. Cancer Cell 33:244-258.e10
Shen, Hui; Shih, Juliann; Hollern, Daniel P et al. (2018) Integrated Molecular Characterization of Testicular Germ Cell Tumors. Cell Rep 23:3392-3406
Berger, Ashton C; Korkut, Anil; Kanchi, Rupa S et al. (2018) A Comprehensive Pan-Cancer Molecular Study of Gynecologic and Breast Cancers. Cancer Cell 33:690-705.e9
Corces, M Ryan; Granja, Jeffrey M; Shams, Shadi et al. (2018) The chromatin accessibility landscape of primary human cancers. Science 362:
Hoadley, Katherine A; Yau, Christina; Hinoue, Toshinori et al. (2018) Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. Cell 173:291-304.e6
Schaub, Franz X; Dhankani, Varsha; Berger, Ashton C et al. (2018) Pan-cancer Alterations of the MYC Oncogene and Its Proximal Network across the Cancer Genome Atlas. Cell Syst 6:282-300.e2
Liu, Jianfang; Lichtenberg, Tara; Hoadley, Katherine A et al. (2018) An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell 173:400-416.e11
Sanchez-Vega, Francisco; Mina, Marco; Armenia, Joshua et al. (2018) Oncogenic Signaling Pathways in The Cancer Genome Atlas. Cell 173:321-337.e10
Way, Gregory P; Sanchez-Vega, Francisco; La, Konnor et al. (2018) Machine Learning Detects Pan-cancer Ras Pathway Activation in The Cancer Genome Atlas. Cell Rep 23:172-180.e3
Ge, Zhongqi; Leighton, Jake S; Wang, Yumeng et al. (2018) Integrated Genomic Analysis of the Ubiquitin Pathway across Cancer Types. Cell Rep 23:213-226.e3

Showing the most recent 10 out of 50 publications