Technical batch effects pose a fundamental challenge to quality control and reproducibility even in single-laboratory research projects, but the possibilities for serious error are greatly magnified in complex, multi-institutional enterprises such as the cancer molecular profiling projects being undertaken by the NCI Center for Cancer Genomics (CCG). To aid in detection, quantitation, interpretation, and (when appropriate) correction of technical batch effects in such data, we have developed the MBatch computational tool and web portal. MBatch has become indispensable for quality-control "surveillance" of data in The Cancer Genome Atlas (TCGA) project, but detecting and quantitating batch effects (or trend effects or statistical outliers) are just the first steps in a process. The next steps involve detective work in collaboration with those who generated the data, drawing upon expertise in integrative analysis across data types, pathways, and systems-level biology. That detective work usually succeeds in diagnosing the cause of a batch effect as technical or biological. If the cause is technical, then computational correction can be done (judiciously). The primary aim of the proposed Genome Data Analysis Center (GDAC) is to translate that successful quality-control model from TCGA to other current and future large-scale molecular profiling projects sponsored by the CCG. We will be ready to do that on Day 1.
The second aim is to increase the power of MBatch to perform the basic quality-control functions. We will add a number of innovative new algorithms (Replicates-Based Normalization, Empirical Bayes++, and CorNet) and increase the repertoire of standard methods. We will also add major visualization resources, including our interactive Next-Generation Clustered Heat Maps.
The third aim is to make the system sufficiently robust, user-friendly, interactive, carefully documented, and easy to install that bench biologists and clinical researchers can use it to explore CCG-generated data or their own. Toward those ends, we have established collaborations to implement MBatch in Galaxy and on the cloud.
We bring a number of assets to the proposed GDAC, including (i) multidisciplinary expertise in bioinformatics, biostatistics, software engineering, biology, and clinical oncology; (ii) PIs with a combined 21 years of experience in high-throughput molecular profiling studies of clinical cancers (in a highly consortial context); (iii) international leadership in batch effects analysis; (iv) a highly professional software engineering team with a track record of producing high-end, highly visual bioinformatics packages and websites; (v) a team of 20 Analysts whose expertise can be called on; (vi) extensive computing resources, including one of the most powerful academically based machines in the world; (vii) strong institutional support; and (viii) close working relationships with first-class basic, translational, and clinical researchers throughout MD Anderson, one of the foremost cancer centers in the country. The bottom-line mission of the GDAC will be to aid the research community's effort to understand cancer and to prevent, detect, diagnose, and treat it more effectively for the benefit of patients and their families.

Public Health Relevance

The principal goals of the Genome Data Analysis Center proposed here are (i) to protect against "batch effect" quality-control problems in the data from large molecular profiling studies on cancer that are being undertaken by the National Cancer Institute's Center for Cancer Genomics; (ii) to provide the research community as a whole with user-friendly bioinformatic tools for doing so in those and other projects; and (iii) to participate actively in the Genomic Data Analysis Network, which is being established to bring together the best minds in cancer genomics for medical advances that benefit cancer patients and their families.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Resource-Related Research Projects--Cooperative Agreements (U24)
Study Section
Special Emphasis Panel (ZCA1)
Program Officer
Yang, Liming
University of Texas MD Anderson Cancer Center
Biostatistics & Other Math Sci
United States
Wang, Zehua; Yang, Bo; Zhang, Min et al. (2018) lncRNA Epigenetic Landscape Analysis Identifies EPIC1 as an Oncogenic lncRNA that Interacts with MYC and Promotes Cell-Cycle Progression in Cancer. Cancer Cell 33:706-720.e9
Taylor, Alison M; Shih, Juliann; Ha, Gavin et al. (2018) Genomic and Functional Approaches to Understanding Cancer Aneuploidy. Cancer Cell 33:676-689.e3
Saltz, Joel; Gupta, Rajarsi; Hou, Le et al. (2018) Spatial Organization and Molecular Correlation of Tumor-Infiltrating Lymphocytes Using Deep Learning on Pathology Images. Cell Rep 23:181-193.e7
Malta, Tathiane M; Sokolov, Artem; Gentles, Andrew J et al. (2018) Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation. Cell 173:338-354.e15
Ellrott, Kyle; Bailey, Matthew H; Saksena, Gordon et al. (2018) Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Syst 6:271-281.e7
Campbell, Joshua D; Yau, Christina; Bowlby, Reanne et al. (2018) Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas. Cell Rep 23:194-212.e6
Gao, Qingsong; Liang, Wen-Wei; Foltz, Steven M et al. (2018) Driver Fusions and Their Implications in the Development and Treatment of Human Cancers. Cell Rep 23:227-238.e3
Thorsson, Vésteinn; Gibbs, David L; Brown, Scott D et al. (2018) The Immune Landscape of Cancer. Immunity 48:812-830.e14
Radovich, Milan; Pickering, Curtis R; Felau, Ina et al. (2018) The Integrated Genomic Landscape of Thymic Epithelial Tumors. Cancer Cell 33:244-258.e10
Shen, Hui; Shih, Juliann; Hollern, Daniel P et al. (2018) Integrated Molecular Characterization of Testicular Germ Cell Tumors. Cell Rep 23:3392-3406