Removing batch effects in genomic and epigenomic studies

Johnson, William

Abstract

Combining genomic datasets from multiple studies is advantageous for increasing statistical power in studies where logistical considerations restrict sample size or require the sequential generation of data. However, significant technical heterogeneity is commonly observed across data generated in different batches, experiments, or profiling platforms. These so-called batch effects often confound true biological relationships in the data, reducing the power benefits of combining multiple batches of data, and may even lead to spurious results. Many methods have been proposed to filter technical heterogeneity and batch effects from genomic data. However, significant gaps remain that must be addressed to filter technical heterogeneity from genomic datasets more appropriately. For example, existing approaches assume bell-shaped, symmetric data, an assumption that is not appropriate for modern sequencing count data. Furthermore, there are no current approaches for handling batch effects in genomic data that measure features at a refined level, for example epigenetic sequencing data, where nearby features are likely to be closely correlated. Current batch adjustment methods are dependent on the data batches on hand, meaning that if additional batches of data were added to the analysis, the batch adjustments would need to be reapplied, resulting in different adjusted genomic data values. In addition, batch correction usually introduces correlation into the adjusted data, which needs to be accounted for in downstream analyses; most researchers performing batch correction before additional analysis steps are unaware of this negative impact, and as a result often apply downstream analysis tools incorrectly. Finally, it is not always clear which batch adjustment methods should be applied in each particular case, so a thorough evaluation is required before an appropriate batch correction strategy can be devised.
These gaps highlight the need for new statistical methods and interactive visualization software to meet the needs of researchers in this area. We propose to develop algorithms and software to address these specific gaps facing researchers who combine data from multiple experimental batches.

Public Health Relevance

Significant technical heterogeneity and batch effects are commonly observed across multiple batches of data. We will develop algorithms, analysis workflows, and software to address specific gaps facing researchers combining data from genomic or epigenomic experiments. We will develop algorithms and software for (1) integrating data from sequencing studies, (2) creating reference standards for batch adjustment, (3) accounting for the impacts of batch adjustment, and (4) identifying and visualizing batch effects.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM127430-02
Application #: 9691433
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Ravichandran, Veerasamy

Project Start: 2018-05-01
Project End: 2022-04-30
Budget Start: 2019-05-01
Budget End: 2020-04-30
Support Year: 2
Fiscal Year: 2019

Institution

Name: Boston University
Department: Internal Medicine/Medicine
Type: Schools of Medicine
DUNS #: 604483045

City: Boston
State: MA
Country: United States
Zip Code: 02118

Related projects


NIH 2020 R01 GM	Removing batch effects in genomic and epigenomic studies Johnson, William Evan / Boston University
NIH 2019 R01 GM	Removing batch effects in genomic and epigenomic studies Johnson, William Evan / Boston University
NIH 2018 R01 GM	Removing batch effects in genomic and epigenomic studies Johnson, William Evan / Boston University
