Combining genomic data sets from multiple studies increases statistical power when logistical considerations restrict sample size or require data to be generated sequentially. However, significant technical heterogeneity is commonly observed across batches of data generated in different experiments or on different profiling platforms. These so-called batch effects often confound true biological relationships in the data, reducing the power benefits of combining multiple batches, and may even lead to spurious results. Many methods have been proposed to filter technical heterogeneity and batch effects from genomic data, but significant gaps remain. For example, existing approaches assume bell-shaped, symmetric data, an assumption that is not appropriate for modern sequencing count data. Furthermore, there are no current approaches for addressing batch effects in genomic data that measure features at a refined level, for example epigenetic sequencing data, where nearby features are likely to be closely correlated. Current batch adjustment methods also depend on the data batches on hand: if additional batches were added to the analysis, the batch adjustments would need to be reapplied, resulting in different adjusted genomic data values. In addition, batch correction usually introduces correlation into the adjusted data, which needs to be accounted for in downstream analyses; most researchers who perform batch correction before additional analysis steps are unaware of this negative impact and, as a result, often apply downstream analysis tools incorrectly. Finally, it is not always clear which batch adjustment method should be applied in a particular case, so a thorough evaluation is required before an appropriate batch correction strategy can be devised.
These gaps highlight the need for new statistical methods and interactive visualization software to meet the needs of researchers in this area. We propose to develop algorithms and software that address these specific gaps for researchers combining data from multiple experimental batches.

Public Health Relevance

Significant technical heterogeneity and batch effects are commonly observed across multiple batches of data. We will develop algorithms, analysis workflows, and software to address specific gaps facing researchers combining data from genomic or epigenomic experiments. Specifically, we will develop algorithms and software for (1) integrating data from sequencing studies, (2) creating reference standards for batch adjustment, (3) accounting for the impacts of batch adjustment, and (4) identifying and visualizing batch effects.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM127430-02
Application #
9691433
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Ravichandran, Veerasamy
Project Start
2018-05-01
Project End
2022-04-30
Budget Start
2019-05-01
Budget End
2020-04-30
Support Year
2
Fiscal Year
2019
Total Cost
Indirect Cost
Name
Boston University
Department
Internal Medicine/Medicine
Type
Schools of Medicine
DUNS #
604483045
City
Boston
State
MA
Country
United States
Zip Code
02118