Combining genomic data sets from multiple studies is advantageous for increasing statistical power when logistical considerations restrict sample size or require the sequential generation of data. However, significant technical heterogeneity is commonly observed across data generated in different batches, experiments, or profiling platforms. These so-called batch effects often confound true biological relationships in the data, reducing the power gained by combining multiple batches and potentially leading to spurious results. Many methods have been proposed to filter technical heterogeneity and batch effects from genomic data, but significant gaps remain. For example, existing approaches assume bell-shaped, symmetric data distributions, which are not appropriate for modern sequencing count data. Furthermore, there are no current batch adjustment approaches for genomic data that measure features at a fine scale, for example epigenetic sequencing data, where nearby features are likely to be closely correlated. In addition, current batch adjustment methods depend on the particular batches of data on hand, meaning that if additional batches were added to the analysis, the adjustments would need to be reapplied, yielding different adjusted genomic data values. Batch correction also usually introduces correlation into the adjusted data, which must be accounted for in downstream analyses; most researchers who perform batch correction before additional analysis steps are unaware of this impact and, as a result, often apply downstream analysis tools incorrectly. Finally, it is not always clear which batch adjustment method should be applied in a given setting, so a thorough evaluation is required before an appropriate batch correction strategy can be devised. These gaps highlight the need for new statistical methods and interactive visualization software to support researchers in this area. We propose to develop algorithms and software to address these specific gaps facing researchers combining data from multiple experimental batches.
Significant technical heterogeneity and batch effects are commonly observed across multiple batches of genomic data. We will develop algorithms, analysis workflows, and software to address specific gaps facing researchers combining data from genomic or epigenomic experiments. Specifically, we will develop algorithms and software for (1) integrating data from sequencing studies, (2) creating reference standards for batch adjustment, (3) accounting for the impacts of batch adjustment, and (4) identifying and visualizing batch effects.