The spatial organization of the genome in the nucleus plays an important role in the transcriptional control of genes. Hi-C is currently the most widely used high-throughput technique for probing the genome-wide spatial organization of chromatin. However, Hi-C experiments involve multiple complex experimental steps, which introduce multiple sources of bias. Many data-analytical challenges must still be overcome to reach reliable and reproducible biological interpretations of the data. The small sample size of individual studies further limits the power and reliability of data analyses. When replicate samples are available, reproducibility across replicates indicates how trustworthy each identification is, and it can potentially be used to detect reproducible signals that are too modest to be detected reliably in individual samples. Even for samples from different cells, information may be borrowed through joint analyses to improve the identification of both topologically associating domains (TADs) and regions whose structure differs between samples. This project proposes to develop a suite of new statistical methods that use the reproducibility information provided by replicate samples to select reliable identifications and to improve the accuracy of peak calling and TAD calling. Furthermore, it proposes a joint analysis framework to identify condition-specific architectural differences across different cells.
Aim 1 will develop statistical methods to evaluate the reproducibility of identified chromatin loops and to select reproducible identifications. The reproducibility-based selection criterion complements the usual measure of significance on a single sample, and has the added benefit of being comparable across data sets, protocols, and different measures of significance.
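To make the idea of a reproducibility-based criterion concrete, the sketch below compares the top-ranked loops from two replicates. It is only an illustration under stated assumptions: the abstract does not specify the statistic the project will use, and the rank-overlap measure, the `Loop` representation, the function names, and the toy scores are all hypothetical. Because the measure depends only on ranks within each replicate, it does not require the two significance scores to be on the same scale, which is the property the aim emphasizes.

```python
from typing import Dict, Set, Tuple

# Hypothetical loop representation: (chromosome, anchor-1 bin, anchor-2 bin).
Loop = Tuple[str, int, int]


def top_k(scores: Dict[Loop, float], k: int) -> Set[Loop]:
    """Return the k highest-scoring loops (larger score = more significant)."""
    ranked = sorted(scores, key=lambda loop: scores[loop], reverse=True)
    return set(ranked[:k])


def reproducible_loops(rep1: Dict[Loop, float],
                       rep2: Dict[Loop, float], k: int) -> Set[Loop]:
    """Loops ranked in the top k of both replicates."""
    return top_k(rep1, k) & top_k(rep2, k)


def top_k_overlap(rep1: Dict[Loop, float],
                  rep2: Dict[Loop, float], k: int) -> float:
    """Fraction of the top-k list shared by both replicates.

    Assumes 1 <= k <= number of loops called in each replicate.
    """
    if k < 1 or k > min(len(rep1), len(rep2)):
        raise ValueError("k must be between 1 and the smaller call-set size")
    return len(reproducible_loops(rep1, rep2, k)) / k


# Toy scores (made up for illustration; not real Hi-C results).
rep1 = {("chr1", 10, 50): 9.0, ("chr1", 20, 80): 7.5,
        ("chr2", 5, 30): 6.0, ("chr2", 40, 90): 1.2}
rep2 = {("chr1", 10, 50): 8.1, ("chr2", 5, 30): 7.0,
        ("chr1", 20, 80): 2.0, ("chr3", 1, 9): 6.5}

for k in (1, 2, 3):
    print(k, top_k_overlap(rep1, rep2, k))
```

Plotting the overlap against `k` gives a rough picture of where the two replicates stop agreeing; a real analysis would also need to handle ties, nearby but non-identical anchors, and more than two replicates, none of which this sketch attempts.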
Aim 2 will develop robust, joint multi-sample peak calling and TAD calling methods. These methods will pool information across samples and properly account for variation across replicates, ultimately improving the power of the analysis and reducing false positives.
Aim 3 will develop statistical methods for detecting differences in TADs and other architectural features between cell types, cellular conditions, or disease states. Each proposed Aim includes rigorous evaluation of the methods' output using orthogonal epigenomic data, as well as experimental tests of hypotheses derived from the analytical results. These methods will enable users to generate reliable and robust scientific interpretations, and ultimately advance the understanding of nuclear organization and its role in gene expression and cellular function.
The 3D organization of the genome plays an important role in regulating gene expression, and alterations in 3D architecture can lead to cancer or other diseases. Hi-C data provide a genome-wide view of genome architecture, but many challenges remain in their analysis. Our proposed statistical tools will improve the reliability of interpretations derived from Hi-C data, and hence will increase the robustness of the molecular understanding of human diseases.