The spatial organization of the genome in the nucleus plays an important role in the transcriptional control of genes. Currently, Hi-C is the most widely used high-throughput technique for probing the genome-wide spatial organization of chromatin. However, Hi-C experiments involve multiple complex experimental steps, which introduce various sources of bias. Many data-analytical challenges must still be overcome to reach reliable and reproducible biological interpretations of the data. The small sample size of each individual study further limits the power and reliability of data analyses. When replicate samples are available, reproducibility across replicates indicates the fidelity of the identifications, and it can potentially be used to detect reproducible signals that are too modest to be detected reliably in individual samples. Even for samples from different cells, information may be borrowed through joint analyses to improve the identification of both topologically associating domains (TADs) and regions with different structures. This project proposes to develop a suite of new statistical methods that use the reproducibility information provided by replicate samples to select reliable identifications and to improve the accuracy of peak calling and TAD calling. Furthermore, it proposes a joint analysis framework to identify condition-specific architectural differences across different cells.
Aim 1 will develop statistical methods to evaluate the reproducibility of identified chromatin loops and to select reproducible identifications. The reproducibility-based selection criterion complements the usual measure of significance on a single sample, but has the benefit of being comparable across data sets, protocols and different measures of significance.
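The comparability claim in Aim 1 can be illustrated with a small sketch. It is not the project's method: it uses a simple top-k overlap between two replicates as a stand-in for a reproducibility measure, and the function name, loop identifiers, and scores are all hypothetical. The point is only that a criterion built on ranks gives the same answer when the significance scores are re-expressed on a different (monotone) scale.

```python
import math


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the k top-ranked candidates shared by two replicates.

    scores_a, scores_b: dicts mapping a candidate loop ID to a significance
    score where larger means more significant (hypothetical inputs).
    """
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b) / k


# Made-up scores for five candidate loops in two replicates.
rep1 = {"loop1": 9.0, "loop2": 7.5, "loop3": 3.0, "loop4": 1.2, "loop5": 0.4}
rep2 = {"loop1": 8.1, "loop2": 2.0, "loop3": 6.6, "loop4": 0.9, "loop5": 1.5}

overlap_raw = top_k_overlap(rep1, rep2, k=2)

# Re-express replicate 2 on another scale with a monotone transform.
# The ranking is unchanged, so the overlap is unchanged as well.
rep2_rescaled = {name: math.log1p(score) for name, score in rep2.items()}
overlap_rescaled = top_k_overlap(rep1, rep2_rescaled, k=2)

print(overlap_raw, overlap_rescaled)
```

Because only the ordering of candidates enters the calculation, the same criterion could be applied to outputs scored by different callers or protocols, which is the property the Aim highlights.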
Aim 2 will develop robust, joint multi-sample peak calling and TAD calling methods. These methods will allow one to synergize information across samples and properly take account of variations across replicates, ultimately improving the power of the analysis and reducing false positives.
Aim 3 will develop statistical methods for detecting TAD and other architectural differences between different cell types, cellular conditions, or disease status. Included in each proposed Aim are rigorous evaluations of the output of these methods utilizing orthogonal epigenomic data and experimental tests of hypotheses derived from the results of the analytical methods. These methods will enable users to generate reliable and robust scientific interpretation, and ultimately advance the understanding of nuclear organization and its role in gene expression and cellular function.

Public Health Relevance

Three-dimensional (3D) genome organization plays an important role in regulating gene expression, and alterations in 3D architecture can lead to cancer or other diseases. Hi-C data provide a genome-wide view of genome architecture, but many challenges remain in their analysis. Our proposed statistical tools will improve the reliability of interpretations derived from Hi-C data, and hence will increase the robustness of the molecular understanding of human diseases.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
2R01GM109453-05
Application #
9740754
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Brazhnik, Paul
Project Start
2013-09-01
Project End
2023-03-31
Budget Start
2019-04-01
Budget End
2020-03-31
Support Year
5
Fiscal Year
2019
Total Cost
Indirect Cost
Name
Pennsylvania State University
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
003403953
City
University Park
State
PA
Country
United States
Zip Code
16802
Li, Qunhua; Zhang, Feipeng (2018) A regression framework for assessing covariate effects on the reproducibility of high-throughput experiments. Biometrics 74:803-813
Yang, Tao; Zhang, Feipeng; Yardımcı, Galip Gürkan et al. (2017) HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res 27:1939-1949
Zhang, Feipeng; Li, Qunhua (2017) Robust bent line regression. J Stat Plan Inference 185:41-55
Zhang, Feipeng; Li, Qunhua (2017) A Continuous Threshold Expectile Model. Comput Stat Data Anal 116:49-66
Charepalli, Venkata; Reddivari, Lavanya; Radhakrishnan, Sridhar et al. (2017) Pigs, Unlike Mice, Have Two Distinct Colonic Stem Cell Populations Similar to Humans That Respond to High-Calorie Diet prior to Insulin Resistance. Cancer Prev Res (Phila) 10:442-450
Song, C; Pan, X; Ge, Z et al. (2016) Epigenetic regulation of gene expression by Ikaros, HDAC1 and Casein Kinase II in leukemia. Leukemia 30:1436-40
Lyu, Yafei; Li, Qunhua (2016) A semi-parametric statistical model for integrating gene expression profiles across different platforms. BMC Bioinformatics 17 Suppl 1:5
Bailey, Timothy; Krajewski, Pawel; Ladunga, Istvan et al. (2013) Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput Biol 9:e1003326