Sequencing-based assays have become the technology of choice for studying genome-wide protein-DNA interactions and chromatin states defined by histone modifications (both by ChIP-seq) as well as transcriptomes (RNA-seq). Despite their widespread use, many experimental and data-analytical challenges must still be overcome to reach reliable and reproducible biological interpretations of the data. The small sample size of each individual study further limits the power and reliability of data analyses. When replicate samples or similar samples from different studies are available, reproducibility across replicates informs us about the fidelity of the identifications, and it can potentially be used to detect reproducible signals that are too modest to be detected reliably in individual samples. We propose to develop a suite of new statistical methods that use the reproducibility information provided by replicate samples to examine the quality of experiments, select reliable identifications, and optimize operational parameters in the experimental design.
Aim 1 will develop statistical methods to assess the reproducibility of identifications and to select identifications by their reproducibility in several sequencing-based analyses. The reproducibility-based selection criterion complements the usual measure of significance on a single sample, but has the benefit of being comparable across data sets, platforms and different measures of significance.
Aim 2 will develop a regression framework to assess how operational parameters in the experimental and data analytical procedures affect the reproducibility of ChIP-seq and RNA-seq experiments. It will allow one to characterize the simultaneous and independent effects of covariates on reproducibility of the assays and to compare reproducibility of protocols while controlling for potential confounding variables.
Aim 3 will develop semi-parametric, rank-based meta-analysis methods for integrating RNA-seq-based transcriptome analyses from different sources. The proposed methods will take into account heterogeneity due to data sources, and they will incorporate the study goals in the meta-analysis.

Public Health Relevance

Susceptibilities to common diseases are determined not only by the environment but also by genetic variants around many genes. These variants tend to be in regulatory regions, which can be identified by using new sequencing-based assays. Our proposed statistical tools will improve the reliability of interpretations derived from these assays, and hence will increase the robustness of the resulting molecular understanding of human diseases.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section: Special Emphasis Panel (ZGM1)
Program Officer: Marcus, Stephen
Pennsylvania State University
Biostatistics & Other Math Sci
Schools of Arts and Sciences
University Park
United States
Li, Qunhua; Zhang, Feipeng (2018) A regression framework for assessing covariate effects on the reproducibility of high-throughput experiments. Biometrics 74:803-813
Yang, Tao; Zhang, Feipeng; Yardımcı, Galip Gürkan et al. (2017) HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res 27:1939-1949
Zhang, Feipeng; Li, Qunhua (2017) Robust bent line regression. J Stat Plan Inference 185:41-55
Zhang, Feipeng; Li, Qunhua (2017) A Continuous Threshold Expectile Model. Comput Stat Data Anal 116:49-66
Charepalli, Venkata; Reddivari, Lavanya; Radhakrishnan, Sridhar et al. (2017) Pigs, Unlike Mice, Have Two Distinct Colonic Stem Cell Populations Similar to Humans That Respond to High-Calorie Diet prior to Insulin Resistance. Cancer Prev Res (Phila) 10:442-450
Song, C; Pan, X; Ge, Z et al. (2016) Epigenetic regulation of gene expression by Ikaros, HDAC1 and Casein Kinase II in leukemia. Leukemia 30:1436-40
Lyu, Yafei; Li, Qunhua (2016) A semi-parametric statistical model for integrating gene expression profiles across different platforms. BMC Bioinformatics 17 Suppl 1:5
Bailey, Timothy; Krajewski, Pawel; Ladunga, Istvan et al. (2013) Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput Biol 9:e1003326