Sequencing-based assays have become the technology of choice for studying genome-wide protein-DNA interactions and chromatin states defined by histone modifications (both by ChIP-seq) as well as transcriptomes (RNA-seq). Despite their widespread use, many experimental and data-analytical challenges must still be overcome to reach reliable and reproducible biological interpretations of the data. The small sample size of each individual study further limits the power and reliability of data analyses. When replicate samples or similar samples from different studies are available, reproducibility across them informs us about the fidelity of the identifications, and it can potentially be used to detect reproducible signals that are too modest to be detected reliably in individual samples. We propose to develop a suite of new statistical methods that use the reproducibility information provided by replicate samples to examine the quality of experiments, select reliable identifications, and optimize operational parameters in the experimental design.
Aim 1 will develop statistical methods to assess the reproducibility of identifications and to select identifications by their reproducibility in several sequencing-based analyses. The reproducibility-based selection criterion complements the usual measure of significance on a single sample, but has the benefit of being comparable across data sets, platforms and different measures of significance.
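As a purely illustrative sketch, and not the methods proposed under this aim, the simplest rank-based notion of reproducibility is the overlap between the top-ranked identifications of two replicates. Because only ranks are used, the two replicates may be scored on entirely different scales, which is the sense in which such a criterion is comparable across measures of significance. The function name `top_k_overlap`, the peak labels, and all scores below are hypothetical.

```python
def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the top-k identifications shared by two replicates.

    scores_a, scores_b: dicts mapping an identification (e.g. a peak ID)
    to a score where higher means stronger. Only ranks are used, so the
    two score scales do not need to match.
    """
    # Compare only identifications present in both replicates;
    # sorting the IDs first makes tie-breaking deterministic.
    candidates = sorted(set(scores_a) & set(scores_b))
    k = min(k, len(candidates))
    if k <= 0:
        raise ValueError("need k >= 1 and at least one shared identification")
    top_a = set(sorted(candidates, key=lambda x: scores_a[x], reverse=True)[:k])
    top_b = set(sorted(candidates, key=lambda x: scores_b[x], reverse=True)[:k])
    return len(top_a & top_b) / k


# Hypothetical scores: replicate 1 on a signal scale, replicate 2 on a
# different (e.g. -log10 p-value) scale.
rep1 = {"peak1": 50.0, "peak2": 40.0, "peak3": 30.0,
        "peak4": 5.0, "peak5": 4.0, "peak6": 3.0}
rep2 = {"peak1": 9.1, "peak2": 7.5, "peak3": 8.2,
        "peak4": 0.2, "peak5": 0.9, "peak6": 0.5}

for k in (3, 4):
    print(k, top_k_overlap(rep1, rep2, k))
```

In this toy data the three strong peaks are recovered by both replicates, while the ordering of the weak peaks is inconsistent, so the overlap drops once the cutoff extends into the weak peaks. Tracking how overlap changes with the cutoff is one informal way to see where identifications stop being reproducible.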
Aim 2 will develop a regression framework to assess how operational parameters in the experimental and data-analytical procedures affect the reproducibility of ChIP-seq and RNA-seq experiments. It will allow one to characterize the simultaneous and independent effects of covariates on the reproducibility of the assays and to compare the reproducibility of protocols while controlling for potential confounding variables.
Aim 3 will develop semi-parametric, rank-based meta-analysis methods for integrating RNA-seq-based transcriptome analyses from different sources. The proposed methods will take into account heterogeneity due to data sources, and they will incorporate the study goals in the meta-analysis.
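To make the idea of rank-based integration concrete, the following minimal sketch averages within-study normalized ranks across studies. It is a deliberately naive stand-in and not the semi-parametric methods proposed here: it ignores study-specific heterogeneity beyond the difference in score scales and treats all studies equally. The function names, gene labels, and scores are hypothetical.

```python
def normalized_ranks(scores):
    """Map each gene to its rank within one study, scaled to (0, 1].

    scores: dict of gene -> score, higher meaning stronger evidence.
    The strongest gene gets 1/n; ties are broken by gene name.
    """
    ordered = sorted(scores, key=lambda g: (-scores[g], g))
    n = len(ordered)
    return {gene: (i + 1) / n for i, gene in enumerate(ordered)}


def mean_rank_aggregate(studies):
    """Average the normalized rank of each gene over the studies reporting it."""
    totals, counts = {}, {}
    for scores in studies:
        for gene, rank in normalized_ranks(scores).items():
            totals[gene] = totals.get(gene, 0.0) + rank
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: totals[gene] / counts[gene] for gene in totals}


# Hypothetical studies scored on different scales.
study_a = {"geneA": 12.0, "geneB": 3.0, "geneC": 7.5, "geneD": 0.4}
study_b = {"geneA": 0.95, "geneB": 0.2, "geneC": 0.9, "geneD": 0.1}

combined = mean_rank_aggregate([study_a, study_b])
for gene in sorted(combined, key=combined.get):
    print(gene, combined[gene])
```

Working on ranks sidesteps the fact that the two studies report scores on incomparable scales, which is the basic motivation for a rank-based approach; a fuller treatment would also model between-source heterogeneity, as the aim describes.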
Susceptibilities to common diseases are determined not only by the environment but also by genetic variants around many genes. These variants tend to be in regulatory regions, which can be identified by using new sequencing-based assays. Our proposed statistical tools will improve the reliability of interpretations derived from these assays, and hence will increase the robustness of the resulting molecular understanding of human diseases.