Statistical methods for improving reproducibility and utility of sequencing data

Li, Qunhua

Abstract

Sequencing-based assays have become the technology of choice for studying genome-wide protein-DNA interactions and chromatin states defined by histone modifications (both by ChIP-seq) as well as transcriptomes (RNA-seq). Despite their widespread use, many experimental and data-analytical challenges must still be overcome to reach reliable and reproducible biological interpretations of the data. The small sample size of each individual study further limits the power and reliability of data analyses. When replicate samples or similar samples from different studies are available, reproducibility across replicate samples informs us about the fidelity of the identifications, and it can potentially be used to detect reproducible signals that are too modest to be detected reliably in individual samples. We propose to develop a suite of new statistical methods that make use of the reproducibility information provided by the replicate samples to examine the quality of experiments, select reliable identifications, and optimize operational parameters in the experimental design.
Aim 1 will develop statistical methods to assess the reproducibility of identifications and to select identifications by their reproducibility in several sequencing-based analyses. The reproducibility-based selection criterion complements the usual measure of significance on a single sample, but has the benefit of being comparable across data sets, platforms and different measures of significance.
Aim 2 will develop a regression framework to assess how operational parameters in the experimental and data-analytical procedures affect the reproducibility of ChIP-seq and RNA-seq experiments. It will allow one to characterize the simultaneous and independent effects of covariates on the reproducibility of the assays and to compare the reproducibility of protocols while controlling for potential confounding variables.
Aim 3 will develop semi-parametric, rank-based meta-analysis methods for integrating RNA-seq-based transcriptome analyses from different sources. The proposed methods will take into account heterogeneity due to data sources, and they will incorporate the study goals in the meta-analysis.

Public Health Relevance

Susceptibilities to common diseases are determined not only by the environment but also by genetic variants around many genes. These variants tend to be in regulatory regions, which can be identified by using new sequencing-based assays. Our proposed statistical tools will improve the reliability of interpretations derived from these assays, and hence will increase the robustness of the molecular understanding of human diseases.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 4R01GM109453-04
Application #: 9113042
Study Section: Special Emphasis Panel (ZGM1)
Program Officer: Marcus, Stephen

Project Start: 2013-09-01
Project End: 2017-07-31
Budget Start: 2016-08-01
Budget End: 2017-07-31
Support Year: 4
Fiscal Year: 2016

Institution

Name: Pennsylvania State University
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 003403953

City: University Park
State: PA
Country: United States
Zip Code: 16802

Related projects


NIH 2020 R01 GM	Statistical methods for improving reproducibility and utility of chromatin interaction data Li, Qunhua / Pennsylvania State University
NIH 2019 R01 GM	Statistical methods for improving reproducibility and utility of chromatin interaction data Li, Qunhua / Pennsylvania State University
NIH 2016 R01 GM	Statistical methods for improving reproducibility and utility of sequencing data Li, Qunhua / Pennsylvania State University
NIH 2015 R01 GM	Statistical methods for improving reproducibility and utility of sequencing data Li, Qunhua / Pennsylvania State University
NIH 2014 R01 GM	Statistical methods for improving reproducibility and utility of sequencing data Li, Qunhua / Pennsylvania State University
NIH 2013 R01 GM	Statistical methods for improving reproducibility and utility of sequencing data Li, Qunhua / Pennsylvania State University	$271,736

Publications

Li, Qunhua; Zhang, Feipeng (2018) A regression framework for assessing covariate effects on the reproducibility of high-throughput experiments. Biometrics 74:803-813

Yang, Tao; Zhang, Feipeng; Yardımcı, Galip Gürkan et al. (2017) HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res 27:1939-1949

Zhang, Feipeng; Li, Qunhua (2017) Robust bent line regression. J Stat Plan Inference 185:41-55

Zhang, Feipeng; Li, Qunhua (2017) A Continuous Threshold Expectile Model. Comput Stat Data Anal 116:49-66

Charepalli, Venkata; Reddivari, Lavanya; Radhakrishnan, Sridhar et al. (2017) Pigs, Unlike Mice, Have Two Distinct Colonic Stem Cell Populations Similar to Humans That Respond to High-Calorie Diet prior to Insulin Resistance. Cancer Prev Res (Phila) 10:442-450

Lyu, Yafei; Li, Qunhua (2016) A semi-parametric statistical model for integrating gene expression profiles across different platforms. BMC Bioinformatics 17 Suppl 1:5

Song, C; Pan, X; Ge, Z et al. (2016) Epigenetic regulation of gene expression by Ikaros, HDAC1 and Casein Kinase II in leukemia. Leukemia 30:1436-40

Bailey, Timothy; Krajewski, Pawel; Ladunga, Istvan et al. (2013) Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput Biol 9:e1003326
