In this proposal, we develop Bayesian methodology for high-dimensional genomic data. The overarching theme is the development of several novel statistical methods for motif discovery in genomic sequence data.

Chromatin immunoprecipitation microarray (ChIP-chip) data allow the direct identification of transcription factor binding sites that are active in particular biological states. Jointly modeling array intensities and DNA sequence will lead to more accurate estimation of binding sites. We develop these joint models to account for multiple motifs and varied relationships between binding sites and array intensities. We also propose a novel joint model framework for direct estimation of a motif using gene expression and the DNA sequence, which bypasses computationally expensive motif selection procedures.

Chromatin structure, in the form of the positioning of nucleosomes in DNA, has long been known to play a major role in protein-DNA binding; however, a quantitative assessment of this role has not been available until very recently. Taking advantage of the increasing availability of accurate experimental data assessing chromatin features, we propose a novel Bayesian statistical model framework for improving motif detection through integration of nucleosome positioning and genomic sequence data.

Alternative splicing of mRNA greatly expands the functional repertoire of many genes in the mammalian genome by including or excluding the exons making up the genetic coding sequence. Standard gene expression arrays fail to capture the variability of the exon composition of mRNA species and instead give a crude measure of overall gene expression. We propose a method that detects over-representation of specific splice junctions in different biological states while adjusting for overall gene expression.

The advent of high-throughput genomic technologies has ushered in a new data-driven era, making it possible to measure biological activity on a genome-wide scale. Chromatin immunoprecipitation (ChIP), histone modification, and FAIRE, for example, are procedures that have benefited from these technologies, allowing relative enrichment of their isolated fragments to be determined genome-wide. The recent development of next-generation sequencing (NGS) platforms offers greater dynamic range, resolution, and genomic coverage in measuring relative enrichment of DNA fragments compared to microarrays. We develop classes of statistical mixture models based on the zero-inflated negative binomial distribution to model such count data, and we develop an R software package called Zero-Inflated Negative Binomial Algorithm (ZINBA) to carry out peak calling for a given dataset.
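For readers unfamiliar with the zero-inflated negative binomial distribution named above, the following is a minimal illustrative sketch of its probability mass function. It is not taken from ZINBA. The mean/dispersion parameterization and the names `nb_log_pmf`, `zinb_pmf`, `pi_zero`, `mu`, and `size` are assumptions chosen for illustration only.

```python
import math


def nb_log_pmf(k, mu, size):
    """Log pmf of a negative binomial with mean mu > 0 and dispersion size > 0.

    Assumed parameterization: variance = mu + mu**2 / size.
    """
    return (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )


def zinb_pmf(k, pi_zero, mu, size):
    """Pmf of a zero-inflated negative binomial.

    With probability pi_zero the count is a structural zero; otherwise it is
    drawn from the negative binomial component.
    """
    nb = math.exp(nb_log_pmf(k, mu, size))
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * nb
    return (1.0 - pi_zero) * nb


# Hypothetical parameter values, chosen only to exercise the functions.
probs = [zinb_pmf(k, pi_zero=0.3, mu=3.0, size=2.0) for k in range(500)]
total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))
```

Under this assumed parameterization the extra mass at zero lowers the mean from `mu` to `(1 - pi_zero) * mu`, which is the kind of excess-zero behavior such mixture models are meant to capture in sparse read-count data.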

Public Health Relevance

We develop Bayesian methodology for high-dimensional genomic data. The overarching theme of this proposal is the development of several novel statistical methods for motif discovery in genomic sequence data. The proposed methodology has major applications in chronic diseases such as cancer, AIDS, and cardiovascular disease, and in environmental health. We will develop new statistical methods for ChIP-chip data, integration of chromatin structure into motif discovery, joint modeling of gene expression and sequence data, alternative mRNA splicing, and analysis of next-generation sequencing (NGS) data.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM070335-15
Application #: 8523908
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Brazhnik, Paul
Project Start: 1996-03-01
Project End: 2015-08-31
Budget Start: 2013-09-01
Budget End: 2014-08-31
Support Year: 15
Fiscal Year: 2013
Total Cost: $272,896
Indirect Cost: $48,257

Name: University of North Carolina Chapel Hill
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health
DUNS #: 608195277
City: Chapel Hill
State: NC
Country: United States
Zip Code: 27599

Zhang, Danjie; Chen, Ming-Hui; Ibrahim, Joseph G et al. (2016) JMFit: A SAS Macro for Joint Models of Longitudinal and Survival Data. J Stat Softw 71:
Joeng, Hee-Koung; Chen, Ming-Hui; Kang, Sangwook (2016) Proportional exponentiated link transformed hazards (ELTH) models for discrete time survival data with application. Lifetime Data Anal 22:38-62
Rao, Shangbang; Ibrahim, Joseph G; Cheng, Jian et al. (2016) SR-HARDI: Spatially Regularizing High Angular Resolution Diffusion Imaging. J Comput Graph Stat 25:1195-1211
Schifano, Elizabeth D; Wu, Jing; Wang, Chun et al. (2016) Online Updating of Statistical Inference in the Big Data Setting. Technometrics 58:393-403
Wang, Chun; Chen, Ming-Hui; Schifano, Elizabeth et al. (2016) Statistical methods and computing for big data. Stat Interface 9:399-414
Wang, Wenjie; Chen, Ming-Hui; Chiou, Sy Han et al. (2016) Onset of persistent Pseudomonas aeruginosa infection in children with cystic fibrosis with interval censored data. BMC Med Res Methodol 16:122
Zhu, Hongtu; Ibrahim, Joseph G; Chen, Ming-Hui (2015) Diagnostic Measures for the Cox Regression Model with Missing Covariates. Biometrika 102:907-923
Sinha, Arijit; Chi, Zhiyi; Chen, Ming-Hui (2015) Bayesian Inference of Hidden Gamma Wear Process Model for Survival Data with Ties. Stat Sin 25:1613-1635
Yu, Fang; Chen, Ming-Hui; Kuo, Lynn et al. (2015) Confident difference criterion: a new Bayesian differentially expressed gene selection algorithm with applications. BMC Bioinformatics 16:245
M'lan, Cyr Emile; Chen, Ming-Hui (2015) Objective Bayesian Inference for Bilateral Data. Bayesian Anal 10:139-170

Showing the most recent 10 out of 102 publications