Next-generation sequencing (NGS) technologies have revolutionized the fields of genetics and genomics by allowing rapid and inexpensive sequencing of billions of bases. Although basic analysis tools for each individual data type are abundant, statistical methods that can integrate different sources of data to address key, challenging questions are lacking. We propose to develop integrative methods for critical, widely used applications that urgently require reliable statistical integration tools. At the core of our methods is effective integration of multiple appropriate data types with novel statistical methods. First, although large numbers of protein-DNA interactions and histone modifications have been mapped to date, systematic methods that allow users to query these data and generate testable hypotheses are lacking. Second, in parallel with the generation of (epi)genomic profiles, genome-wide association studies (GWAS) have been successful at identifying disease- and trait-associated genetic variants (GVs). However, our ability to identify causal variants and elucidate the mechanisms by which genotypes influence phenotypes is hampered by significant obstacles. Third, although the utility of reads that map to multiple locations on the reference genome (multi-reads) has been well established for some NGS applications such as RNA-seq and ChIP-seq, all the analysis methods for the emerging data type CLIP-seq, which interrogates RNA-binding proteins, rely on using only reads that map uniquely to the reference genome (uni-reads), leading to unreliable inference.
We plan to address these critical challenges by developing (1) fast and scalable integrative statistical methods for joint analysis of multiple ChIP-seq datasets to enable both individual data level inference and identification of joint effects; (2) a statistical analysis framework for integrating GWAS results with the increasing numbers of genome-wide maps of functional annotations; and (3) an integrative multi-read mapping framework for studying RNA-protein interactions through CLIP-seq experiments. The projects will be accomplished through a combination of methodological development, simulation, computational analysis, and experimental validation. Methods will be developed and evaluated using datasets from ENCODE and REMC as well as novel datasets from collaborators. Statistical resources generated from the project will be disseminated in publicly available software. Collectively, these aims will significantly improve the utility of genome-wide data types that are available to researchers.

Public Health Relevance

Large consortium projects have generated massive amounts of genetic and genomic data (e.g., on transcription factor binding, histone modifications, RNA-protein binding, and genome-wide association studies) that are currently under-utilized. This project seeks to enhance our knowledge of regulatory mechanisms at both the DNA and RNA levels by developing innovative, scalable methods for integrative analysis of multiple sources of genomic data.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 2R01HG003747-08
Application #: 9176648
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Gilchrist, Daniel A
Project Start: 2007-04-26
Project End: 2020-06-30
Budget Start: 2016-09-02
Budget End: 2017-06-30
Support Year: 8
Fiscal Year: 2016
Total Cost: $324,963
Indirect Cost: $99,776

Name: University of Wisconsin Madison
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 161202122
City: Madison
State: WI
Country: United States
Zip Code: 53715

Zhang, Qi; Keles, Sündüz (2018) An empirical Bayes test for allelic-imbalance detection in ChIP-seq. Biostatistics 19:546-561
Zuo, Chandler; Chen, Kailei; Keleş, Sündüz (2017) A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets. J Comput Biol 24:472-485
Kim, TaeWon; Havighurst, Thomas; Kim, KyungMann et al. (2017) RNA-Binding Protein IGF2BP1 in Cutaneous Squamous Cell Carcinoma. J Invest Dermatol 137:772-775
Otlu, Burçak; Firtina, Can; Keles, Sündüz et al. (2017) GLANET: genomic loci annotation and enrichment tool. Bioinformatics 33:2818-2828
Welch, Rene; Chung, Dongjun; Grass, Jeffrey et al. (2017) Data exploration, quality control and statistical analysis of ChIP-exo/nexus experiments. Nucleic Acids Res 45:e145
Shin, Sunyoung; Keleş, Sündüz (2017) Annotation Regression for Genome-Wide Association Studies with an Application to Psychiatric Genomic Consortium Data. Stat Biosci 9:50-72
Kreimer, Anat; Zeng, Haoyang; Edwards, Matthew D et al. (2017) Predicting gene expression in massively parallel reporter assays: A comparative study. Hum Mutat 38:1240-1250
Zuo, Chandler; Chen, Kailei; Hewitt, Kyle J et al. (2016) A Hierarchical Framework for State-Space Matrix Inference and Clustering. Ann Appl Stat 10:1348-1372
Papale, Ligia A; Li, Sisi; Madrid, Andy et al. (2016) Sex-specific hippocampal 5-hydroxymethylcytosine is disrupted in response to acute stress. Neurobiol Dis 96:54-66
Li, Sisi; Papale, Ligia A; Zhang, Qi et al. (2016) Genome-wide alterations in hippocampal 5-hydroxymethylcytosine links plasticity genes to acute stress. Neurobiol Dis 86:99-108

Showing the most recent 10 out of 51 publications