High dimensional statistical data integration for studying regulatory variation

Keles, Sunduz

Abstract

Next generation sequencing (NGS) technologies have revolutionized the fields of genetics and genomics by allowing rapid and inexpensive sequencing of billions of bases. Although basic analysis tools for each individual data type are abundant, statistical methods that can integrate different sources of data to address key, challenging questions are lacking. We propose to develop integrative methods for critical, widely used applications that urgently require reliable statistical integration tools. At the core of our methods is the effective integration of multiple appropriate data types with novel statistical methods. First, although large numbers of protein-DNA interactions and histone modifications have been mapped to date, systematic methods that allow users to query these data and generate testable hypotheses are lacking. Second, in parallel with the generation of (epi)genomic profiles, genome-wide association studies (GWAS) have been successful at identifying disease- and trait-associated genetic variants (GVs). However, our ability to identify causal variants and elucidate the mechanisms by which genotypes influence phenotypes is hampered by significant obstacles. Third, although the utility of reads that map to multiple locations on the reference genome (multi-reads) has been well established for some NGS applications such as RNA-seq and ChIP-seq, all analysis methods for the emerging data type CLIP-seq, which interrogates RNA-binding proteins, rely on only reads that map uniquely to the reference genome (uni-reads), leading to unreliable inference.
We plan to address these critical challenges by developing (1) fast and scalable integrative statistical methods for joint analysis of multiple ChIP-seq datasets to enable both individual data level inference and identification of joint effects; (2) a statistical analysis framework for integrating GWAS results with the increasing numbers of genome-wide maps of functional annotations; and (3) an integrative multi-read mapping framework for studying RNA-protein interactions through CLIP-seq experiments. The projects will be accomplished through a combination of methodological development, simulation, computational analysis, and experimental validation. Methods will be developed and evaluated using datasets from ENCODE and REMC as well as novel datasets from collaborators. Statistical resources generated from the project will be disseminated in publicly available software. Collectively, these aims will significantly improve the utility of genome-wide data types that are available to researchers.

Public Health Relevance

Large consortium projects have generated massive amounts of genetic and genomic data (e.g., on transcription factor binding, histone modifications, RNA-protein binding, and genome-wide association studies) that are currently under-utilized. This project seeks to enhance our knowledge of regulatory mechanisms at both the DNA and RNA levels by developing innovative, scalable methods for integrative analysis of multiple sources of genomic data.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG003747-10
Application #: 9524821
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Gilchrist, Daniel A

Project Start: 2007-04-26
Project End: 2020-06-30
Budget Start: 2018-07-01
Budget End: 2019-06-30
Support Year: 10
Fiscal Year: 2018

Institution

Name: University of Wisconsin-Madison
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 161202122

City: Madison
State: WI
Country: United States
Zip Code: 53715

Publications

Zhang, Qi; Keles, Sündüz (2018) An empirical Bayes test for allelic-imbalance detection in ChIP-seq. Biostatistics 19:546-561

Zuo, Chandler; Chen, Kailei; Keleş, Sündüz (2017) A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets. J Comput Biol 24:472-485

Kim, TaeWon; Havighurst, Thomas; Kim, KyungMann et al. (2017) RNA-Binding Protein IGF2BP1 in Cutaneous Squamous Cell Carcinoma. J Invest Dermatol 137:772-775

Otlu, Burçak; Firtina, Can; Keles, Sündüz et al. (2017) GLANET: genomic loci annotation and enrichment tool. Bioinformatics 33:2818-2828

Welch, Rene; Chung, Dongjun; Grass, Jeffrey et al. (2017) Data exploration, quality control and statistical analysis of ChIP-exo/nexus experiments. Nucleic Acids Res 45:e145

Shin, Sunyoung; Keleş, Sündüz (2017) Annotation Regression for Genome-Wide Association Studies with an Application to Psychiatric Genomic Consortium Data. Stat Biosci 9:50-72

Kreimer, Anat; Zeng, Haoyang; Edwards, Matthew D et al. (2017) Predicting gene expression in massively parallel reporter assays: A comparative study. Hum Mutat 38:1240-1250

Zuo, Chandler; Chen, Kailei; Hewitt, Kyle J et al. (2016) A Hierarchical Framework for State-Space Matrix Inference and Clustering. Ann Appl Stat 10:1348-1372

Papale, Ligia A; Li, Sisi; Madrid, Andy et al. (2016) Sex-specific hippocampal 5-hydroxymethylcytosine is disrupted in response to acute stress. Neurobiol Dis 96:54-66

Li, Sisi; Papale, Ligia A; Zhang, Qi et al. (2016) Genome-wide alterations in hippocampal 5-hydroxymethylcytosine links plasticity genes to acute stress. Neurobiol Dis 86:99-108

Showing the most recent 10 out of 51 publications
