Next-generation sequencing (NGS) technologies have revolutionized the fields of genetics and genomics by allowing rapid and inexpensive sequencing of billions of bases. Although basic analysis tools for each individual data type are abundant, statistical methods that can integrate different sources of data to address key, challenging questions are lacking. We propose to develop integrative methods for critical, widely used applications that urgently require reliable statistical integration tools. At the core of our methods is the effective integration of multiple appropriate data types using novel statistical methods. First, although large numbers of protein-DNA interactions and histone modifications have been mapped to date, systematic methods that allow users to query these data and generate testable hypotheses are lacking. Second, in parallel with the generation of (epi)genomic profiles, genome-wide association studies (GWAS) have been successful at identifying disease- and trait-associated genetic variants (GVs). However, our ability to identify causal variants and elucidate the mechanisms by which genotypes influence phenotypes is hampered by significant obstacles. Third, although the utility of reads that map to multiple locations on the reference genome (multi-reads) has been well established for some NGS applications such as RNA-seq and ChIP-seq, all analysis methods for the emerging data type CLIP-seq, which interrogates RNA-binding proteins, rely on only the reads that map uniquely to the reference genome (uni-reads), leading to unreliable inference.
We plan to address these critical challenges by developing (1) fast and scalable integrative statistical methods for the joint analysis of multiple ChIP-seq datasets, enabling both inference at the level of individual datasets and identification of joint effects; (2) a statistical analysis framework for integrating GWAS results with the increasing number of genome-wide maps of functional annotations; and (3) an integrative multi-read mapping framework for studying RNA-protein interactions through CLIP-seq experiments. The project will be accomplished through a combination of methodological development, simulation, computational analysis, and experimental validation. Methods will be developed and evaluated using datasets from ENCODE and REMC as well as novel datasets from collaborators. Statistical resources generated by the project will be disseminated as publicly available software. Collectively, these aims will significantly improve the utility of the genome-wide data types that are available to researchers.
Large consortium projects have generated massive amounts of genetic and genomic data (e.g., on transcription factor binding, histone modifications, RNA-protein binding, and genome-wide association studies) that are currently under-utilized. This project seeks to enhance our knowledge of regulatory mechanisms at both the DNA and RNA levels by developing innovative, scalable methods for integrative analysis of multiple sources of genomic data.