Large-scale datasets such as those generated by GTEx, the Roadmap Epigenomics Consortium, and the ENCODE project are valuable new resources for understanding the genetic basis of disease. We now have data on gene expression and on many functional elements, such as histone modifications and DNase I hypersensitive sites (DHSs), in a variety of human cell types and tissues. Analysis of these datasets, together with data from genome-wide association studies (GWAS), has the potential to lead to breakthroughs in our understanding of the causes of disease. While statistical and computational methods for integrative analysis of these datasets with GWAS datasets have already led to many interesting advances, there is a great need for further methodological progress to translate this abundance of data into concrete mechanistic insights. We will focus on the fundamental problem of identifying disease-relevant cell types and tissues via integrative analysis of these datasets. Our work is motivated by the fact that the substantial majority of disease heritability lies in non-coding regions, and that regulatory elements often exhibit strong cell-type specificity. Thus, to understand the mechanistic consequences of genetic variation by either computational or experimental means, we need to identify the cell types and tissues in which the relevant processes are occurring. While these are known for some complex phenotypes, they are uncertain or unknown for many; for example, while it is known that schizophrenia is a brain disease, recent evidence indicates that the complement system is involved in schizophrenia pathogenesis through its role in synaptic pruning, and the relevant cell types remain unresolved. Despite the importance of this problem, the development of a powerful method for identifying disease-relevant cell types and tissues from GWAS data remains an open problem.
Our approach will have two components. First, we will develop methods for using genetic data to assess whether a given genomic annotation (i.e., a subset of the genome) is important for the phenotype we are studying. We will build on a method we previously developed for enrichment analysis that powerfully leverages polygenic signal, extending it so that it can analyze rare variant data, combine signal from multiple sources of data about a single cell type/tissue, and investigate shared cell types/tissues across traits. Second, we will use gene expression data and functional genomics data to construct, for each candidate cell type/tissue, genomic annotations that are maximally informative about cell-type-specific activity. We will begin by using specifically expressed genes, which have not been fully leveraged in this context, and we will also develop new methods for constructing maximally informative genomic annotations from chromatin data such as those available from Roadmap. We will continue our practice of releasing open-source, user-friendly software and data. Together, our new methods and annotations will allow for powerful identification of disease-relevant cell types and tissues from GWAS data, functional genomic data, and gene expression data.
Integrative analysis of GWAS data, gene expression data, and functional genomics data provides an exciting avenue for understanding common diseases, but more sophisticated computational and statistical techniques are needed to analyze these data. I will focus on developing methods that use these data to identify disease-relevant cell types and tissues, a necessary first step toward understanding the molecular mechanisms of disease.