Although genome-wide association studies (GWAS) have identified thousands of disease susceptibility loci, the underlying genetic structure in these regions has not been fully characterized, and it is likely that each GWAS signal originates from one or more as yet unidentified causal variants. In order to localize potential causal variant(s) for follow-up experiments, fine-mapping studies in large populations are underway. To date, fine-mapping studies have used standard approaches that fail to account for the full array of information currently available, such as associations with gene expression (eQTLs) and genomic functional annotation. With the advent of large-scale initiatives such as The Encyclopedia of DNA Elements (ENCODE) and The Cancer Genome Atlas (TCGA), it may be possible to add a layer of functional information to fine-mapping studies, enhancing the ability to localize causal variants. Here we propose to develop a statistical framework that will incorporate both functional and genetic information. We will build variant-specific priors based on cell-specific functional annotation (e.g. DNase I hypersensitive sites, protein coding), associations with tissue-specific gene expression and correlated phenotypes. We will capitalize on the publicly available ENCODE data to acquire functional annotation for each genetic variant. We will then estimate a posterior probability for each genetic variant based on its derived prior and the evidence for association with the outcome of interest. Such posterior probabilities can then be used to prioritize genetic variants for further follow-up in a laboratory setting. Compared to existing approaches, our proposed method is unique in that it will jointly model internal (e.g. sequencing and gene expression data) and external (e.g. ENCODE, TCGA) sources.
It will also allow for multiple causal variants in each region and assess all loci jointly, allowing the method to "borrow" information between regions. To ensure generalizability, we will conduct extensive simulation studies covering numerous possible scenarios. We will apply our method to a multi-ethnic breast cancer targeted sequencing dataset of 2,288 breast cancer cases and 2,323 controls for whom we have generated high-depth sequencing data for 12 GWAS-identified breast cancer regions. For a subset of these women, we also have mammographic density (n=1,000) and whole-genome expression data (n=250) in both normal and tumor tissue, allowing us to apply our method and jointly model empirical sequencing, gene expression and phenotype data. We have assembled a multi-disciplinary research team with a track record of producing high-profile publications in fine-mapping, statistical methods, breast cancer epidemiology and population genetics, as well as publicly available software packages for the genetics community. Our work has the potential to bridge the gap between initial screening for regions in the genome that are associated with disease and prioritizing specific variants for further functional analysis. Such methods will have important implications for understanding the underlying biology of disease, a major challenge in the post-GWAS era.
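To make the prior-to-posterior step concrete, the following is a minimal sketch and not the proposed framework itself. It assumes a single causal variant per region (the proposed method allows several), a logistic model linking binary annotations to prior log-odds, and a Wakefield-style approximate Bayes factor computed from per-variant z-scores. All function names, annotation weights, the intercept, the effect-size prior standard deviation and the example data are hypothetical.

```python
import numpy as np


def prior_log_odds(annotations, weights, intercept):
    """Prior log-odds that each variant is causal (assumed logistic model)."""
    return intercept + annotations @ weights


def log_approx_bayes_factors(z, se, prior_sd=0.2):
    """Log approximate Bayes factor (association vs. null) for each variant.

    prior_sd is an assumed standard deviation for the effect-size prior.
    """
    v = se ** 2                 # sampling variance of each effect estimate
    w = prior_sd ** 2           # prior variance of the true effect
    shrink = w / (v + w)
    return 0.5 * np.log(1.0 - shrink) + 0.5 * z ** 2 * shrink


def posterior_probabilities(z, se, log_odds):
    """Posterior probability of causality, assuming exactly one causal variant."""
    log_post = log_odds + log_approx_bayes_factors(z, se)
    log_post = log_post - log_post.max()    # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()


# Hypothetical region with four variants.
z = np.array([2.0, 4.5, 4.4, 1.0])          # association z-scores
se = np.full(4, 0.05)                       # standard errors of effect estimates
# Columns: DNase I hypersensitive site, protein coding (illustrative only).
annotations = np.array([[0, 0], [0, 0], [1, 0], [0, 1]], dtype=float)
weights = np.array([1.5, 1.0])              # assumed annotation effects
intercept = -4.0                            # assumed baseline log-odds

post = posterior_probabilities(z, se, prior_log_odds(annotations, weights, intercept))
ranking = np.argsort(-post)                 # variants ordered for follow-up
```

In this toy region the third variant has a slightly weaker association signal than the second, but its functional annotation raises its prior enough to place it first in the ranking. That is the kind of reprioritization the variant-specific priors are intended to produce.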
Genome-wide association studies (GWAS) have identified thousands of genetic regions involved in disease, but the specific causal genetic variants within each region remain unknown. Here we propose a novel statistical approach to fine-mapping that will prioritize plausible causal variants based on recent functional mapping of the genome and high-coverage sequencing data. Our methods will close the gap between initial screening of the genome and nominating specific, potentially causal genetic variants, one of the grand challenges in the post-GWAS era.
Kichaev, Gleb; Yang, Wen-Yun; Lindstrom, Sara et al. (2014) Integrating functional data to prioritize causal variants in statistical fine-mapping studies. PLoS Genet 10:e1004722