We will investigate potential disease-associated genetic variants in the non-coding regions of the human genome. Recent work in the ENCODE project and in population-scale RNA sequencing has contributed significantly to our knowledge of non-coding elements. Because many previous disease studies focused on coding variation, there is much untapped potential in exploring the non-coding variation associated with disease. We plan to prioritize rare, germline non-coding variants for connection to disease, using a generalized framework that we will tune specifically to prostate cancer as a test case. Our approach will build upon our existing tool, FunSeq, which prioritizes rare somatic variants in cancer, to create eleVAR (elevating germline VARiants). FunSeq was developed to prioritize somatic variants in regions of the genome depleted of common variants in the general population, based on data from the 1000 Genomes Project. eleVAR will apply this general principle to germline variants and build upon it by adding several key features, including: (i) prioritizing variants leading to gain of new transcription-factor (TF) binding sites (in addition to disruption of existing sites), (ii) annotating variants in enhancers and connecting them to target genes, (iii) prioritizing variants highly connected in a variety of biological networks, (iv) annotating variants in non-coding RNAs similarly to those in TF binding sites, and (v) prioritizing variants associated with variable, allele-specific activity. Our second objective is to use eleVAR to prioritize variants in whole-genome sequences from the TCGA/ICGC consortium. Our efficient implementation of eleVAR will include a module for updating parameters in response to high-throughput experimental data.
We will progressively tune and evaluate eleVAR, first using publicly available data, and then using multiple rounds of high-throughput experimental characterization of variants occurring specifically in prostate cancer. Our last objective is to functionally validate a subset of variants in detail. First, we will identify variants in the 6 representative eleVAR positives and look at their frequency of occurrence in a large prostate cancer cohort using targeted re-sequencing. We will then use the CRISPR/Cas system to generate endogenous mutations and determine their effects on target gene expression, on cell morphology and tumorigenicity, and on TF binding (by EMSA and chromatin immunoprecipitation).

Public Health Relevance

We plan to prioritize rare, germline variants associated with disease for functional impact, using prostate cancer as a test case. We will focus on variants in non-coding regions, a category of variant underrepresented in previous studies. Drawing on a range of genomics data, our eleVAR pipeline will prioritize variants for validation.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Special Emphasis Panel (ZHG1-HGR-M (J1))
Program Officer
Pazin, Michael J
Yale University
Schools of Medicine
New Haven
United States