Large-scale genetic data has become increasingly available, providing many clues about the genetic basis of human traits and diseases. However, ~90% of genome-wide association study (GWAS) signals are from common, noncoding variants. Such variants are difficult to interpret and connect to biological insight because (1) there are generally multiple possible causal genes at each locus and (2) linking noncoding variants to the genes they regulate is challenging, which obscures the biological processes involved. Many strategies for mitigating this problem have been proposed. One such strategy is applying algorithms that look for commonalities across loci, which can implicate biological processes through gene set enrichment analysis, pinpoint important epigenetic marks, and/or prioritize likely causal genes. A second strategy is assaying rare and low-frequency coding variants, which are in less linkage disequilibrium with nearby variants and therefore more directly pinpoint causal genes. To facilitate progress from genetic association to biological hypotheses, I propose extending and combining these strategies to develop several methods. Specifically, I will take DEPICT, a gene set enrichment analysis method developed by our lab for GWAS data, and adapt it for use with the ExomeChip (which genotypes rare and low-frequency coding variants). DEPICT has the particular advantage of using gene sets that have been extended via coexpression data to make predictions about the function of uncharacterized genes, so it is especially powerful as a tool for biological interpretation. DEPICT also prioritizes causal genes for GWAS based on the similarity of their gene set memberships. However, many other prioritization methods have been developed, and it is difficult to know which are the most accurate. I will therefore develop a method for rigorous comparison of existing prioritization strategies and use it to determine which are the most effective.
Finally, interpretation of whole-genome sequencing data is hampered by the challenge of linking noncoding variants to genes and by the large sample sizes needed to achieve sufficient power to detect associations. I will develop a gene-set-based approach that incorporates epigenetic information and will use it to improve power to detect associations between rare and low-frequency noncoding variants and anthropometric traits. Completion of this project will result in a number of useful tools for extracting biological insight from genetic association data that draw on the advantages of gene set enrichment analysis and the power of focusing on rare and low-frequency variation.

Public Health Relevance

Large-scale genetic studies have revealed much about the genetic basis of human traits and diseases. However, linking genetic associations to insights about biology is still challenging. I will develop tools that address this problem, focusing particularly on methods that assess groups of genes together, such as gene set enrichment analysis, and on the analysis of rare and low-frequency genetic variation.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Predoctoral Individual National Research Service Award (F31)
Project #
5F31HG009850-02
Application #
9545569
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Gatlin, Christine L
Project Start
2017-09-30
Project End
2020-08-31
Budget Start
2018-09-01
Budget End
2019-08-31
Support Year
2
Fiscal Year
2018
Total Cost
Indirect Cost
Name
Harvard Medical School
Department
Biology
Type
Schools of Medicine
DUNS #
047006379
City
Boston
State
MA
Country
United States
Zip Code
Turcot, Valérie (see original citation for additional authors) (2018) Protein-altering variants associated with body mass index implicate pathways that control energy intake and expenditure in obesity. Nat Genet 50:26-41