Predicting phenotypes from DNA sequence variation is a major goal for genetics with potential applications in evolutionary biology, crop breeding, and public health. A central challenge in this task is separating genetic and environmental effects on phenotypes. In natural populations, breeding structure is often correlated with the environment across space such that different subpopulations experience different environments. For genome-wide association studies (GWAS), this creates a problem: genetic and environmental effects can be confounded by population structure, leading to inflated test statistics and low predictive power across populations (Bulik-Sullivan et al. 2015; Mathieson and McVean 2012). Understanding when association studies are biased by population stratification and creating better methods to correct for it are thus important challenges for population genetics over the next decade. To identify conditions under which existing methods of population stratification correction are subject to bias, and to develop robust new alternatives suitable for use with the continental-scale genomic datasets that are now routinely available for humans, we propose to use simulations and machine learning to separate the signals of fine-scale ancestry from polygenic phenotype association. In our first aim, we will develop simulations of polygenic phenotype evolution in continuous space and use the output to evaluate existing methods of stratification control, including linear mixed models, PC correction, and LD score regression. In this aim we will seek to identify the regions of parameter space (i.e., the strength of isolation by distance and the spatial distribution of environmental variation) in which existing methods can be expected to produce reliable effect size estimates, and establish guidelines for applications of GWAS to structured populations.
We will then train machine learning algorithms on real genotype data from humans and mosquitoes to describe continuous structure in large spatial samples using a variational autoencoder. This dimensionality reduction technique, based on deep neural networks, can take advantage of both allele frequency and haplotype-based measures of differentiation in a single analysis and thus offer improved control of stratification inflation in GWAS relative to the now-standard PCA regression approach. Last, we will apply deep learning techniques to the problem of linking phenotypes and genotypes in structured samples by training neural networks on simulated phenotypes and empirical genetic data. By training our networks on empirical genetic data and incorporating contextual information about surrounding haplotype structure into the model, our networks should learn to discriminate causal associations from false positives created by population structure in the sample cohort, which will improve performance when attempting to identify associations with the real phenotype. These methods will be applied to existing genomic datasets of height in humans, tested against the current state-of-the-art approaches, and packaged as scalable software for the broader scientific community.
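
The confounding described above, and the PC correction named as one of the existing methods, can be illustrated with a small toy example. The Python sketch below is not the proposed simulation framework. It assumes a hypothetical one-dimensional habitat, unlinked SNPs whose allele frequencies change linearly along it, and a phenotype driven only by an environmental gradient, so every association is a false positive. All function names and parameter values are illustrative assumptions.

```python
import numpy as np

MEDIAN_CHI2_1DF = 0.4549364  # median of a chi-square distribution with 1 d.f.

def simulate_cline(n=500, m=2000, seed=0):
    """Toy data: n individuals on a 1-D habitat, m unlinked SNPs (assumed model)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, n)                 # spatial position of each individual
    base = rng.uniform(0.2, 0.8, m)              # allele frequency at the habitat centre
    slope = rng.normal(0.0, 0.2, m)              # per-SNP frequency cline along x
    p = np.clip(base + np.outer(x - 0.5, slope), 0.05, 0.95)
    G = rng.binomial(2, p).astype(float)         # n x m genotype matrix (0/1/2)
    G = G[:, G.std(axis=0) > 0]                  # drop any monomorphic SNPs
    y = 2.0 * x + rng.normal(0.0, 1.0, n)        # environment-only phenotype, no causal SNPs
    return G, y

def assoc_stats(G, y, covars=None):
    """Per-SNP squared t statistics from linear regression, adjusting for covariates."""
    n = len(y)
    C = np.ones((n, 1)) if covars is None else np.column_stack([np.ones(n), covars])
    Q, _ = np.linalg.qr(C)                       # orthonormal basis for the covariates
    yr = y - Q @ (Q.T @ y)                       # residualise the phenotype
    Gr = G - Q @ (Q.T @ G)                       # residualise every SNP
    r = (Gr.T @ yr) / np.sqrt((Gr ** 2).sum(axis=0) * (yr ** 2).sum())
    df = n - C.shape[1] - 1                      # residual degrees of freedom
    return df * r ** 2 / (1.0 - r ** 2)          # approximately chi-square(1) under the null

def top_pcs(G, k=2):
    """Scores on the first k principal components of the standardised genotypes."""
    Z = (G - G.mean(axis=0)) / G.std(axis=0)
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] * S[:k]

def lambda_gc(stats):
    """Genomic inflation factor: observed median statistic over its null expectation."""
    return np.median(stats) / MEDIAN_CHI2_1DF

if __name__ == "__main__":
    G, y = simulate_cline()
    print("lambda, no correction:", lambda_gc(assoc_stats(G, y)))
    print("lambda, 2 PCs        :", lambda_gc(assoc_stats(G, y, top_pcs(G, 2))))
```

In this toy setting a single linear cline is easy for PCA to capture, so the inflation factor should fall toward 1 once PCs are included. The regimes of interest in the proposal are those where structure is finer-scale or the environment varies in ways that leading PCs do not track, which this sketch does not attempt to model.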

Public Health Relevance

Separating the signals of polygenic trait association and population structure has emerged as a major challenge for the interpretation of genome-wide association studies (GWAS). We propose to develop new simulations of populations evolving in continuous space that will allow us to rigorously benchmark existing methods of stratification control in GWAS while fully controlling the underlying demographic and selective process. We will then apply deep learning techniques to develop (1) a new method of dimensionality reduction to test as a covariate for ancestry in GWAS, and (2) a neural network that identifies genotype-phenotype connections while controlling for population structure in the sample cohort.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Postdoctoral Individual National Research Service Award (F32)
Project #
1F32GM136123-01
Application #
9910009
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Sakalian, Michael
Project Start
2020-05-01
Project End
2022-04-30
Budget Start
2020-05-01
Budget End
2021-04-30
Support Year
1
Fiscal Year
2020
Name
University of Oregon
Department
Biology
Type
Organized Research Units
City
Eugene
State
OR
Country
United States
Zip Code
97403