The revolution in genome sequencing technologies over the past 15 years has created an explosion of population genomic data but has left in its wake a gap in our ability to make sense of data at this scale. In particular, whereas population genetics as a field has traditionally been data-limited, the massive volume of current sequencing means that previously unanswerable questions may now be within reach. To capitalize on this flood of information, we need new methods and modes of analysis. In the past 5 years the world of machine learning has been revolutionized by the rise of deep neural networks. These so-called deep learning methods offer incredible flexibility as well as astounding improvements in performance for a wide array of machine learning tasks, including computer vision, speech recognition, and natural language processing. This proposal aims to harness the great potential of deep learning for population genetic inference. In recent years our group has made great strides in using supervised machine learning for population genomic analysis (reviewed in Schrider and Kern 2018). However, this work has focused primarily on more traditional machine learning methods such as random forests. As we argue in this proposal, DNA sequence data are particularly well suited to modern deep learning techniques, and we demonstrate that the application of these methods can rapidly lead to state-of-the-art performance in very difficult population genetic tasks such as estimating rates of recombination. The power of these methods for handling genetic data stems in part from their ability to automatically learn to extract as much useful information as possible from an alignment of DNA sequences in order to solve the task at hand, rather than relying on one or more predefined summary statistics, which are generally problem-specific and may omit information present in the raw data.
In this proposal we lay out a systematic approach for both empowering the field with these tools and understanding their shortcomings. In particular, we propose to design deep neural networks for solving population genetic problems, and incorporate successful networks into user-friendly software tools that will be shared with the community. We will also investigate a variety of methods for estimating the uncertainty of predictions produced by deep learning methods; this area is understudied in machine learning but of great importance to biological researchers who require an accurate measure of the degree of uncertainty surrounding an estimate. Finally, we will explore the impact of training data misspecification, wherein the data used to train a machine learning method differ systematically from the data to which it will be applied in practice. We will devise techniques to mitigate the impact of such misspecification in order to ensure that our tools will be robust to the complications inherent in analyzing real genomic data sets. Together, these advances have the potential to transform the methodological landscape of population genetic inference.

Public Health Relevance

Deep learning has revolutionized such disparate fields as computer vision, natural language processing, and speech recognition. In this proposal we aim to harness the great potential of deep learning for population genetic inference. We will design, implement, and apply novel deep learning methods and provide open source software for others to both use and build upon, thereby producing valuable tools for genetics researchers at large.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
1R01HG010774-01A1
Application #
9976348
Study Section
Genetic Variation and Evolution Study Section (GVE)
Program Officer
Sofia, Heidi J
Project Start
2020-04-21
Project End
2024-02-28
Budget Start
2020-04-21
Budget End
2021-02-28
Support Year
1
Fiscal Year
2020
Name
University of Oregon
Department
Biology
Type
Organized Research Units
City
Eugene
State
OR
Country
United States
Zip Code
97403