How does the DNA sequence of an organism (genotype) determine its form and function (phenotype)? New technologies such as massively parallel reporter assays (MPRAs), deep mutational scanning, and combinatorial CRISPR screens have the potential to expose the genotype-phenotype relationship at an unprecedented level of detail by measuring phenotypes for tens of thousands to millions of genotypes in a single experiment. However, interpreting the results of these experiments is difficult because the space of genotypes is intrinsically high-dimensional and combinations of mutations often interact in complicated ways. My research program is focused on developing new computational tools to analyze data from these high-throughput experiments, with the goals of (1) identifying the major qualitative features of the genotype-phenotype relationship in specific biological systems, (2) explaining how these qualitative features arise from underlying developmental, cell biological and biophysical mechanisms, (3) being able to accurately predict the phenotypes of unmeasured genotypes, and (4) quantifying the uncertainty in these predictions. My primary research objective over the next five years is to develop new computational and statistical techniques capable of capturing higher-order epistasis, that is, genetic interactions that occur between three or more mutations. Although contemporary high-throughput mutagenesis experiments reveal that these higher-order interactions are extremely prevalent, we currently lack general, principled statistical models capable of modeling such interactions. My research group is currently developing two different, but related, methods for modeling these interactions.
While both methods display state-of-the-art predictive performance on smaller datasets with tens to hundreds of thousands of genotypes, substantial work remains to adapt these methods to the scale of the largest available datasets, which contain measurements for millions of genotypes. In the coming years, we plan to build these methods into an integrated framework for analyzing complex genetic interactions, complete with quantification of uncertainty, tools for biological interpretation and exploratory data analysis, and practical software that can be used and interpreted by both computational biologists and experimentalists. High-throughput mutagenesis experiments have the potential to transform molecular biology by providing a general-purpose tool for interrogating the genotype-phenotype relationship of an arbitrary genetic element. Important applications include mapping adaptive paths to immune escape and drug resistance variants in infectious disease, designing improved antibodies and enzymes, and genomic variant interpretation. Development of the computational tools proposed here will further these goals by providing a principled and functional framework for understanding the complex genetic interactions revealed in these experiments.
Mutations often interact with each other, so that knowing the effects of individual mutations is not enough to predict how they will combine. This unpredictability makes it difficult to prevent the emergence of antibiotic resistance and immune evasion in infectious disease, complicates efforts to assess the pathogenicity of variants in human genome sequences, and hampers our ability to develop new protein-based therapeutics, antibodies, and enzymes. This project will improve our ability to predict how mutations interact by developing new computational tools to model the combined impact of multiple mutations based on data from high-throughput experiments.