This proposal will develop a number of novel statistical tools for learning genotype-phenotype mappings from experimental data. Massive genotype-phenotype data sets can be generated by genetic diversification, followed by high-throughput screening/selection and next-generation DNA sequencing of functionally distinct populations. The resulting data present new and interesting statistical challenges, including large numbers of examples, presence-only responses, and noisy/missing data. Presence-only responses arise because most high-throughput screening/selection methods isolate only functional examples (positive responses), while non-functional examples (negatives) are difficult or impossible to obtain. The resulting data sets contain the initial unlabeled variant library and positive examples. The modeling tools developed in this proposal apply to all levels of biological organization, from molecules to ecosystems. The novel statistical methods developed in this proposal will model the relationships between protein sequence, structure, and function, with the goal of gaining insight into biochemical mechanisms and designing new and useful proteins. This proposal will (i) develop new theory and tools to analyze the large quantities of protein sequence-function data that are being generated by emerging high-throughput methods; (ii) address challenges associated with positive-unlabeled (PU) learning, extremely large data sets, and low-quality/missing data; and (iii) encode side information from existing databases or physical models. Furthermore, applying the methods and algorithms developed in this work will generate novel scientific insights and engineered biological systems.
A detailed understanding of the relationship between a protein's sequence and its biochemical properties would have a profound impact across all areas of biology, medicine, and biotechnology. This important capability would allow us to diagnose genetic diseases before they manifest symptoms and to design new protein therapeutics. The goal of this proposal is to develop new statistical tools for understanding the complex relationships between protein sequence, structure, and function.