This proposal will develop novel statistical tools for learning genotype-phenotype mappings from experimental data. Massive genotype-phenotype data sets can be generated by genetic diversification, followed by high-throughput screening/selection and next-generation DNA sequencing of functionally distinct populations. The resulting data present new and interesting statistical challenges, including large numbers of examples, presence-only responses, and noisy/missing data. Presence-only responses arise because most high-throughput screening/selection methods isolate only functional examples (positive responses), while non-functional examples (negatives) are difficult or impossible to obtain. The resulting data sets therefore contain only the initial unlabeled variant library and positive examples. The modeling tools developed in this proposal apply to all levels of biological organization, from molecules to ecosystems. The novel statistical methods developed in this proposal will model the relationships between protein sequence, structure, and function, with the goal of gaining insight into biochemical mechanisms and designing new and useful proteins. This proposal will (i) develop new theory and tools to analyze the large quantities of protein sequence-function data that are being generated by emerging high-throughput methods; (ii) address challenges associated with positive-unlabeled (PU) learning, extremely large data size, and low-quality/missing data; and (iii) encode side information from existing databases or physical models. Furthermore, applying the methods and algorithms developed in this work will generate novel scientific insights and engineered biological systems.

Public Health Relevance

A detailed understanding of the relationship between a protein's sequence and its biochemical properties would have a profound impact across all areas of biology, medicine, and biotechnology. This capability would allow us to diagnose genetic diseases before symptoms manifest and to design new protein therapeutics. The goal of this proposal is to develop new statistical tools for understanding the complex relationships between protein sequence, structure, and function.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
1R01GM131381-01
Application #
9669190
Study Section
Special Emphasis Panel (ZGM1)
Program Officer
Brazhnik, Paul
Project Start
2018-09-01
Project End
2021-06-30
Budget Start
2018-09-01
Budget End
2019-06-30
Support Year
1
Fiscal Year
2018
Total Cost
Indirect Cost
Name
University of Wisconsin-Madison
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
161202122
City
Madison
State
WI
Country
United States
Zip Code
53715