Novel Methods for Large Scale Presence only Data in Biological Systems Engineering

Raskutti, Garvesh

Abstract

This proposal will develop a number of novel statistical tools for learning genotype-phenotype mappings from experimental data. Massive genotype-phenotype data sets can be generated by genetic diversification, followed by high-throughput screening/selection and next-generation DNA sequencing of functionally distinct populations. The resulting data present new and interesting statistical challenges, including large numbers of examples, presence-only responses, and noisy/missing data. Presence-only responses arise because most high-throughput screening/selection methods isolate only functional examples (positive responses), while non-functional examples (negatives) are difficult or impossible to obtain. The resulting data sets contain the initial unlabeled variant library and positive examples. The modeling tools developed in this proposal apply to all levels of biological organization, from molecules to ecosystems. The novel statistical methods developed in this proposal will model the relationships between protein sequence, structure, and function, with the goal of gaining insight into biochemical mechanisms and designing new and useful proteins. This proposal will (i) develop new theory and tools to analyze the large quantities of protein sequence-function data that are being generated by emerging high-throughput methods; (ii) address challenges associated with positive-unlabeled (PU) learning, extremely large data size, and low-quality/missing data; and (iii) encode side information from existing databases or physical models. Furthermore, applying the methods and algorithms developed in this work will generate novel scientific insights and engineered biological systems.
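To make the presence-only setting concrete, the following is a minimal sketch and not the proposal's method. It assumes a toy data set of variant labels, one sample from the unlabeled library and one from the selected (positive) population. By Bayes' rule, P(functional | x) = pi * P(x | functional) / P(x), where pi is the overall fraction of functional variants, so the frequency ratio between the two samples identifies the functional probability only up to the unknown constant pi. The function names, the toy variants "A", "B", "C", and the value of pi are all illustrative assumptions.

```python
from collections import Counter


def enrichment_scores(library, positives):
    """Frequency of each variant among positives divided by its frequency
    in the unlabeled library. Variants absent from the library are skipped."""
    lib_counts = Counter(library)
    pos_counts = Counter(positives)
    n_lib, n_pos = len(library), len(positives)
    scores = {}
    for variant, lib_n in lib_counts.items():
        pos_freq = pos_counts.get(variant, 0) / n_pos
        lib_freq = lib_n / n_lib
        scores[variant] = pos_freq / lib_freq
    return scores


def functional_probability(scores, prior):
    """Rescale enrichment scores by an ASSUMED class prior pi = P(functional).
    The prior is not identifiable from presence-only data alone."""
    return {variant: min(1.0, prior * s) for variant, s in scores.items()}


# Hypothetical toy data: an unlabeled library sample and a selected sample.
library = ["A"] * 4 + ["B"] * 4 + ["C"] * 2
positives = ["A"] * 3 + ["B"] * 1

scores = enrichment_scores(library, positives)
# Assumed prior: 40% of library variants are functional.
probs = functional_probability(scores, prior=0.4)

for variant in sorted(scores):
    print(variant, round(scores[variant], 3), round(probs[variant], 3))
```

Variant "C" never appears among the positives, so its score is zero, but in real data that could reflect either non-function or limited sequencing depth. This ambiguity is one reason presence-only data call for dedicated statistical methods rather than treating unlabeled examples as negatives.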

Public Health Relevance

A detailed understanding of the relationship between a protein's sequence and its biochemical properties would have a profound impact across all areas of biology, medicine, and biotechnology. This capability would allow us to diagnose genetic diseases before they manifest symptoms and to design new protein therapeutics. The goal of this proposal is to develop new statistical tools for understanding the complex relationships between protein sequence, structure, and function.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 1R01GM131381-01
Application #: 9669190
Study Section: Special Emphasis Panel (ZGM1)
Program Officer: Brazhnik, Paul

Project Start: 2018-09-01
Project End: 2021-06-30
Budget Start: 2018-09-01
Budget End: 2019-06-30
Support Year: 1
Fiscal Year: 2018

Institution

Name: University of Wisconsin Madison
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 161202122

City: Madison
State: WI
Country: United States
Zip Code: 53715

Related projects


NIH 2020 R01 GM	Novel Methods for Large Scale Presence only Data in Biological Systems Engineering Raskutti, Garvesh / University of Wisconsin Madison
NIH 2019 R01 GM	Novel Methods for Large Scale Presence only Data in Biological Systems Engineering Raskutti, Garvesh / University of Wisconsin Madison
NIH 2018 R01 GM	Novel Methods for Large Scale Presence only Data in Biological Systems Engineering Raskutti, Garvesh / University of Wisconsin Madison
