Data-Driven Statistical Learning with Applications to Genomics

Simon, Noah

Abstract

This project involves the development of statistical and computational methods for the analysis of high-throughput biological data. Effective methods for analyzing these data must balance two opposing ideals. They must be (a) flexible and sufficiently data-adaptive to deal with the data's complex structure, yet (b) sufficiently simple and transparent that their results can be interpreted and their uncertainty analyzed (so as not to mislead with conviction). This is additionally challenging because these datasets are massive, so attacking these problems requires a marriage of statistical and computational ideas. This project develops frameworks for attacking several problems involving these biological data. These frameworks balance flexibility and simplicity and are computationally tractable even on massive datasets. This application has three specific aims.
Aim 1: A flexible and computationally tractable framework for building predictive models. Commonly we are interested in modelling phenotypic traits of an individual using omics data. We would like to find a small subset of genetic features that are important in phenotype expression. In this aim, I propose a method for flexibly modelling a response variable (e.g. phenotype) with a small, adaptively chosen subset of features, in a computationally scalable fashion.
Aim 2: A framework for jointly identifying and testing regions that differ across conditions. For example, in the context of methylation data measured in normal and cancer tissue samples, one might expect that some regions are more methylated in one tissue type than in the other. These regions might suggest targets for therapy. However, we do not have the background biological knowledge to pre-specify regions to test. I propose an approach that adaptively selects regions and then tests them in a principled way. This approach is based on a convex formulation of the problem, using shrinkage to achieve sparse differences.
Aim 3: A principled framework for developing and evaluating predictive biomarkers during clinical trials. Modern treatments target specific genetic abnormalities that are generally present in only a subset of patients with a disease. A major current goal in medicine is to develop biomarkers that identify those patients likely to benefit from treatment. I propose a framework for developing and testing biomarkers during large-scale clinical trials. This framework simultaneously builds these biomarkers and applies them to restrict enrollment into the trial to only those likely to benefit from treatment. The statistical tools that result from the proposed research will be implemented in freely available software.
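As a rough illustration of the phrase "using shrinkage to achieve sparse differences" in Aim 2, the sketch below applies soft-thresholding to per-site differences in mean between two conditions. Soft-thresholding is the closed-form solution of a simple convex (lasso-type) problem, so it may convey the general idea. It is not the method proposed in this project: the function names, the tuning parameter `lam`, and the toy data are all assumptions made for illustration, and the region selection and testing steps described in the aim are not shown.

```python
def soft_threshold(value, lam):
    """Solve min over d of 0.5 * (d - value)**2 + lam * abs(d).

    Differences no larger than lam in magnitude are set exactly to zero;
    larger ones are shrunk toward zero by lam.
    """
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def sparse_differences(normal, tumor, lam):
    """Soft-thresholded per-site difference in mean (tumor minus normal).

    normal, tumor: lists of samples, each a list of per-site measurements.
    lam: hypothetical tuning parameter controlling how sparse the result is.
    """
    n_sites = len(normal[0])
    diffs = []
    for j in range(n_sites):
        mean_normal = sum(sample[j] for sample in normal) / len(normal)
        mean_tumor = sum(sample[j] for sample in tumor) / len(tumor)
        diffs.append(soft_threshold(mean_tumor - mean_normal, lam))
    return diffs


# Hypothetical toy data: 3 samples per condition, 4 sites.
# Site 0 differs slightly between conditions, site 2 differs strongly.
normal = [[0.25, 0.5, 0.5, 0.75] for _ in range(3)]
tumor = [[0.375, 0.5, 1.0, 0.75] for _ in range(3)]

print(sparse_differences(normal, tumor, lam=0.25))
```

With a threshold of this size, the small difference at the first site is set to zero while the large difference at the third site is kept in shrunken form, which is the sense in which shrinkage produces sparse estimates.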

Public Health Relevance

Recent advances in high-throughput biotechnology have provided us with a wealth of new biological data, a large step towards unlocking the tantalizing promise of personalized medicine: the tailoring of treatment to the genetic makeup of each individual and disease. However, classical statistical and computational tools have proven unable to exploit the extensive information these new experimental technologies bring to bear. This project focuses on building new flexible, data-adaptive tools to translate this wealth of low-level information into actionable discoveries and actual biological understanding.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: Office of The Director, National Institutes of Health (OD)
Type: Early Independence Award (DP5)
Project #: 5DP5OD019820-05
Application #: 9559432
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Miller, Becky

Project Start: 2014-09-18
Project End: 2019-08-31
Budget Start: 2018-09-01
Budget End: 2019-08-31
Support Year: 5
Fiscal Year: 2018
Total Cost
Indirect Cost

Institution

Name: University of Washington
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health
DUNS #: 605799469

City: Seattle
State: WA
Country: United States
Zip Code: 98195

Related projects


NIH 2018 DP5 OD	Data-Driven Statistical Learning with Applications to Genomics Simon, Noah / University of Washington
NIH 2017 DP5 OD	Data-Driven Statistical Learning with Applications to Genomics Simon, Noah / University of Washington
NIH 2016 DP5 OD	Data-Driven Statistical Learning with Applications to Genomics Simon, Noah / University of Washington	$324,170
NIH 2015 DP5 OD	Data-Driven Statistical Learning with Applications to Genomics Simon, Noah / University of Washington	$329,423
NIH 2014 DP5 OD	Data-Driven Statistical Learning with Applications to Genomics Simon, Noah / University of Washington	$361,064

Publications

Roth, Jeremy; Simon, Noah (2018) A framework for estimating and testing qualitative interactions with applications to predictive biomarkers. Biostatistics 19:263-280

Simon, Noah; Simon, Richard (2018) Using Bayesian modeling in frequentist adaptive enrichment designs. Biostatistics 19:27-41

Morrison, Jean; Simon, Noah; Witten, Daniela (2017) Simultaneous detection and estimation of trait associations with genomic phenotypes. Biostatistics 18:147-164

Choi, Minseung; Genereux, Diane P; Goodson, Jamie et al. (2017) Epigenetic memory via concordant DNA methylation is inversely correlated to developmental potential of mammalian cells. PLoS Genet 13:e1007060

Petersen, Ashley; Witten, Daniela; Simon, Noah (2016) Fused Lasso Additive Model. J Comput Graph Stat 25:1005-1025

Haris, Asad; Witten, Daniela; Simon, Noah (2016) Convex Modeling of Interactions with Strong Heredity. J Comput Graph Stat 25:981-1004

Petersen, Ashley; Simon, Noah; Witten, Daniela (2016) Convex Regression with Interpretable Sharp Partitions. J Mach Learn Res 17:
