This project develops statistical and computational methods for the analysis of high-throughput biological data. Effective methods for analyzing these data must balance two opposing ideals. They must be (a) flexible and data-adaptive enough to handle the data's complex structure, yet (b) simple and transparent enough that their results can be interpreted and their uncertainty assessed (so as not to mislead with conviction). The task is made harder by the massive size of these datasets, so attacking these problems requires a marriage of statistical and computational ideas. This project develops frameworks for several problems involving these data. The frameworks balance flexibility and simplicity and remain computationally tractable even on massive datasets. This application has three specific aims.
Aim 1: A flexible and computationally tractable framework for building predictive models. We are commonly interested in modelling phenotypic traits of an individual using omics data, and we would like to find a small subset of genetic features that are important to the expression of the phenotype. I propose a method for flexibly modelling a response variable (e.g. a phenotype) with a small, adaptively chosen subset of features, in a computationally scalable fashion.
Aim 2: A framework for jointly identifying and testing regions that differ across conditions. For example, in methylation data measured in normal and cancer tissue samples, one might expect some regions to be more methylated in one tissue type than in the other. These regions might suggest targets for therapy. However, we do not have the background biological knowledge to pre-specify regions to test. I propose an approach that adaptively selects regions and then tests them in a principled way. This approach is based on a convex formulation of the problem, using shrinkage to achieve sparse differences.
Aim 3: A principled framework for developing and evaluating predictive biomarkers during clinical trials. Modern treatments target specific genetic abnormalities that are generally present in only a subset of patients with a disease. A major current goal in medicine is to develop biomarkers that identify the patients likely to benefit from treatment. I propose a framework for developing and testing biomarkers during large-scale clinical trials. This framework simultaneously builds these biomarkers and applies them to restrict enrollment in the trial to only those patients likely to benefit from treatment. The statistical tools that result from the proposed research will be implemented in freely available software.

Public Health Relevance

Recent advances in high-throughput biotechnology have provided us with a wealth of new biological data, a large step towards unlocking the tantalizing promise of personalized medicine: the tailoring of treatment to the genetic makeup of each individual and disease. However, classical statistical and computational tools have proven unable to exploit the extensive information these new experimental technologies provide. This project focuses on building new flexible, data-adaptive tools to translate this wealth of low-level information into actionable discoveries and actual biological understanding.

Agency: National Institutes of Health (NIH)
Institute: Office of The Director, National Institutes of Health (OD)
Type: Early Independence Award (DP5)
Project #: 5DP5OD019820-05
Application #: 9559432
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Miller, Becky
Project Start: 2014-09-18
Project End: 2019-08-31
Budget Start: 2018-09-01
Budget End: 2019-08-31
Support Year: 5
Fiscal Year: 2018

Name: University of Washington
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health
DUNS #: 605799469
City: Seattle
State: WA
Country: United States
Zip Code: 98195

Simon, Noah; Simon, Richard (2018) Using Bayesian modeling in frequentist adaptive enrichment designs. Biostatistics 19:27-41
Roth, Jeremy; Simon, Noah (2018) A framework for estimating and testing qualitative interactions with applications to predictive biomarkers. Biostatistics 19:263-280
Morrison, Jean; Simon, Noah; Witten, Daniela (2017) Simultaneous detection and estimation of trait associations with genomic phenotypes. Biostatistics 18:147-164
Choi, Minseung; Genereux, Diane P; Goodson, Jamie et al. (2017) Epigenetic memory via concordant DNA methylation is inversely correlated to developmental potential of mammalian cells. PLoS Genet 13:e1007060
Petersen, Ashley; Simon, Noah; Witten, Daniela (2016) Convex Regression with Interpretable Sharp Partitions. J Mach Learn Res 17:
Petersen, Ashley; Witten, Daniela; Simon, Noah (2016) Fused Lasso Additive Model. J Comput Graph Stat 25:1005-1025
Haris, Asad; Witten, Daniela; Simon, Noah (2016) Convex Modeling of Interactions with Strong Heredity. J Comput Graph Stat 25:981-1004