This project involves the development of statistical and computational methods for the analysis of high-throughput biological data. Effective methods for analyzing these data must balance two opposing ideals. They must be (a) flexible and sufficiently data-adaptive to deal with the data's complex structure, yet (b) sufficiently simple and transparent that their results can be interpreted and their uncertainty analyzed (so as not to mislead with conviction). This is additionally challenging because these datasets are massive, so attacking these problems requires a marriage of statistical and computational ideas. This project develops frameworks for attacking several problems involving these biological data. These frameworks balance flexibility and simplicity and are computationally tractable even on massive datasets. This application has three specific aims.
Aim 1: A flexible and computationally tractable framework for building predictive models. Commonly we are interested in modelling phenotypic traits of an individual using omics data. We would like to find a small subset of genetic features that are important in determining the phenotype. In this aim, I propose a method for flexibly modelling a response variable (e.g. phenotype) with a small, adaptively chosen subset of features, in a computationally scalable fashion.
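As a minimal illustration of what "a small, adaptively chosen subset of features" can mean in practice, the sketch below fits an L1-penalized linear model (the lasso) by cyclic coordinate descent on simulated data. The lasso is a standard stand-in chosen here for illustration only; it is not the flexible method proposed in this aim, and the function names, the simulated data, and the penalty value `lam=0.3` are all assumptions.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows, each a list of p feature values; y is a list of n responses.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, starting from beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero feature can never enter the model
            # Covariance of feature j with the partial residual
            # (the residual with feature j's current contribution added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Hypothetical simulated data: 20 features, only the first two affect the response.
random.seed(0)
n, p = 100, 20
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
true_beta = [2.0, -1.5] + [0.0] * (p - 2)
y = [sum(X[i][j] * true_beta[j] for j in range(p)) + random.gauss(0, 0.5)
     for i in range(n)]

beta_hat = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print("selected features:", selected)
```

The soft-thresholding step is what sets most coefficients to exactly zero, so the fitted model depends on only a few features. Each pass over the features costs on the order of n times p operations, which is one reason coordinate descent is commonly used on large datasets.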
Aim 2: A framework for jointly identifying and testing regions which differ across conditions. For example, in the context of methylation data measured in normal and cancer tissue samples, one might expect that some regions are more methylated in one tissue type than in the other. These regions might suggest targets for therapy. However, we do not have the background biological knowledge to pre-specify regions to test. I propose an approach which adaptively selects regions and then tests them in a principled way. This approach is based on a convex formulation of the problem, using shrinkage to achieve sparse differences.
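To make "a convex formulation ... using shrinkage to achieve sparse differences" concrete, one familiar problem of this type is a fused-lasso-style estimator of the per-site difference between conditions. The form below is an assumed illustration, not the formulation used in the proposal; the symbols $d_j$, $\delta_j$, $\lambda_1$ and $\lambda_2$ are introduced here only for the example.

```latex
% d_j      : observed mean methylation difference (cancer minus normal) at site j, j = 1, ..., p
% \delta_j : estimated true difference at site j
\hat{\delta}
  = \operatorname*{arg\,min}_{\delta \in \mathbb{R}^{p}}
    \; \frac{1}{2} \sum_{j=1}^{p} \left( d_j - \delta_j \right)^{2}
    + \lambda_1 \sum_{j=1}^{p} \left| \delta_j \right|
    + \lambda_2 \sum_{j=2}^{p} \left| \delta_j - \delta_{j-1} \right|
```

In a problem of this shape, the $\lambda_1$ penalty shrinks most differences to exactly zero, and the $\lambda_2$ penalty encourages neighbouring sites to share a common value, so the nonzero entries of $\hat{\delta}$ form contiguous candidate regions. Because those regions are chosen from the data, any subsequent test has to account for the selection step.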
Aim 3: A principled framework for developing and evaluating predictive biomarkers during clinical trials. Modern treatments target specific genetic abnormalities that are generally present in only a subset of patients with a disease. A major current goal in medicine is to develop biomarkers that identify those patients likely to benefit from treatment. I propose a framework for developing and testing biomarkers during large-scale clinical trials. This framework simultaneously builds these biomarkers and applies them to restrict enrollment into the trial to only those patients likely to benefit from treatment. The statistical tools that result from the proposed research will be implemented in freely available software.

Public Health Relevance

Recent advances in high-throughput biotechnology have provided us with a wealth of new biological data, a large step towards unlocking the tantalizing promise of personalized medicine: the tailoring of treatment to the genetic makeup of each individual and disease. However, classical statistical and computational tools have proven unable to exploit the extensive information these new experimental technologies provide. This project focuses on building new flexible, data-adaptive tools to translate this wealth of low-level information into actionable discoveries and actual biological understanding.

National Institutes of Health (NIH)
Office of The Director, National Institutes of Health (OD)
Early Independence Award (DP5)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Miller, Becky
University of Washington
Biostatistics & Other Math Sci
Schools of Public Health
United States
Roth, Jeremy; Simon, Noah (2018) A framework for estimating and testing qualitative interactions with applications to predictive biomarkers. Biostatistics 19:263-280
Simon, Noah; Simon, Richard (2018) Using Bayesian modeling in frequentist adaptive enrichment designs. Biostatistics 19:27-41
Morrison, Jean; Simon, Noah; Witten, Daniela (2017) Simultaneous detection and estimation of trait associations with genomic phenotypes. Biostatistics 18:147-164
Choi, Minseung; Genereux, Diane P; Goodson, Jamie et al. (2017) Epigenetic memory via concordant DNA methylation is inversely correlated to developmental potential of mammalian cells. PLoS Genet 13:e1007060
Petersen, Ashley; Simon, Noah; Witten, Daniela (2016) Convex Regression with Interpretable Sharp Partitions. J Mach Learn Res 17:
Petersen, Ashley; Witten, Daniela; Simon, Noah (2016) Fused Lasso Additive Model. J Comput Graph Stat 25:1005-1025
Haris, Asad; Witten, Daniela; Simon, Noah (2016) Convex Modeling of Interactions with Strong Heredity. J Comput Graph Stat 25:981-1004