High-dimensional unsupervised learning, with applications to genomics

Witten, Daniela

Abstract

This project involves the development of statistical methodology for the analysis of large-scale genomic data, such as gene expression, DNA copy number, and DNA sequencing data. In genomic studies, the goal is often to identify signal in the data in an unsupervised way. For instance, given the gene expression measurements for a set of patients with lung cancer, one might wish to discover previously unknown lung cancer subtypes that are characterized by distinct gene expression signatures and that might differ with respect to prognosis or response to therapy. However, the search for signal in genomic data is made difficult by the fact that the number of variables (e.g. genes) is generally orders of magnitude greater than the number of observations (e.g. lung cancer patients). As a result, principled methods must be developed to discover signal without overfitting. Furthermore, there is a need for objective ways to assess the validity of results obtained. This proposal has four specific aims, each of which involves the development of a new statistical method for solving a problem that arises in the analysis of genomic data.
Aim 1: A method to learn multiple related genomic networks at once. For instance, one might expect that the gene expression networks for cancer and normal tissues will look similar to each other, with certain specific differences. The current proposal will provide a way to learn both networks simultaneously, in order to identify gene pathways that are perturbed in cancer. The proposed approach involves applying shrinkage penalties to the Gaussian graphical model formulation for network estimation.
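One way to make the idea in Aim 1 concrete is a penalized Gaussian log-likelihood over several precision matrices. The form below is only a sketch under stated assumptions: the abstract does not give an objective, so the symbols (K conditions, sample sizes n_k, sample covariance matrices S^{(k)}, precision matrices \Theta^{(k)}, tuning parameters \lambda_1 and \lambda_2) and the specific choice of an L1 sparsity penalty plus a fusion penalty are illustrative, not the proposal's stated formulation.

```latex
\[
\max_{\Theta^{(1)},\dots,\Theta^{(K)} \succ 0}\;
\sum_{k=1}^{K} n_k \left[ \log\det \Theta^{(k)}
  - \operatorname{tr}\!\left( S^{(k)} \Theta^{(k)} \right) \right]
\;-\; \lambda_1 \sum_{k=1}^{K} \sum_{i \neq j} \left| \theta^{(k)}_{ij} \right|
\;-\; \lambda_2 \sum_{k < k'} \sum_{i,j} \left| \theta^{(k)}_{ij} - \theta^{(k')}_{ij} \right|
\]
```

With K = 2 (for example, cancer and normal tissue), the first penalty encourages each network to be sparse, while the second shrinks the two networks toward each other, so that edges with a nonzero difference are candidates for pathways perturbed in cancer.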
Aim 2: A principled approach for simultaneously clustering the rows and columns of a data matrix (e.g. patients and genes). The standard approach for discovering signal in genomic data involves clustering rows and columns independently, but the proposed approach will have increased power to discover biologically relevant clusters. The proposed approach involves applying shrinkage penalties to the matrix-variate normal distribution.
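For orientation, the matrix-variate normal log-likelihood that Aim 2 refers to can be written as follows. This is a sketch under assumptions: X is an n x p data matrix with mean matrix M, row covariance \Sigma (n x n), and column covariance \Delta (p x p). The block-constant mean M_{ij} = \mu_{C(i) D(j)}, with row-cluster labels C(i) and column-cluster labels D(j), and the L1 penalty on the block means are one hypothetical way to attach a shrinkage penalty, not a form given in the abstract.

```latex
\[
\ell(M, \Sigma, \Delta) =
 -\frac{p}{2} \log\det \Sigma
 -\frac{n}{2} \log\det \Delta
 -\frac{1}{2} \operatorname{tr}\!\left[ \Sigma^{-1} (X - M)\, \Delta^{-1} (X - M)^{\top} \right]
 + \text{const},
\]
\[
\max_{\mu,\, C,\, D}\;\; \ell(M, \Sigma, \Delta) \;-\; \lambda \sum_{r,c} \left| \mu_{rc} \right|,
\qquad M_{ij} = \mu_{C(i)\, D(j)} .
\]
```

Under this illustrative form, block means shrunk exactly to zero mark row-by-column blocks that carry no signal, which is how a shrinkage penalty can focus the biclustering on the relevant patients and genes.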
Aim 3: A tool for the integrative analysis of multiple genomic data types collected on a single set of patient samples. For instance, if gene expression data, copy number data, and methylation data are collected for a single set of samples, then this will allow for the discovery of subsets of patients that are characterized by particular signatures of gene expression, copy number variation, and methylation. This could lead to the discovery of clinically relevant subtypes of cancer and other diseases. The proposed approach is an extension of the approach described in Aim 2.
Aim 4: A flexible framework for the validation of clusters discovered in structured genomic data, such as DNA copy number and single nucleotide polymorphism data, in order to determine whether clusters discovered reflect signal or simply noise. The proposed approach is related to cross-validation, and will be extended to develop a method for the validation of other unsupervised statistical tools, such as those described in Aims 1-3 above. The statistical tools that result from the proposed research will be implemented in freely available software.
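To illustrate what a cross-validation-style check on clusters can look like, the sketch below uses a simple split-sample "prediction strength" score: cluster a training half and a test half separately, then ask how often pairs of test points that share a test cluster are also assigned to the same training centroid. This is a swapped-in, generic technique chosen for illustration, not the method proposed in Aim 4. All function names, the toy data, and the deterministic k-means start are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))


def mean_point(group):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(group) for col in zip(*group)]


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with a deterministic start (hypothetical choice):
    initial centroids are spread evenly through the sorted points."""
    ordered = sorted(points)
    n = len(ordered)
    centroids = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        updated = [mean_point(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def prediction_strength(train, test, k):
    """Minimum, over test clusters, of the fraction of within-cluster test
    pairs that the training centroids also place together."""
    train_centroids = kmeans(train, k)
    test_centroids = kmeans(test, k)
    test_labels = [nearest(p, test_centroids) for p in test]
    train_labels = [nearest(p, train_centroids) for p in test]
    strengths = []
    for c in range(k):
        members = [i for i, lab in enumerate(test_labels) if lab == c]
        pairs = [(i, j) for a, i in enumerate(members) for j in members[a + 1:]]
        if not pairs:
            continue  # clusters with fewer than two points carry no pair information
        agree = sum(train_labels[i] == train_labels[j] for i, j in pairs)
        strengths.append(agree / len(pairs))
    return min(strengths) if strengths else 0.0


def make_blobs(rng, centers, per_center, sd):
    """Toy data: Gaussian clouds around each center (illustrative only)."""
    return [[rng.gauss(c, sd) for c in center]
            for center in centers for _ in range(per_center)]


rng = random.Random(0)
train = make_blobs(rng, [(0, 0), (10, 10)], per_center=20, sd=0.5)
test = make_blobs(rng, [(0, 0), (10, 10)], per_center=20, sd=0.5)
strength = prediction_strength(train, test, k=2)
print(strength)
```

On the two well-separated toy groups the score is 1.0, because every pair of test points that cluster together is also kept together by the training centroids. Scores well below 1 for a given k suggest that the clusters do not reproduce across the split and may reflect noise rather than signal.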

Public Health Relevance

A major goal of research in genomics is the development of personalized medicine - treatments for cancer and other diseases that are tailored to an individual based on his or her DNA sequence or other genetic information. Though some advances towards this goal have been made, overall progress has been disappointingly slow due to the difficulty in mining through extremely large genomic data sets in order to discover disease-related information. This project addresses this difficulty via the development of new statistical methods for making sense of genomic data.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: Office of The Director, National Institutes of Health (OD)
Type: Early Independence Award (DP5)
Project #: 5DP5OD009145-04
Application #: 8708556
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Basavappa, Ravi

Project Start: 2011-09-20
Project End: 2016-08-31
Budget Start: 2014-09-01
Budget End: 2015-08-31
Support Year: 4
Fiscal Year: 2014

Institution

Name: University of Washington
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health

City: Seattle
State: WA
Country: United States
Zip Code: 98195

Related projects


NIH 2015 DP5 OD	High-dimensional unsupervised learning, with applications to genomics Witten, Daniela / University of Washington	$364,050
NIH 2014 DP5 OD	High-dimensional unsupervised learning, with applications to genomics Witten, Daniela / University of Washington
NIH 2013 DP5 OD	High-dimensional unsupervised learning, with applications to genomics Witten, Daniela / University of Washington	$353,914
NIH 2012 DP5 OD	High-dimensional unsupervised learning, with applications to genomics Witten, Daniela / University of Washington	$377,315
NIH 2011 DP5 OD	High-dimensional unsupervised learning, with applications to genomics Witten, Daniela / University of Washington	$377,059

Publications

Petersen, Ashley; Witten, Daniela (2018) Data-adaptive additive modeling. Stat Med :

Petersen, Ashley; Simon, Noah; Witten, Daniela (2018) SCALPEL: Extracting Neurons from Calcium Imaging Data. Ann Appl Stat 12:2430-2456

Chen, Shizhe; Witten, Daniela; Shojaie, Ali (2017) Nearly assumptionless screening for the mutually-exciting multivariate Hawkes process. Electron J Stat 11:1207-1234

Morrison, Jean; Simon, Noah; Witten, Daniela (2017) Simultaneous detection and estimation of trait associations with genomic phenotypes. Biostatistics 18:147-164

Chen, Shizhe; Shojaie, Ali; Witten, Daniela M (2017) Network Reconstruction From High-Dimensional Ordinary Differential Equations. J Am Stat Assoc 112:1697-1707

Sheng, Elisa; Witten, Daniela; Zhou, Xiao-Hua (2016) Hypothesis testing for differentially correlated features. Biostatistics 17:677-91

Petersen, Ashley; Simon, Noah; Witten, Daniela (2016) Convex Regression with Interpretable Sharp Partitions. J Mach Learn Res 17:

Tan, Kean Ming; Ning, Yang; Witten, Daniela M et al. (2016) Replicates in high dimensions, with applications to latent variable graphical models. Biometrika 103:761-777

Petersen, Ashley; Witten, Daniela; Simon, Noah (2016) Fused Lasso Additive Model. J Comput Graph Stat 25:1005-1025

Haris, Asad; Witten, Daniela; Simon, Noah (2016) Convex Modeling of Interactions with Strong Heredity. J Comput Graph Stat 25:981-1004

Showing the most recent 10 out of 27 publications
