This research develops new statistical methods and theory for graphical modeling of high-dimensional data, in which the number of features exceeds the number of observations. In certain applications, such as the estimation of transcriptional regulatory networks from gene expression data, existing techniques are inadequate for two reasons: the assumptions that underlie them are insufficient for accurate network recovery in such high dimensions, and the assumptions that are made may be unrealistic for the data. To address these two problems, the investigator proposes to study (a) a set of techniques for learning one or more Gaussian graphical models more effectively by making more structured assumptions about the topology of the true conditional dependence networks, using convex penalties and other approaches; and (b) more flexible frameworks for estimating conditional dependence relationships without the usual Gaussianity assumptions.
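As background on what a Gaussian graphical model encodes, the following minimal sketch may help. It is not the investigator's proposed method. It assumes a hypothetical three-variable chain X1 - X2 - X3, and the matrix values and the helper names `invert` and `edges` are made up for illustration. In a Gaussian model, a zero off-diagonal entry of the precision (inverse covariance) matrix means the two variables are conditionally independent given the rest, even though their marginal covariance may be nonzero.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination.

    Uses exact rational arithmetic and assumes the matrix is invertible.
    """
    n = len(matrix)
    # Build the augmented matrix [A | I].
    aug = [
        [Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot equals 1.
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def edges(precision):
    """Return the graph edges: pairs (i, j), i < j, with a nonzero precision entry."""
    n = len(precision)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if precision[i][j] != 0]


# Hypothetical precision matrix for the chain X1 - X2 - X3 (illustrative values).
precision = [
    [2, -1, 0],
    [-1, 2, -1],
    [0, -1, 2],
]
covariance = invert(precision)

print(edges(precision))   # [(0, 1), (1, 2)]: no direct edge between X1 and X3
print(covariance[0][2])   # 1/4: X1 and X3 are still marginally correlated
```

The covariance matrix here is fully dense while the precision matrix is sparse, which is why network estimation targets the precision matrix. When features outnumber observations, the sample covariance matrix is singular and cannot simply be inverted, which is one reason structured assumptions such as sparsity-inducing convex penalties are used.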

In recent years, new technologies and fast computers have resulted in the generation and availability of vast amounts of data in fields as diverse as molecular biology, marketing, finance, sociology, linguistics, and computer vision. Unfortunately, analyzing this type of "big data" poses severe statistical challenges, and the classical statistical toolset often cannot be applied. Developing effective statistical machine learning techniques for making sense of very large-scale data sets is therefore crucial for progress in many areas of science and industry, in order to bridge the gap between the data being collected and the scientific and industrial questions being asked about it. As an example, the ability to estimate gene networks from genomic data has important implications for understanding biological processes and for making progress towards the treatment of cancer and other diseases. This proposal involves (1) developing techniques for improved network estimation on the basis of high-dimensional data sets; (2) disseminating the resulting techniques to the statistical and biomedical communities via publications, seminars, and the public release of software; (3) training PhD students in statistical machine learning techniques for big data; and (4) increasing the exposure of high school students, undergraduates, and members of underrepresented groups to statistical machine learning and big data challenges via short courses, conference presentations, and other activities.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1252624
Program Officer: Gabor Szekely
Budget Start: 2013-07-01
Budget End: 2020-06-30
Fiscal Year: 2012
Total Cost: $400,000
Name: University of Washington
City: Seattle
State: WA
Country: United States
Zip Code: 98195