A research effort is proposed to create new tools for high-dimensional data analysis, focusing on the very challenging regime where signals are both rare and weak. In particular, the investigator proposes to: (a) Develop graphlet screening as a new tool for high-dimensional variable selection, introduce a new theoretical framework for assessing the optimality of variable selection, and show that graphlet screening achieves the optimal rate of convergence in terms of the Hamming distance of the selection errors. (b) Develop a new method of spectral clustering using the recent idea of Higher Criticism thresholding, and investigate the fundamental limits for several problems related to low-rank matrix recovery, including high-dimensional clustering, sparse Principal Component Analysis, and a testing problem related to the underlying large covariance matrix. (c) Extend and apply the proposed methods and theory to the analysis of Big Data generated in various scientific fields, including genomics and machine learning.

We are often told that we are entering the era of 'Big Data', where massive datasets consisting of millions of observations are mined for associations and patterns. What is never said about this pervasive trend is that, unfortunately, the signal we are looking for is usually very rare and weak, so it is hard to find and it is easy to be fooled. The project introduces new ideas, new tools, and novel theory that are appropriate for rare and weak signals in Big Data, and applies the theory and methods to various scientific fields, including genomics and machine learning.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1208315
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2012-08-01
Budget End: 2017-07-31
Support Year:
Fiscal Year: 2012
Total Cost: $119,999
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213