An important and visible trend in empirical science today is the increasing prevalence of large data sets that contain from thousands to hundreds of millions of measurements. Examples include data sets arising from high-throughput measurement techniques such as gene expression arrays, proteomics, and computer network monitoring. Although the analysis of large data sets is important to scientists, it often lies outside the realm of classical statistical methods and frequently presents new conceptual and computational challenges. The funded research has two principal parts. In the first, the investigators are studying the application of subspace clustering, a relatively new development in the field of data mining, to the exploratory statistical analysis of high-dimensional data. In the second, the investigators are applying ideas from statistics and probability to the development of new subspace clustering methods and to rigorous mathematical analysis of their results. The research is being carried out in the context of ongoing collaborations with biological scientists and is being incorporated into software that the collaborating scientists will use to identify and assess significant sample-variable associations in a variety of large data sets.

An important and visible trend in empirical science today is the increasing prominence of large data sets that contain from thousands to hundreds of millions of measurements. Examples include data sets arising from high-throughput measurement techniques such as gene expression arrays, proteomics, and computer network monitoring. Whereas small to moderate data sets typically have more samples than measurements, large data sets commonly have more measurements than samples, a setting known as "high dimension and low sample size". The investigators are studying the application of data mining methods known as subspace clustering to the exploratory analysis of high-dimensional data. Subspace clustering identifies distinguished sample-variable interactions (submatrices) in a given data matrix. Unlike in standard two-way clustering, the sample and variable sets for different clusters can overlap. The investigators are examining the noise sensitivity of existing subspace clustering algorithms, and are developing and implementing new subspace clustering methods, based on average-based selection criteria, that are better suited to applications where noise is present. As an application of these methods, they are using subspace clusters to classify high-dimensional data. Using a variety of tools from combinatorial probability, the investigators are also developing a rigorous theoretical framework in which multiple testing and the statistical significance of subspace clusters can be addressed. The funded research is being carried out in the context of ongoing collaborations with biologists and computer scientists.
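
To make the idea of an average-based selection criterion concrete, the sketch below searches a small data matrix for a k-by-l submatrix with a large average. It is an illustration only and is not taken from the funded work: the alternating greedy search, the function names (`mean_of`, `top_indices`, `large_average_submatrix`), the `max_iter` cap, and the toy data are all assumptions chosen for the example.

```python
def mean_of(matrix, rows, cols):
    """Average of the entries in the submatrix indexed by rows x cols."""
    total = sum(matrix[i][j] for i in rows for j in cols)
    return total / (len(rows) * len(cols))


def top_indices(scores, count):
    """Indices of the `count` largest scores, returned in increasing order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:count])


def large_average_submatrix(matrix, k, l, max_iter=100):
    """Hypothetical alternating greedy search for a k x l submatrix
    (k samples, l variables) with a large average. It is a heuristic:
    it can stop at a local optimum rather than the best submatrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Assumed starting point: the l columns with the largest overall sums.
    col_sums = [sum(matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]
    cols = top_indices(col_sums, l)
    rows = []
    for _ in range(max_iter):
        # Keep the k rows with the largest sums over the current columns.
        row_scores = [sum(matrix[i][j] for j in cols) for i in range(n_rows)]
        new_rows = top_indices(row_scores, k)
        # Keep the l columns with the largest sums over those rows.
        col_scores = [sum(matrix[i][j] for i in new_rows) for j in range(n_cols)]
        new_cols = top_indices(col_scores, l)
        if new_rows == rows and new_cols == cols:
            break  # Neither index set changed, so the search has settled.
        rows, cols = new_rows, new_cols
    return rows, cols, mean_of(matrix, rows, cols)


# Toy data (made up): rows are samples, columns are variables. Rows 1 and 3
# carry elevated values in columns 0, 2 and 4; the rest is small noise.
data = [
    [0.1, -0.2, 0.0, 0.3, -0.1],
    [2.1, 0.2, 1.9, -0.3, 2.0],
    [-0.4, 0.1, 0.2, 0.0, -0.2],
    [1.8, -0.1, 2.2, 0.1, 2.1],
]

rows, cols, avg = large_average_submatrix(data, k=2, l=3)
print(rows, cols, round(avg, 3))
```

Because each selection is made independently of any earlier result, running such a search from different starting columns can return submatrices whose sample and variable sets overlap, which is the property that distinguishes subspace clusters from standard two-way clusters.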

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Application #
0406361
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2004-09-15
Budget End
2008-08-31
Support Year
Fiscal Year
2004
Total Cost
$252,689
Indirect Cost
Name
University of North Carolina Chapel Hill
Department
Type
DUNS #
City
Chapel Hill
State
NC
Country
United States
Zip Code
27599