Intellectual merit of the proposed activity. The primary focus of this investigation is the following ubiquitous algorithmic challenge: Given a large data set, provide a "simple" yet "accurate" description of the data.

The question is treated at several levels. At the first level it is considered as a data-clustering problem. A clustering algorithm (designed with respect to a particular "distortion" measure of dissimilarity between points) partitions the data into clusters so that distortion between points within a cluster is minimized (i.e., similarity is maximized), while distortion across clusters is maximized. In the huge databases prevalent in scientific, engineering, business, government and medical applications, "polynomial time algorithms" are not good enough: a typical algorithm running in quadratic or cubic time is too slow to execute on gigabytes or terabytes of data. This proposal therefore places an emphasis on near-linear time algorithms. A particular novel approach, recently developed by the proposer, is described.
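As a concrete illustration of a distortion-minimizing clusterer, the sketch below implements Lloyd's k-means heuristic with squared Euclidean distance as the distortion. This is a standard textbook example, not the novel approach mentioned above, and the function name `lloyd_kmeans` is invented for illustration; each iteration runs in time linear in the number of points (for fixed k and dimension), i.e., in the near-linear regime the proposal emphasizes.

```python
import random

def lloyd_kmeans(points, k, iters=20, seed=0):
    """Illustrative distortion-minimizing clustering via Lloyd's
    k-means heuristic.  Each iteration costs O(n*k*d) time: near-linear
    in the number of points n for fixed k and dimension d."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest current center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # Update step: each nonempty cluster's center moves to its mean,
        # which minimizes within-cluster squared distortion.
        for j, cl in enumerate(clusters):
            if cl:
                d = len(cl[0])
                centers[j] = tuple(sum(p[i] for p in cl) / len(cl)
                                   for i in range(d))
    return centers, clusters
```

On well-separated data the heuristic recovers the natural partition quickly, though in general it guarantees only a local optimum of the distortion.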

In the second treatment, more advanced data analysis problems are addressed. In these problems richer and more flexible models for the data are allowed, such as mixture models of several low-degree surfaces. Data analysis in these models is beyond the reach of current approximation algorithms. The data analysis requires partitioning of the data, and so the NP-completeness of clustering is inherited; but in addition, new difficulties are inherited from applied statistics, such as the issue of missing or incomplete records in real world data. The combination requires solution of fundamental new algorithmic and geometric questions.

The third part of the proposal is concerned not with a particular clustering problem but with setting up the right kind of theory for such nonparametric inference problems. The fundamental principle is this: We should seek to cluster only data sets that admit a high-quality clustering. Often, what makes a particular input hard for a clustering algorithm is that its best clustering is little better than average: the input does not separate cleanly into well-separated pieces. A good data analysis should detect this fact, rather than stall on a search for a best yet near-average clustering. Developing a sound theoretical framework for clustering therefore requires departing from the "worst-case analysis" framework that has been dominant so far.
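The principle above can be made concrete with a crude "clusterability" score: compare the distortion of a proposed clustering against the average distortion of random partitions of the same data. A ratio near 1 signals that the best clustering found is little better than average, so the data does not separate cleanly. This is an illustrative sketch only; the function names and the random-partition baseline are assumptions of this example, not the theoretical framework the proposal develops.

```python
import random

def kmeans_cost(points, labels, k):
    """Sum of squared distances from each point to its cluster mean."""
    cost = 0.0
    for j in range(k):
        cl = [p for p, l in zip(points, labels) if l == j]
        if not cl:
            continue
        d = len(cl[0])
        mean = [sum(p[i] for p in cl) / len(cl) for i in range(d)]
        cost += sum(sum((p[i] - mean[i]) ** 2 for i in range(d))
                    for p in cl)
    return cost

def clusterability(points, best_labels, k, trials=50, seed=0):
    """Ratio of a proposed clustering's cost to the average cost of a
    uniformly random partition.  A value near 1 means the 'best'
    clustering is little better than average: the data resists
    clustering, and a search for an optimal partition is unlikely to
    yield a meaningful answer."""
    rng = random.Random(seed)
    best = kmeans_cost(points, best_labels, k)
    avg = sum(kmeans_cost(points, [rng.randrange(k) for _ in points], k)
              for _ in range(trials)) / trials
    return best / avg
```

A data analysis could report this ratio alongside its output, flagging inputs whose best clustering is near-average instead of returning an arbitrary partition.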

The fourth part of the proposal is devoted to theoretical analysis of the EM clustering algorithm and its variants. EM is an iterative heuristic with few performance guarantees; however, it is fast and widely used in practice. An important goal, therefore, is to determine under what conditions EM performs well, and what conditions require other approaches.
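For reference, here is a minimal sketch of the EM iteration for a two-component one-dimensional Gaussian mixture: the E-step computes posterior responsibilities, and the M-step re-estimates weights, means, and variances by weighted maximum likelihood. The initialization and variance clamping below are illustrative choices of this example, not prescriptions from the proposal.

```python
import math

def em_gmm_1d(xs, iters=50):
    """Minimal EM sketch for a two-component 1-D Gaussian mixture.
    Each iteration is linear in the number of samples, which is why
    EM is attractive on large data despite its weak guarantees."""
    mu = [min(xs), max(xs)]          # crude deterministic initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            w = [pi[j] * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                 / math.sqrt(2 * math.pi * var[j]) for j in range(2)]
            s = w[0] + w[1]
            resp.append([w[0] / s, w[1] / s])
        # M-step: weighted maximum-likelihood parameter updates.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            pi[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, xs)) / nj, 1e-6)
    return mu, var, pi
```

Each iteration increases the likelihood, but only convergence to a local optimum is guaranteed; characterizing when such local optima are good is exactly the kind of question this part of the proposal pursues.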

The fifth part of the proposal is devoted to a different topic: the limitations imposed on feedback control mechanisms by limited processing power or limited communication capacity. Because of the real-time nature of control tasks, conventional "input-output" circuit complexity is not an appropriate theoretical framework. Two general types of questions are pursued. The first is characterization of the class of stabilization tasks that can be performed by highly parallelized, ultra-fast digital control circuits. The second is ensuring reliable, real-time communication among components of a control system despite noise on the communication channels.

Broader impact of the proposed activity. Data analysis is required in many scientific, engineering, business, government and medical applications. The importance of sifting useful information out of huge databases is such that advances in data analysis algorithms, as well as in the mathematical framework underpinning the design of these algorithms, have significant potential benefit to society.

Increasing automation in our technological infrastructure has created a convergence of control, computation and communication technologies. Some of the difficulties associated with carrying out this convergence will be resolved through progress on the questions described in the last section of the proposal.

All topics in the proposal are highly suited to graduate, and in some cases undergraduate, research. Interdisciplinary student training in algorithms, complexity, control and information theory is ongoing (by the PI and several colleagues), and the PI intends to increase this activity during the period of the proposed investigation.

Project Start:
Project End:
Budget Start: 2005-06-01
Budget End: 2009-05-31
Support Year:
Fiscal Year: 2005
Total Cost: $200,000
Indirect Cost:
Name: California Institute of Technology
Department:
Type:
DUNS #:
City: Pasadena
State: CA
Country: United States
Zip Code: 91125