Numerical simulations are replacing traditional experiments in gaining insights into complex physical phenomena. Given recent advances in computer hardware and numerical methods, it is now possible to simulate physical phenomena at very fine temporal and spatial resolutions. As a result, the amount of data generated is overwhelming.

Scientists are interested in analyzing and visualizing the data produced by such simulations to better understand the process being simulated. Analyzing such large-scale data is hard: not only are the methods computationally expensive, but current programming tools also make the analyses difficult to specify and modify. Thus, there is a dire need for a systematic approach, along with supporting algorithms and methodologies for flexible parallel implementations, to achieve scalable and interactive analysis on large scientific datasets.

In this project, we propose the construction of such a scalable toolkit, namely the Computational Analysis Toolkit (CAT). The toolkit will exploit ongoing work in feature analysis, scalable data mining, and parallel programming environments. The crux of the approach is feature mining: a process whereby regions of interest are delineated through successive stages of detection, verification, de-noising, and tracking of points of interest. Additionally, we propose the use of some key data mining algorithms to achieve enhanced and robust implementations of feature-mining algorithms.
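To make the stages concrete, below is a minimal Python sketch of one feature-mining pass over a simulation snapshot. The function names, the threshold-based detection criterion, and the overlap-based tracking heuristic are illustrative assumptions, not part of the proposed toolkit.

```python
import numpy as np
from scipy import ndimage

def feature_mine(snapshot, threshold=0.8):
    """Illustrative feature-mining pass over one simulation snapshot.

    Stages: detect candidate points, de-noise, and delineate regions.
    The threshold test is a stand-in for a domain-specific
    point-of-interest criterion (e.g. vorticity magnitude).
    """
    # Detection: flag points of interest above a normalized threshold.
    mask = snapshot > threshold * snapshot.max()

    # De-noising: discard isolated points via a morphological opening.
    mask = ndimage.binary_opening(mask)

    # Delineation: group surviving points into connected regions.
    labels, num_features = ndimage.label(mask)
    return labels, num_features

def track(prev_labels, curr_labels):
    """Tracking stub: match regions across consecutive time steps
    by spatial overlap, one common verification heuristic."""
    matches = {}
    for region in range(1, curr_labels.max() + 1):
        hits = prev_labels[curr_labels == region]
        hits = hits[hits > 0]
        matches[region] = np.bincount(hits).argmax() if hits.size else None
    return matches
```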

It is our objective that the CAT toolkit should not only allow for the detection of features but also provide a means to control the analysis in an interactive setting. For example, demographic and lifetime analysis of certain critical features, as determined by the user/scientist, may be an important way of understanding the underlying process being simulated. These critical features, once tagged via a suitable interface, can be profiled, and a concise representation of this profile can then be presented to the user as needed.
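As an illustration of the kind of demographic and lifetime profiling envisioned, the sketch below accumulates per-feature statistics across time steps and summarizes them. All class and field names here are hypothetical, chosen only to show the shape of such a profile.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureProfile:
    """Concise per-feature summary of the kind a user might request
    after tagging a feature of interest (names are illustrative)."""
    feature_id: int
    birth_step: int
    death_step: int | None = None        # None while still alive
    sizes: list = field(default_factory=list)  # region size per step

    @property
    def lifetime(self):
        end = self.death_step if self.death_step is not None else self.birth_step
        return end - self.birth_step + 1

def summarize(profiles):
    """Demographic summary over a non-empty set of tagged features."""
    lifetimes = [p.lifetime for p in profiles]
    return {
        "count": len(lifetimes),
        "mean_lifetime": sum(lifetimes) / len(lifetimes),
        "max_lifetime": max(lifetimes),
    }
```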

We believe that for long-term use of a tool for feature and data mining, it is important that a) the algorithms are parallelized on a variety of platforms, b) the parallel implementations are easy to maintain and modify, and c) APIs are available for users to rapidly create scalable implementations of new mining algorithms. We propose to achieve these goals by using and extending a parallelization framework developed by our group. This framework, referred to as FRamework for Rapid Implementations of Datamining Engines (FREERIDE), offers high-level APIs and runtime techniques that enable parallelization of algorithms for data mining and related tasks. It allows parallelization on both distributed-memory and shared-memory configurations, and further supports efficient processing of disk-resident datasets.
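The sketch below illustrates the generalized-reduction programming style that FREERIDE-like frameworks expose, using k-means as the example: the user supplies a local reduction applied to each data chunk and a global reduction that merges partial results, while the runtime handles chunking and parallel execution. The class and method names are our own illustrative choices, not FREERIDE's actual interface, and the points are one-dimensional for brevity.

```python
from multiprocessing import Pool

class KMeansReduction:
    """Hypothetical generalized-reduction specification for k-means."""

    def __init__(self, centers):
        self.centers = centers

    def local_reduce(self, chunk):
        # Accumulate (sum, count) per cluster over one chunk of points.
        acc = {i: [0.0, 0] for i in range(len(self.centers))}
        for x in chunk:
            i = min(range(len(self.centers)),
                    key=lambda c: abs(x - self.centers[c]))
            acc[i][0] += x
            acc[i][1] += 1
        return acc

    def global_reduce(self, partials):
        # Merge per-chunk accumulators and recompute the centers.
        merged = {i: [0.0, 0] for i in range(len(self.centers))}
        for acc in partials:
            for i, (s, n) in acc.items():
                merged[i][0] += s
                merged[i][1] += n
        return [s / n if n else self.centers[i]
                for i, (s, n) in merged.items()]

def run_iteration(spec, chunks):
    """One parallel iteration: local reductions in parallel workers,
    then a sequential global merge."""
    with Pool() as pool:
        partials = pool.map(spec.local_reduce, chunks)
    return spec.global_reduce(partials)
```

In a FREERIDE-style system the same two-function specification also covers shared-memory and disk-resident execution; here a process pool simply stands in for the runtime.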

The proposal, besides providing a useful toolkit, is likely to engender new methodologies for large-scale data exploration. Our efforts are likely to contribute to the literature on scalable data and feature mining algorithms and on feature profile summarization.

Agency: National Science Foundation (NSF)
Institute: Division of Computing and Communication Foundations (CCF)
Application #: 0234273
Program Officer: Almadena Y. Chtchelkanova
Budget Start: 2003-09-15
Budget End: 2006-08-31
Fiscal Year: 2002
Total Cost: $373,007
Name: Ohio State University
City: Columbus
State: OH
Country: United States
Zip Code: 43210