The objective of this project is to support the Data Grid infrastructure by developing new techniques for efficient storage, retrieval, and analysis of complex scientific data. The main focus is the development of a highly scalable data engine geared toward the needs of analytical computing in Data Grid environments. Key to realizing the project are advances in indexing and clustering data in multi-dimensional spaces.
While the main goal of analytical computing in Data Grid environments is to facilitate hypothesis formulation or to test the validity of a postulated scientific model, its primary method is usually data clustering. Because typical analytical tasks also rely on ad hoc data exploration, any data engine for Grid-enabled analytical computing must support an integrated set of retrieval and clustering techniques. The data engine developed in this project will feature: an efficient and scalable indexing technique for data in high-dimensional spaces, including a practical solution for handling data with missing information; a new access method for similarity searching in multi-dimensional spaces; and an original technique for clustering large volumes of multi-dimensional data that requires no dimensionality reduction.