There is a growing need, in many areas of science, engineering, and business, for effective approaches to mining very large (i.e., petabyte-scale) data sets.

The project aims to design, analyze, and implement a number of fundamental matrix-mining and graph-mining operations that scale to petabyte-sized inputs. Such efforts are needed to sustain the phenomenal growth in analyzing, visualizing, and extracting information from massive matrices and graphs. The project leverages Rensselaer's unique computing platform: a massively parallel machine (a Blue Gene/Q) with access to approximately 1.2 petabytes of storage, together with a data-staging layer, named the RAM Storage Accelerator (RSA), with 512 computational nodes and a total of 8 TB of fast RAM. The platform is configurable so that the computational nodes at the RSA level can pre-process data from secondary storage in a cloud-like fashion. The project aims to design and analyze approximation algorithms for matrix and graph mining tasks that follow an iterative, two-step approach. First, given petabyte-scale data, computationally inexpensive methods run on the RSA layer, used as a "cloud", to produce compact data sketches that reduce the data from the petabyte scale to the terabyte scale. Second, the resulting data sketches are processed with computationally demanding methods on the Blue Gene/Q. The process is then iterated, using the approximate solutions to improve the quality of the sketches and the approximation guarantees.
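The iterative, two-step pattern described above can be illustrated on a toy scale. The following is a minimal sketch under stated assumptions, not the project's actual algorithms: it picks least-squares regression as the example task, uses uniform row sampling as the inexpensive sketching step, and uses the small sketched Gram matrix for the more demanding solve. All function names (`sample_rows`, `gram`, `solve`, `residual_gradient`, `sketch_and_refine`) are hypothetical and chosen only for illustration.

```python
import random


def sample_rows(A, m, rng):
    """Cheap step: sample m rows uniformly with replacement and rescale
    them so that the sketch's Gram matrix estimates A^T A."""
    n = len(A)
    scale = (n / m) ** 0.5
    return [[scale * v for v in A[rng.randrange(n)]] for _ in range(m)]


def gram(M):
    """Return M^T M for a list-of-rows matrix M."""
    d = len(M[0])
    return [[sum(r[i] * r[j] for r in M) for j in range(d)] for i in range(d)]


def solve(G, g):
    """Solve the small dense system G x = g by Gaussian elimination
    with partial pivoting."""
    d = len(g)
    M = [row[:] + [gi] for row, gi in zip(G, g)]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, d):
            f = M[r][c] / M[c][c]
            for k in range(c, d + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * d
    for c in reversed(range(d)):
        tail = sum(M[c][k] * x[k] for k in range(c + 1, d))
        x[c] = (M[c][d] - tail) / M[c][c]
    return x


def residual_gradient(A, b, x):
    """Compute A^T (b - A x) in a single streaming pass over the rows."""
    g = [0.0] * len(x)
    for row, bi in zip(A, b):
        r = bi - sum(a * xi for a, xi in zip(row, x))
        for j, a in enumerate(row):
            g[j] += a * r
    return g


def sketch_and_refine(A, b, m, iters, seed=0):
    """Alternate between a fresh row-sampling sketch and a solve on the
    sketch, using the current approximate solution to drive each update."""
    rng = random.Random(seed)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        H = gram(sample_rows(A, m, rng))   # sketched stand-in for A^T A
        dx = solve(H, residual_gradient(A, b, x))
        x = [xi + di for xi, di in zip(x, dx)]
    return x
```

Calling `sketch_and_refine(A, b, m=200, iters=15)` on a tall, well-conditioned matrix `A` (a list of rows) and right-hand side `b` should move the estimate toward the least-squares solution, provided each sketch approximates the Gram matrix of `A` reasonably well. Uniform sampling is used here only for brevity; it is an assumption of this example and can perform poorly on data with a few dominant rows.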

The research team expects to release software and libraries for matrix and graph mining algorithms that implement the project's two-phase approach for petabyte-scale matrices and graphs. The resulting tools will be applied to the analysis of petabytes of data from computer simulations of the dynamics of biomolecular systems. The investigators plan to involve students and researchers from other institutions in the design, analysis, and development of the proposed methods through an internship program. The project also offers increased opportunities for research-based training in Data Analytics and High Performance Computing to graduate and undergraduate students at RPI. The results of the research will be made available to the academic community through the project web site.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1302231
Program Officer: Sylvia Spengler
Budget Start: 2013-09-01
Budget End: 2019-08-31
Fiscal Year: 2013
Total Cost: $1,000,000
Name: Rensselaer Polytechnic Institute
City: Troy
State: NY
Country: United States
Zip Code: 12180