Current relational database systems (RDBMSs) were engineered for the business data processing market, not for scientific users (e.g., astronomers, physicists, chemists, oceanographers, or earth scientists). By and large, science users either grumble and use RDBMSs or, more often, "roll their own" data management software. Significant projects, such as the Large Hadron Collider (LHC) and the NASA Mission to Planet Earth, have spent millions of dollars on custom software systems that have limited applicability to other projects. As a result, after a generation of science applications, there is little shared data management infrastructure.

SciDB is a project to build an open-source DBMS focused on the needs of science users. We have developed the requirements for SciDB in close collaboration with a number of scientists. Data management features include a nested array data model (rather than the tabular model of RDBMSs) with operations attuned to scientific data, a no-overwrite storage model (allowing interaction with historical results), and support for uncertainty, named versions, and provenance information.

At this point, there is a distributed team of 17 programmers and scientists working actively on the design and implementation of SciDB, assisted by an advisory committee of 15 scientists. This team is primarily focused on the research issues surrounding the design of SciDB, and most contributors are involved in the project as volunteers or are paid by their individual organizations.

A demo of the first working proof-of-concept SciDB prototype was given at the VLDB conference in August 2009, and a first public release of SciDB is planned for September 2010. The purpose of this NSF grant is to enhance SciDB with additional science-oriented features, including time travel, versions, uncertainty and provenance. With NSF's help, we expect to develop a full-function system by the end of the grant period.

Project Report

During this project we worked on three aspects of scientific database support: versions and provenance, join optimization, and scalable visualization of scientific data. Our work on versions and provenance included building a versioning system for SciDB, a novel array DBMS. Here, we demonstrated that old data need not be discarded when it is updated; using "delta" techniques, it can be kept indefinitely. We also worked on two provenance tracking systems for arrays that allow users to answer the question "where did a particular data object come from?" This is useful when scientific users see suspect data, because they can efficiently trace the derivation of that data to find root causes. Our work on joins addressed efficiently joining two arrays whose contents are skewed. In most scientific applications, data is very unevenly distributed across the cells of an array, and new algorithms are needed to join skewed data efficiently. Our visualization efforts focused on rendering very large data sets that cannot fit on the screen in their entirety. We explored a pan/zoom interface and worked on the architectural issues involved in supporting it, including precomputation, prefetching, and caching.
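
The "delta" idea behind no-overwrite versioning can be illustrated with a minimal sketch. This is not SciDB's implementation: the `VersionedArray` class, its method names, and the dictionary-based cell layout are all hypothetical, and the sketch assumes cells are only added or changed, never deleted. Each new version stores only the cells that differ from the previous one, so every older version remains readable.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]      # (row, column) coordinate in a 2-D array
Delta = Dict[Cell, float]   # only the cells that changed in one version


class VersionedArray:
    """Hypothetical no-overwrite array: a full base version plus deltas."""

    def __init__(self, base: Dict[Cell, float]) -> None:
        self._base = dict(base)         # version 0, stored in full
        self._deltas: List[Delta] = []  # delta i turns version i into i + 1

    def latest_version(self) -> int:
        return len(self._deltas)

    def commit(self, new_state: Dict[Cell, float]) -> int:
        """Record a new version as the cells that differ from the latest one.

        Assumption: cells are added or changed, never deleted.
        """
        latest = self.read(self.latest_version())
        delta = {c: v for c, v in new_state.items() if latest.get(c) != v}
        self._deltas.append(delta)
        return self.latest_version()

    def read(self, version: int) -> Dict[Cell, float]:
        """Rebuild a version by replaying deltas on top of the base."""
        if not 0 <= version <= self.latest_version():
            raise IndexError(f"no such version: {version}")
        state = dict(self._base)
        for delta in self._deltas[:version]:
            state.update(delta)
        return state


arr = VersionedArray({(0, 0): 1.0, (0, 1): 2.0})
v1 = arr.commit({(0, 0): 1.0, (0, 1): 5.0})
old = arr.read(0)   # the original data is still available
new = arr.read(v1)  # the updated data
```

In this sketch an update costs storage proportional to the number of changed cells, while reading an old version costs one replay of the deltas up to that version. A real system would have to balance those two costs, for example by storing periodic full copies.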

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1047955
Program Officer: Daniel Katz
Budget Start: 2010-09-15
Budget End: 2014-08-31
Fiscal Year: 2010
Total Cost: $500,000
Name: Massachusetts Institute of Technology
City: Cambridge
State: MA
Country: United States
Zip Code: 02139