Commercial and government entities now spend around $10 billion per year on software and hardware systems for managing "data warehouses", which are very large electronic data archives. Despite the size and importance of this marketplace, existing data management solutions can be painfully slow (see, for example, www.tpc.org/tpch/ for recent benchmarking results). It is now possible to spend millions of dollars on hardware and software for a system that still takes hours to answer simple analytic questions. This is unfortunate, because there is much knowledge to be gained by interactive exploration of electronic archives. Very long processing times make it likely that the data will be stored away and never looked at again.

The DBO Database System project is concerned with the design and development of a unique system called DBO. Like traditional relational database systems, DBO can run database queries from start to finish and produce exact answers over very large archives. However, unlike any existing research or production system, DBO uses sampling algorithms to produce a statistical estimate for the final query answer at all times throughout query execution. An example of the sort of estimate produced by DBO is, "There is a 95% chance that the true answer is between $1.75 million and $1.80 million." The longer a user waits, the more accurate the estimate becomes. The potential benefit of such an estimate is that a user can stop execution whenever satisfied with the accuracy of the estimate, which may translate to dramatic time savings during exploratory processing. In this way, the goal of the DBO project is to render interactive data analysis a reality, even over the largest databases.

All scientific and technical materials produced by the project, as well as any software available for download, can be obtained from www.cise.ufl.edu/~cjermain/DBO.

Project Report

Data management systems have been in widespread use for decades, but recently a new class of applications termed analytic processing applications has become prevalent. In analytic processing, data is aggregated into a large data repository (sometimes called a data warehouse) and used for subsequent analysis: figuring out how customers react to sales promotions, how websites are viewed, how and why customers complain, and so on. It is not surprising that database and data management technology designed 30 years ago to ensure atomicity and correctness under concurrent reads and writes does not perform well when it is used to analyze 100 terabytes of analytic data at an aggregate level. This has resulted in significant pain in practice, as well as the development of an entire "NoSQL" ecosystem that attempts to use non-relational tools such as MapReduce to perform large-scale analytics.

One way to address the pain is to move toward online query processing. Online aggregation is one well-known type of online query processing. In online processing, the database or data management system makes use of randomized algorithms to come up with a quick guess as to the answer to a query. As the user waits, the guess is refined, until eventually the "guess" is totally accurate because query processing has completed. This has the advantage of allowing the user to stop waiting for the final query answer as soon as the guess is "good enough". The potential benefit should be obvious: if it takes ten hours to get the exact answer, but only five minutes to get a high-quality guess, then we have saved a huge amount of both computer and user time. Analytic queries over a data warehouse are particularly amenable to this sort of approximation because the questions that are asked are almost always statistical in nature.

There have been two main research efforts associated with the project. Our first main research effort was concerned with designing, implementing, and evaluating the randomized algorithms that allow for accurate, statistically meaningful guesses at the answers to analytic database queries, from startup through completion. Taken together, the implementations of these algorithms form a system called DBO. Our second main research effort was concerned with designing and implementing the parallel database platform that DBO runs on top of, which we call DataPath. Over the lifetime of the project, we published papers on both DBO and DataPath, as well as on online aggregation in general. We considered issues such as how to process online aggregation queries in a distributed environment (on top of a MapReduce system), as well as how to implement very fast query processing engines on top of modern, inexpensive server machines. In addition, we have released the DataPath code for public use.
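To make the idea concrete, the following is a minimal Python sketch of the statistical machinery behind an online aggregation estimate. It is illustrative only, not DBO's actual implementation: the function name, the reporting interval, and the stopping threshold in the usage fragment are all hypothetical. The sketch scans a column of values in random order and maintains a running estimate of the SUM, together with a CLT-based 95% confidence interval whose width shrinks as more rows are scanned and collapses to zero once the scan completes.

    import math
    import random

    def online_sum(values, z=1.96, report_every=1000):
        # Estimate sum(values) by scanning in random order, yielding a
        # running estimate plus a CLT-based confidence interval half-width.
        N = len(values)
        order = list(range(N))
        random.shuffle(order)          # randomized scan order, as in online aggregation
        n, total, total_sq = 0, 0.0, 0.0
        for idx in order:
            x = values[idx]
            n += 1
            total += x
            total_sq += x * x
            if n % report_every == 0 or n == N:
                mean = total / n
                var = max(total_sq / n - mean * mean, 0.0)  # variance of scanned prefix
                # Scale the sample mean up to an estimate of the full SUM.
                # The finite-population correction drives the half-width to
                # zero when the whole table has been scanned, so the final
                # "guess" is the exact answer.
                est = N * mean
                half = z * N * math.sqrt(var / n) * math.sqrt((N - n) / max(N - 1, 1))
                yield n, est, half

    # Usage: watch the 95% interval tighten, and stop once the guess is good enough.
    data = [random.gauss(100.0, 30.0) for _ in range(100000)]
    for n, est, half in online_sum(data):
        print("scanned %6d rows: %12.0f +/- %.0f" % (n, est, half))
        if half < 0.001 * abs(est):    # user-chosen stopping criterion (hypothetical)
            break

A real system such as DBO must, of course, also handle joins, grouping, and disk-resident data, where maintaining statistically valid running estimates is far harder than in this single-column sketch.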

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
1007062
Program Officer
Frank Olken
Budget Start
2009-10-01
Budget End
2013-07-31
Fiscal Year
2010
Total Cost
$722,598
Name
Rice University
City
Houston
State
TX
Country
United States
Zip Code
77005