Modern data-processing systems are designed to be general-purpose systems, in that they can handle a wide variety of applications and data. Unfortunately, this general-purpose nature causes these systems to achieve below-optimal performance for every single application and user. Rather technical compromises have to be made to support a wide range of use cases, often leading to orders-of-magnitude worse performance than what a highly customized system would be able to achieve. At the same time, developing a database system from scratch for each individual application and user is neither economical nor practical. The goal of this project is to explore how machine learning can be used to automatically customize a database system for a specific application or user to achieve so called 'instance-optimality'. If successful, this project will transform the way that modern database systems that underpin the Internet and many enterprise computing systems are built, resulting in systems with much better performance or systems that are able to process large datasets using much less hardware than current systems.

Concretely, the project investigates to what extent learned models can automatically instance-optimize the various components of a large-scale data processing system: 1) data indexing, where a model can predict the location of a key in a database; 2) algorithms, including sorting and joins, where a model can predict where in a sorted list a record should go, or where joining tuples are in another relation; 3) optimizers, where a model can predict the optimal plan to use for processing queries on data, and 4) storage layout, where a model can predict the optimal layout of data for a particular query workload. This raises a number of intellectually deep questions, including what types of models work best, what theoretical guarantees we can give about the performance of these models, how such generated systems will compare to hand-tuned systems, how such systems can exploit new hardware such as TPUs/GPUs and how program synthesis will work with such modelled data, advancing the fields of databases, machine learning, and program modeling and synthesis.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
1900933
Program Officer
Wei Ding
Project Start
Project End
Budget Start
2019-09-01
Budget End
2023-08-31
Support Year
Fiscal Year
2019
Total Cost
$1,200,000
Indirect Cost
Name
Massachusetts Institute of Technology
Department
Type
DUNS #
City
Cambridge
State
MA
Country
United States
Zip Code
02139