Two major trends distinguish modern data processing applications from those of previous decades. The first is that their data sets change continuously, whether through transactions updating the database or through new data arriving from upstream sources. The second is that they need to analyze the most recently obtained data as quickly as possible. Data has immense value as soon as it is created, but that value diminishes over time. Therefore, queries must be able to access the newest data for their results to have the most impact. The ability to ask complex questions about data as soon as it enters the database is useful in many application domains, including real-time monitoring systems (e.g., is an incoming packet from a potential attacker?) and financial services (e.g., is this new credit card purchase fraudulent?). But current systems contain architectural remnants of legacy database management systems (DBMSs) that prevent them from taking advantage of newer hardware support for parallel optimizations. This limits the types of queries that an application can execute on a DBMS against data as soon as it arrives. In turn, this adds cost to deploying a database application in terms of both hardware and administration overhead. Thus, the goal of this project is to investigate using query compilation to allow non-invasive analytical operations that are more complex than what is practical in today's DBMSs. Such query compilation techniques are beneficial to a wide array of data processing systems. The results of this study will allow organizations to deploy DBMSs that are able to handle applications with larger data sets and more complex workloads with fewer resources (e.g., hardware, personnel, energy).

Modern data-intensive applications seek to obtain new insights in real time by analyzing historical data sets alongside recently collected data. To handle such workloads, database management systems (DBMSs) need to support complex analytical queries over diverse data sets. The ever-decreasing cost of DRAM is allowing a greater number of these applications to be memory-resident. As such, in-memory DBMSs will be used for most analytical and machine learning applications in the future. But newer in-memory DBMSs still contain remnants of how legacy disk-oriented DBMSs process queries, and these remnants inhibit the kind of high-performance query execution over large data sets that this project targets. Thus, the goal of this project is to overcome this barrier through a new holistic approach to query compilation that integrates it comprehensively throughout the DBMS and that builds upon (and adapts) recent advances in "just-in-time" (JIT) compilation technology and heterogeneous hardware resources. Using compilation to optimize many different aspects of the DBMS's architecture is important for supporting future "Big Data" applications that need to ingest large amounts of new data while simultaneously executing complex analytical workloads in near real time.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1718582
Program Officer: Sylvia Spengler
Budget Start: 2017-08-01
Budget End: 2020-07-31
Fiscal Year: 2017
Total Cost: $499,774
Name: Carnegie-Mellon University
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213