Modern big data management systems support fast read and write operations based on the unique identifier (key) of a record. That is, they insert key-value pairs quickly and, given a key, quickly return the value associated with it. To do so, most such systems rely on a Log-Structured Merge Tree (LSM) structure that batches writes together before writing them to persistent storage. This project will study how to efficiently support more sophisticated operations on LSM-based storage systems, that is, operations that do not simply specify the key of a record. Examples include searching for records by their location or time. By optimizing the storage and management of big data, this project has the potential to cut storage costs and energy consumption in data centers. Further, the successful completion of this work will allow users to manage more data with existing hardware infrastructure, which is critical given the new wave of big data being generated by sensors and the Internet of Things. The project will capitalize on the student diversity at two Hispanic-Serving Institutions, and thus broaden the participation of under-represented groups in the research process.

To support richer data modeling and querying capabilities on top of LSM key-value stores, this project will develop novel LSM indexing and access algorithms to support query plans that utilize both primary and secondary LSM components. In addition, it will design and evaluate flow control policies to dampen or eliminate the notoriously bursty data ingestion behavior that LSM-based storage structures exhibit. It will also study how to automatically and dynamically change LSM compaction policies and parameters based on the query workload, as well as compaction techniques that are aware of data semantics. The project will additionally develop novel LSM-aware query optimization techniques, since most query optimizers currently treat the LSM storage layer as a black box. The planned methods will be deployed and evaluated on the open-source Apache AsterixDB system.
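To make the terminology above concrete, the following is a minimal in-memory sketch of an LSM structure with a primary index and a secondary index on time. It is an illustration under stated assumptions, not the project's design and not the Apache AsterixDB API: the names TinyLSM, Store, by_time, and memtable_limit are hypothetical, sorted runs live in memory rather than on disk, compaction simply merges all runs into one, and deletes and updates to indexed fields (which would leave stale secondary entries) are not handled.

```python
from bisect import bisect_left


class TinyLSM:
    """Toy LSM index: a mutable memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit  # hypothetical tuning knob
        self.memtable = {}                    # buffers (batches) recent writes
        self.runs = []                        # newest run first; each run is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the whole batch out as one immutable sorted run.
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Check the newest data first: memtable, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:
            i = bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def scan(self, lo, hi):
        """Return {key: value} for lo <= key < hi, with the newest version winning."""
        result = {}
        for run in reversed(self.runs):       # oldest first, so newer runs overwrite
            i = bisect_left(run, (lo,))
            while i < len(run) and run[i][0] < hi:
                result[run[i][0]] = run[i][1]
                i += 1
        for k, v in self.memtable.items():
            if lo <= k < hi:
                result[k] = v
        return dict(sorted(result.items()))

    def compact(self):
        # Simplest possible policy: merge every run into a single run.
        merged = {}
        for run in reversed(self.runs):       # oldest first, so newer versions win
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


class Store:
    """Primary LSM index keyed by record key, plus a secondary LSM index on time."""

    def __init__(self):
        self.primary = TinyLSM()
        self.by_time = TinyLSM()              # composite key: (timestamp, record key)

    def insert(self, key, timestamp, payload):
        self.primary.put(key, (timestamp, payload))
        self.by_time.put((timestamp, key), True)

    def query_time_range(self, t_lo, t_hi):
        # Plan: range-scan the secondary index, then look up each hit in the primary index.
        hits = self.by_time.scan((t_lo,), (t_hi,))
        return [(k, self.primary.get(k)) for (_, k) in hits]
```

A time-range query here touches both structures: it scans the secondary index for matching (timestamp, key) pairs and then fetches each record from the primary index by key. The choices this sketch hard-codes (when to flush, which runs to merge, which index to consult first) are the kind of decisions the paragraph above proposes to make adaptive.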

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1838222
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2019-01-01
Budget End: 2022-12-31
Support Year:
Fiscal Year: 2018
Total Cost: $1,390,073
Indirect Cost:
Name: University of California Riverside
Department:
Type:
DUNS #:
City: Riverside
State: CA
Country: United States
Zip Code: 92521