A wealth of digital information is being generated through social networks, blogs, online communities, news sources, and mobile applications, as well as a myriad of device-based sources such as smart-home devices and wearable sensors. Data analysts in a number of domains, e.g., government, public health, national security, and public safety, stand to benefit greatly from the ability to perform retrospective as well as interactive analyses over such data. The key feature of this data is that an individual item, such as a tweet or a sensor reading, is low-value by nature. Such data becomes high-value only when large quantities of it are analyzed together. This project seeks new data management techniques that enable data analysts to process large quantities of such low-value data. The key challenge is to support analytic queries efficiently and interactively on cost-effective platforms such as cheap commodity hardware, while remaining aware of the low-value nature of the data.

Support for data analytics over tabular data has been well studied, for both centralized and parallel databases. However, because memory prices now allow the high-value transactional data of a typical enterprise to fit in the memory of a high-end server, most recent work has focused on analytics for memory-resident data. In contrast, this project aims to support analytics over data arising from social, mobile, Web, and IoT sources. This data is much larger, and its items become high-value only in aggregate, so keeping it memory-resident is not cost-effective for either storage or analysis. The project has three main thrusts. The first thrust focuses on efficient storage and resource-aware query processing for large volumes of data that are nested, semi-structured, and lacking a predefined schema. The second thrust introduces a flexible join framework to handle complex join queries, including joins over spatial, temporal, and textual data, so that multiple datasets can be combined to increase their value. The third thrust focuses on efficient window query processing, since big low-value data often involves sequences of events; parallel processing of window queries is essential for big low-value data analytics to scale.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start:
Project End:
Budget Start: 2020-10-01
Budget End: 2023-09-30
Support Year:
Fiscal Year: 2019
Total Cost: $600,000
Indirect Cost:
Name: University of California Irvine
Department:
Type:
DUNS #:
City: Irvine
State: CA
Country: United States
Zip Code: 92697