An increasing number of applications require storing and accessing all historical data to support rich analytics, learning, and mining operations. This project develops a series of methods to summarize data so that it can be queried not just with respect to the full data set, as is standard, but also with respect to the state of the data set at any historical time. These summaries integrate with large temporal databases, in both offline batch-processing and online streaming application scenarios. The effectiveness of these methods will be demonstrated on an enormous scientific database of atmospheric data collected for 20 years from over 40,000 weather stations. We will work with industry collaborators to help deploy our new algorithms, and the results will be integrated into education and outreach efforts surrounding the growth of data science initiatives.
More specifically, this project extends approximate query processing and combines it with temporal big data. In particular, instead of (or on top of) using a multi-version database, this project designs and implements persistent data summaries (PDSs) that offer interactive temporal analytics with strong theoretical guarantees on their approximation quality. In addition to formalizing these models, this project develops practical PDS implementations for sampling-based summaries, data sketches, and coresets that support advanced analytical queries.
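To make the idea of a persistent summary concrete, the following is a minimal sketch of one possible sampling-based PDS: a reservoir sample that keeps the history of each slot, so the sample as of any past time can be recovered. It is an illustration under stated assumptions, not the project's implementation. The class name `PersistentReservoir`, its methods, and the requirement that timestamps arrive in non-decreasing order are all hypothetical choices made for this example.

```python
import bisect
import random


class PersistentReservoir:
    """Reservoir sample of size k that can be queried as of any past time.

    Hypothetical illustration: each slot keeps its full replacement history,
    so the reservoir's state at time t can be reconstructed later.
    """

    def __init__(self, k, seed=None):
        self.k = k
        self.n = 0  # number of items seen so far
        self.rng = random.Random(seed)
        # Per-slot history as parallel lists: when an item entered, and the item.
        self.times = [[] for _ in range(k)]
        self.items = [[] for _ in range(k)]

    def insert(self, timestamp, item):
        """Process one stream item. Assumes timestamps are non-decreasing."""
        self.n += 1
        if self.n <= self.k:
            slot = self.n - 1  # reservoir not yet full
        else:
            j = self.rng.randrange(self.n)  # uniform in [0, n)
            if j >= self.k:
                return  # item rejected, happens with probability 1 - k/n
            slot = j
        # Append rather than overwrite, so older states stay queryable.
        self.times[slot].append(timestamp)
        self.items[slot].append(item)

    def sample_at(self, t):
        """Return the reservoir contents as they were at time t."""
        out = []
        for times, items in zip(self.times, self.items):
            i = bisect.bisect_right(times, t)  # entries with timestamp <= t
            if i > 0:
                out.append(items[i - 1])  # latest occupant of this slot at t
        return out


# Usage: one reading per day, then query the sample as of day 499.
reservoir = PersistentReservoir(k=10, seed=0)
for day in range(1000):
    reservoir.insert(day, ("reading", day))
sample_day_499 = reservoir.sample_at(499)
```

Because the reservoir at any moment is a uniform sample of the items seen so far, `sample_at(t)` returns a uniform sample of the items that arrived by time `t`. The expected number of stored history entries after N insertions is about k(1 + ln(N/k)), so the summary stays far smaller than a full multi-version copy of the data.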
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.