Big data today is stored in a distributed fashion across many different machines or data sources. This poses new algorithmic and systems challenges for efficient analysis of the full data set. To address these difficulties, the PIs are building the MIDDLE (Mergeable and Interactive Distributed Data LayEr) Summarization System and deploying it on large real-world data sets. The MIDDLE system builds and maintains a special class of summaries that can be efficiently constructed and updated while still allowing fine-grained analysis of the heavy tail. Mergeable summaries can represent any data set with a guaranteed tradeoff between size and accuracy, and any two such summaries can be merged to create a new summary with the same size-accuracy tradeoff.
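As a concrete illustration of the mergeability property, the following minimal sketch uses a Misra-Gries frequency summary. This is an illustrative choice only: the abstract does not say which summaries the MIDDLE system uses, and the class name `MisraGries` and its methods are hypothetical. With at most k counters over n items, each estimated count undercounts the true count by at most n/(k+1), and merging two summaries keeps both the size limit k and the same bound relative to the combined data.

```python
from collections import Counter


class MisraGries:
    """Frequency summary with at most k counters (hypothetical helper class)."""

    def __init__(self, k):
        self.k = k                 # maximum number of counters kept
        self.n = 0                 # number of items summarized so far
        self.counters = Counter()  # item -> estimated count (never an overcount)

    def update(self, item):
        """Add one item to the summary."""
        self.n += 1
        self.counters[item] += 1
        if len(self.counters) > self.k:
            self._shrink()

    def _shrink(self):
        # Subtract the (k+1)-th largest count from every counter and drop
        # counters that fall to zero or below, leaving at most k counters.
        counts = sorted(self.counters.values(), reverse=True)
        cut = counts[self.k]
        self.counters = Counter(
            {x: c - cut for x, c in self.counters.items() if c > cut}
        )

    def merge(self, other):
        """Return a new summary of the union of both inputs, still of size <= k."""
        merged = MisraGries(self.k)
        merged.n = self.n + other.n
        merged.counters = self.counters + other.counters
        if len(merged.counters) > merged.k:
            merged._shrink()
        return merged

    def estimate(self, item):
        """Estimated count f with true - n/(k+1) <= f <= true."""
        return self.counters.get(item, 0)


# Two machines summarize their local data independently, then merge.
s1, s2 = MisraGries(k=4), MisraGries(k=4)
for x in ["a"] * 60 + [f"u{i}" for i in range(40)]:
    s1.update(x)
for x in ["a"] * 40 + ["b"] * 50 + [f"v{i}" for i in range(10)]:
    s2.update(x)
merged = s1.merge(s2)
```

Here `merged` summarizes all 200 items without revisiting the raw data, which is the property that lets summaries built on separate machines be combined.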
Interactive summaries can be quickly adapted to a specified query range of data while maintaining the same size-accuracy tradeoffs relative to the data in that range. This allows accurate, efficient analysis to zero in on small subsets of big data. The MIDDLE system enables different big data users to develop a wide spectrum of efficient and scalable data analytic tasks through the use of data summaries. The MIDDLE system is being evaluated and refined with the aid of domain experts. Because data-summary-based analytics could become a standard technique for processing big data, this research has broader impacts on the nation's government agencies, research institutes, education system, and high-tech industries. These broader impacts also extend to academia and community outreach, through the design and development of big data curriculum and education, and through the involvement of the general public in understanding and using big data through concise summaries.