Massive quantities of digital information are generated daily through social networks, blogs, online communities, news sources, and mobile applications, as well as our increasingly sensed surroundings. Tremendous insight can be gained by storing such big data and making it available for exploration in a wide variety of domains. Beneficiaries include business, social sciences, public health, national security, political science, public safety, medicine, and government policy. Researchers exploring these benefits need software to store, index, manage, and analyze big data, while researchers investigating new technical approaches for managing and analyzing big data can benefit tremendously from the availability of shared building blocks to use as a foundation for their efforts. Over the past ten years the Apache AsterixDB scalable big data management system has been developed to address this need. Apache AsterixDB provides a repository for semi-structured data that cannot be organized in tables. In contrast to most other systems in its space, it supports a user-friendly query language that is more powerful than those of traditional database systems. In contrast to big data analytics offerings, it manages data and exploits knowledge of data layouts and indexes to process queries efficiently. AsterixDB is enjoying use for teaching and research on big data platforms, semi-structured data, and social data analytics. Based on user feedback, this project will enhance AsterixDB to better meet community needs, including improved text handling, numerous query processing improvements, additional standards-based geospatial data support, user-defined functions for user-provided logic, and a variety of storage-level improvements to the system's storage efficiency, indexing, data ingestion, and integration with other systems.
The planned improvements will benefit the broader public by providing a general-purpose foundation for extracting high-value insights from high-volume, low-value big data in areas such as public safety and health. In addition to enabling computer and information science and engineering research on big data management, Apache AsterixDB will help train students nationwide in big data management and analysis; such training is crucial to addressing the information explosion due to social media, the mobile Web, and the Internet of Things (IoT).

Apache AsterixDB is a highly scalable Big Data Management System (BDMS) that stores, indexes, and manages semi-structured data, much like MongoDB, but it supports a full query language with the expressiveness of SQL and more. Unlike analytics engines such as Apache Hive or Spark, it stores and manages data, so it can use knowledge of data partitioning and index availability to avoid scanning data sets to process queries. Core features of the system include: a NoSQL-style data model based on extending JavaScript Object Notation (JSON); a declarative query language (SQL++) for semi-structured data; a query execution engine, Apache Hyracks, for partitioned-parallel query execution; partitioned data storage and indexing for efficient ingestion of new data; support for querying external data as well as data stored in AsterixDB; a rich set of data types, including spatial, temporal, and textual data; indexing via B+trees, R-trees, and inverted keyword indexes; and transactional support akin to that of other NoSQL stores. AsterixDB began in 2009 as a large research project to combine the best ideas from the parallel database world, the Apache Hadoop world, and the semi-structured data world to create a next-generation BDMS; it was accepted into the Apache Software Foundation's incubator in February 2015, and it became a top-level Apache project in April 2016. AsterixDB has enjoyed use for teaching and research on big data platforms, semi-structured data, and social data analytics. Based on user feedback, we propose to enhance AsterixDB to better meet community needs by adding: (1) Improved text handling, including multiple tokenizers, stemming, and stop words. (2) Query optimization improvements, including statistics and cost-based decisions (e.g., join methods and index selection). (3) Query processing improvements, including dynamic range partitioning, fully parallel sorts, merge joins, and skew-handling. (4) Enriched, standardized spatial data support based on GeoJSON. (5) Support for user-defined functions in more languages, especially Python. (6) Support for parameterized queries and prepared statements. (7) Support for indexes on multi-valued fields. (8) Storage efficiency improvements. (9) Data ingestion improvements. (10) Additional formats for external data sets, including Parquet from Spark/Hive.
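
The point above about using partitioning and index knowledge to avoid scans can be illustrated with a small toy model. The following Python sketch is not AsterixDB code and does not reflect its actual storage engine; the class names, the `id` and `age` fields, the modulo-style hash partitioning, and the sorted list standing in for a B+tree are all assumptions made for illustration only.

```python
from bisect import bisect_left, bisect_right, insort

NUM_PARTITIONS = 4  # hypothetical number of storage partitions


class Partition:
    """One toy storage partition with a primary store and a secondary index."""

    def __init__(self):
        self.records = {}    # primary key -> record
        self.age_index = []  # sorted (age, primary key) pairs; stands in for a B+tree

    def insert(self, record):
        self.records[record["id"]] = record
        insort(self.age_index, (record["age"], record["id"]))

    def range_by_age(self, low, high):
        # Binary search touches only index entries with low <= age <= high.
        lo = bisect_left(self.age_index, (low, float("-inf")))
        hi = bisect_right(self.age_index, (high, float("inf")))
        return [self.records[pk] for _, pk in self.age_index[lo:hi]]


class Dataset:
    """A toy dataset hash-partitioned on its primary key (assumed scheme)."""

    def __init__(self):
        self.partitions = [Partition() for _ in range(NUM_PARTITIONS)]

    def _home(self, pk):
        return self.partitions[hash(pk) % NUM_PARTITIONS]

    def insert(self, record):
        self._home(record["id"]).insert(record)

    def get(self, pk):
        # Partitioning knowledge: a key lookup consults exactly one partition.
        return self._home(pk).records.get(pk)

    def find_by_age(self, low, high):
        # Index knowledge: each partition answers from its index, no full scan.
        return [r for p in self.partitions for r in p.range_by_age(low, high)]

    def scan_by_age(self, low, high):
        # Baseline: what an engine without index knowledge must do.
        return [r for p in self.partitions for r in p.records.values()
                if low <= r["age"] <= high]
```

In this sketch `find_by_age` and `scan_by_age` return the same records, but the indexed version inspects only the matching index entries in each partition, while the scan inspects every stored record.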

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1925610
Program Officer: Wendy Nilsen
Project Start:
Project End:
Budget Start: 2019-09-01
Budget End: 2022-08-31
Support Year:
Fiscal Year: 2019
Total Cost: $1,140,000
Indirect Cost:

Name: University of California Irvine
Department:
Type:
DUNS #:
City: Irvine
State: CA
Country: United States
Zip Code: 92697