Data analytics extracts insights from massive datasets, often with the assistance of machine learning techniques. The goal of this project is to allow domain experts, including data scientists, to analyze massive datasets quickly using the most powerful supercomputing systems in the world. The problem is that state-of-the-art data processing algorithms that filter data, summarize results and combine information from different sources have inherent scalability bottlenecks. This project designs hyperscalable data processing algorithms that harness the unprecedented compute, storage and networking concurrency of a high-performance computer. This project also develops an open-source data processing engine to disseminate prototype implementations of these algorithms to the public. Another contribution is the creation of a massively parallel data processing module and associated teaching materials for undergraduate data science curricula, such as the diverse Data Analytics undergraduate major at The Ohio State University.

The confluence of extreme compute parallelism, fast networking and growing memory capacities in the modern datacenter presents an opportunity to design a hyperscalable data processing kernel for warehouse-scale computers. This project sits at the intersection of data management and high-performance computing; it develops scalable join and aggregation algorithms, topology-conscious query planning and optimization techniques, and interference-aware data access methods for shared cold storage. This is accomplished by carefully overlapping communication and computation, identifying and avoiding unscalable all-to-all communication, accounting for network path congestion and variability in remote memory access latency, and judiciously using inter-process coordination to accelerate data ingestion from a massively parallel shared file system. These research activities lay the intellectual foundation to make data analytics scalable and efficient in warehouse-scale computers.
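To illustrate why all-to-all communication is called unscalable above, the following minimal sketch counts messages for a direct exchange among all processes and for a staged exchange routed through one leader process per node. It is an illustration only, not the project's algorithm: the two-level scheme, the function names and the figure of 32 processes per node are assumptions made for this example, and the model counts messages, not bytes or network contention.

```python
def all_to_all_messages(num_procs: int) -> int:
    """Messages sent when every process sends directly to every other process."""
    return num_procs * (num_procs - 1)


def two_level_messages(num_procs: int, procs_per_node: int) -> int:
    """Messages sent when traffic is staged through one leader per node.

    Step 1: each non-leader sends its data to its node leader.
    Step 2: each leader exchanges combined data with every other leader.
    Step 3: each leader forwards the received data to its non-leaders.
    """
    if num_procs % procs_per_node != 0:
        raise ValueError("num_procs must be a multiple of procs_per_node")
    num_nodes = num_procs // procs_per_node
    gather = num_nodes * (procs_per_node - 1)
    exchange = num_nodes * (num_nodes - 1)
    scatter = num_nodes * (procs_per_node - 1)
    return gather + exchange + scatter


if __name__ == "__main__":
    # Hypothetical process counts; 32 processes per node is an assumption.
    for procs in (64, 1024, 16384):
        direct = all_to_all_messages(procs)
        staged = two_level_messages(procs, procs_per_node=32)
        print(f"{procs:>6} processes: direct={direct:>12,} two-level={staged:>10,}")
```

The direct exchange grows quadratically with the number of processes, while the staged exchange grows quadratically only with the number of nodes. The staged messages are larger, so this count shows the coordination overhead that can be avoided, not the total data moved.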

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1816577
Program Officer: Almadena Chtchelkanova
Project Start:
Project End:
Budget Start: 2018-07-01
Budget End: 2021-06-30
Support Year:
Fiscal Year: 2018
Total Cost: $460,000
Indirect Cost:
Name: Ohio State University
Department:
Type:
DUNS #:
City: Columbus
State: OH
Country: United States
Zip Code: 43210