The existing distributed graph and matrix analytics frameworks are designed with data-intensive workloads in mind, rendering them inefficient for compute-intensive applications such as graph mining and scientific computing. The goal of this project is to develop novel big data frameworks for two compute-intensive tasks, graph mining and matrix/tensor computations, respectively. The two frameworks advance the field of big data analytics by motivating future systems for compute-intensive analytics, and promoting their application in various scientific areas to improve research productivity. The two systems will be available for public use, and can serve several cross-disciplinary projects in computer forensics, computational physics, and bioinformatics. The project includes mentoring graduate students and training K-12 students through summer internships, as well as related new course materials and outreach activities to help the public learn big data technologies. Thus, the project aligns with the NSF's mission to promote the progress of science and to advance the national health and prosperity.

The graph mining system and the matrix/tensor platform share the design of (i) a tailor-made storage subsystem providing efficient and flexible data access, and (ii) a computation subsystem with fine-grained task control for data-reuse-aware task assignment and load balancing. The graph mining system, called G-thinker, aims to facilitate the writing of distributed programs which mine from a big graph those subgraphs that satisfy certain requirements. Such mining problems are useful in many applications like community detection and subgraph matching. These problems usually have a high computational complexity, and existing serial algorithms tackle these problems by backtracking in a duplication-free vertex-set numeration tree, which recursively partitions the search space. G-thinker adopts an intuitive programming interface that minimizes the effort of adapting an existing serial subgraph mining algorithm for distributed execution. The subgraphs to mine are spawned from individual vertices and they grow their frontiers as needed, and memory overflow is avoided by spilling subgraphs to disks when needed. In each machine, vertices and edges shared by multiple subgraphs need only be transmitted and cached once, which minimizes communication (and hence data waiting) so that CPU cores are better utilized. To address the load-balancing problem of power-law graphs, G-thinker explores recursive decomposition and work stealing to allow idle machines to steal subgraphs for mining from heavily-loaded machines. The project also explores a distributed matrix/tensor storage and computing framework, where matrix/tensor partitions are stored in multiple replicas using different storage schemes to efficiently support all kinds of submatrix access operations. This flexible storage scheme offers the upper-layer computations much more opportunities for fine-grained optimizations, including smarter task scheduling and in-situ updates. The use of this framework is exemplified by matrix multiplication and LU factorization. Both of the proposed frameworks can help build a cyberinfrastructure for collaborations with scientists in science, medicine, and industry.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency
National Science Foundation (NSF)
Institute
Division of Advanced CyberInfrastructure (ACI)
Type
Standard Grant (Standard)
Application #
1755464
Program Officer
Alan Sussman
Project Start
Project End
Budget Start
2018-06-01
Budget End
2022-05-31
Support Year
Fiscal Year
2017
Total Cost
$170,941
Indirect Cost
Name
University of Alabama Birmingham
Department
Type
DUNS #
City
Birmingham
State
AL
Country
United States
Zip Code
35294