This project creates a capability that will support the construction of large, data intensive scientific applications that must run on top of national cyberinfrastructure, such as large campus clusters, NSF extreme-scale computing facilities, the Open Science Grid, and commercial clouds. The new capability (DataSwarm) brings data requirements and software dependencies to the target cyberinfrastructure systems, and deploys them as and when required, rather than having these requirements pre-installed on the target systems. The motivation comes from applications in high energy physics, molecular dynamics, and quantum chemistry.

The main motivation of the work is the challenge of scalable computing frameworks. Based on a prior development by the Principal Investigator (Work Queue), the current project provides technical innovation in three areas: (1) Molecular Task Composition. Molecular task composition is used as an abstraction for the precise construction of tasks that require a custom software environment, large data input, and a scratch data area to capture the outputs. By expressing these aspects explicitly instead of implicitly, the project improves the storage efficiency of large numbers of tasks. (2) In-Situ Data Management. In-situ storage management is performed to offset the increased storage consumption likely to occur under molecular task composition, avoiding unpredictable failures of tasks due to storage exhaustion. (3) Precision Provenance. Precision provenance of both data objects and task components enables the efficient re-use of resources across multiple runs, as well as precise incremental changes to complex workflows. For this project, the three key elements addressed are the software environment, input data, and a scratch data area. These elements are usually independently managed; here, they are bound together to form temporary "molecules" for task execution. The three applications included in this project represent three typical types of complex data and complex software dependencies. They include custom late-stage data analysis codes in high energy physics, complex multidimensional optimization, and ensemble molecular dynamics, respectively.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency
National Science Foundation (NSF)
Institute
Division of Advanced CyberInfrastructure (ACI)
Type
Standard Grant (Standard)
Application #
1931348
Program Officer
Amy Walton
Project Start
Project End
Budget Start
2019-09-01
Budget End
2022-08-31
Support Year
Fiscal Year
2019
Total Cost
$562,994
Indirect Cost
Name
University of Notre Dame
Department
Type
DUNS #
City
Notre Dame
State
IN
Country
United States
Zip Code
46556