In-network computing technologies, which offload significant portions of compute, communication, and I/O tasks to the network, have emerged as fundamental requirements for achieving extreme-scale performance for end applications in High-Performance Computing (HPC) and Deep Learning (DL). Unfortunately, current-generation communication middleware and applications cannot take full advantage of these advances because appropriate designs are lacking at the middleware level. This leads to the following broad challenges: 1) Can middleware that is "aware" of the computing capabilities of these emerging in-network computing technologies be designed in the most optimized manner possible for HPC and DL applications? and 2) Can such middleware be used to benefit end applications in HPC and DL to achieve better performance and portability? A synergistic and comprehensive research plan is proposed to address these challenges with innovative solutions. The proposed framework will be made available to collaborators and the broader scientific community to understand the impact of the proposed innovations on next-generation HPC and DL middleware and applications. Several graduate and undergraduate students will be trained under this project as future scientists and engineers in HPC. The proposed work will enable curriculum advancements via research in pedagogy for key courses at The Ohio State University. Tutorials and workshops will be organized at various conferences to share the research results and experience with the community. The project is aligned with the National Strategic Computing Initiative (NSCI) to advance US leadership in HPC and the recent initiative of the US Government to maintain leadership in Artificial Intelligence (AI).

The proposed innovations include: 1) Designing scalable communication primitives (point-to-point and collectives) that use emerging switch- and NIC-based in-network computing features, 2) Exploiting in-network computing features to offload complex and user-defined functions, 3) Designing high-performance I/O and storage subsystems using NVMe over Fabrics, 4) Designing enhanced in-network datatype processing schemes for the MPI library, 5) Designing and optimizing in-network computing-based solutions for emerging cloud environments, and 6) Carrying out integrated development and evaluation of the proposed designs with a set of representative HPC and DL applications. The proposed designs will be integrated into the widely used MVAPICH2 library and made available to the public. The project team members will work closely with collaborators to facilitate wide deployment and adoption of the released software. The transformative impact of the proposed research is to achieve scalability, performance, and portability for HPC and DL frameworks and applications by leveraging emerging in-network computing technologies.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 2007991
Program Officer: Robert Beverly
Project Start:
Project End:
Budget Start: 2020-07-01
Budget End: 2023-06-30
Support Year:
Fiscal Year: 2020
Total Cost: $500,000
Indirect Cost:

Name: Ohio State University
Department:
Type:
DUNS #:
City: Columbus
State: OH
Country: United States
Zip Code: 43210