New computing applications are emerging in smart networks, scientific exploration, business management, security, and healthcare. These applications depend on very large amounts of data, which must be processed quickly and efficiently. Large supercomputers are increasingly used to analyze such data. The techniques they use are referred to as deep learning (DL) high-performance computing (HPC). Researchers are using DL HPC to make sense of this flood of data and extract useful information. To do this, they must redesign HPC systems. A key challenge is how to use resources such as data storage and computer memory at a huge scale. This project will build Metis, a high-performance data storage system that uses a new, end-to-end, hardware-supported memory and storage design to meet the needs of DL HPC applications. The goal is to meet the challenge of increasing data management performance for next-generation supercomputers. The project will connect several different computing communities and increase interactions among them. The project includes educational and engagement activities that will increase the community's understanding of HPC systems. These include broadening-participation activities to attract and retain new students, with special emphasis on students from underrepresented groups. The project will encourage student interest in the research and design of large-scale computing systems.

This project brings together researchers in micro-architecture, distributed computing systems (namely cloud and HPC systems), storage systems, and power/energy modeling to boost DL HPC data processing performance. The research will yield a fundamentally new software-hardware co-designed memory compression technique that transparently compresses the memory of DL applications with negligible runtime performance overhead. Metis will leverage this novel compression substrate to enable a distributed, intelligent, operating-system-level data cache that effectively exploits the physical memory freed by program-memory compression. The developed techniques will open doors for innovative HPC and scientific applications, in a broad range of disciplines, that were not previously possible. Metis' focus on addressing the challenges of increasing performance in the Exascale era, along with its engagement of researchers from multiple areas, aligns it well with the goals and objectives of the SPX program. The research will also create new knowledge on the design principles of memory compression and yield insights for seamlessly integrating DL applications into next-generation, DL-aware supercomputer infrastructure.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start:
Project End:
Budget Start: 2019-10-01
Budget End: 2023-09-30
Support Year:
Fiscal Year: 2019
Total Cost: $952,884
Indirect Cost:
City: Blacksburg
State: VA
Country: United States
Zip Code: 24061