Machine Learning (ML), particularly Deep Learning (DL), has gained great success in recent years, especially with the use of Deep Neural Networks (DNNs) of different types. Varied DNNs serve as the state-of-the-art foundation and core enabler of many key applications, such as robotics, high-quality video stream processing, augmented reality, wearable devices, smart health devices, etc. Achieving high accuracy typically requires DNNs with large and complex model structures, which also translates into high computing requirements for both training and inference steps. Accelerating the training process on a modern High-Performance Computing (HPC) node or cluster and inference process on a lower-end power-efficient device have both emerged as major challenges. This project focuses on this problem, viewing DNN training and inference as HPC workloads that need to exploit available multi-level parallelism, complex memory hierarchy, and device heterogeneity; while automating the optimizations through a compiler. If this project succeeds, it will, for the first time, enable real-time machine learning for many edge devices, enabling the greater success of ML-based end applications that are important for the society, economy, and other science and engineering areas. This project will also make several contributions towards both education and improving diversity, including: (1) introducing HPC in an ML course, and ML workloads optimization experience in both undergraduate systems and graduate research courses, particularly with interesting demonstration videos; (2) outreaching to undergraduates with the goal of creating interest in (systems) research, and to K-12 with the goal of attracting underrepresented groups to computer science.
The key idea of this project to address the above challenge is sparsification-compilation co-design. It first introduces a general sparsification idea called fine-grained structured pruning, which prunes the weights according to certain fine-grained structures and preserves non-zero weights in a more regular way. Based on this idea, this project designs a high-level abstraction called layer-wise intermediate representation (IR) to capture the sparsity information with the goal of enabling aggressive compiler optimizations. Building on a successful application of this idea on two-dimensional DNNs, this project undertakes a comprehensive agenda to fully apply the benefits of this approach. First, it unifies Convolutional Neural Networks and Recurrent Neural Networks acceleration with a more general fine-grained structured pruning instance and a set of enhanced compiler-based automatic optimizations. Second, it improves the pruning or retraining process itself by extending the compiler optimizations from inference to pruning and exploiting domain properties to carry-out optimized application-level checkpointing. Third, it extends the (compiler automated) optimization framework to support high-dimensional and extremely deep DNNs. Finally, it explores data reuse across DNNs for situations where multiple DNNs are co-executed on the same device.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.