Neural networks have become the go-to tool for solving many real-world recognition and classification problems in computer vision, language processing, the life sciences, and finance. While promising, intelligent data interpretation via deep learning is extremely power-hungry. To conduct power-efficient deep learning on battery-constrained edge platforms, one promising solution is to use hardware accelerators built with emerging non-volatile memory (NVM) devices, which offer high density, extremely low power consumption, and in-situ, parallelized data processing. While these advances are enticing, NVM devices also impose extra challenges, as their design and manufacturing technology are far less mature than those of CMOS. Furthermore, NVM technologies are likely to exhibit new types of errors, such as read/write disturbance, value drift over time, and short data retention times. These errors can accumulate while the accelerator is running a deep learning application and, without careful mitigation, could lead to significant accuracy degradation. To assuage these concerns, this project will develop a self-healing framework for NVM-based neural network accelerators that integrates a test, diagnosis, and recovery loop to monitor and maintain the health of the accelerator. Results of this project will (1) deepen the understanding of the interactions among hardware defects and errors, NVM-based accelerators, and machine learning, (2) increase community awareness of post-fabrication error debugging and fixing techniques, (3) enrich the computer engineering course curriculum, and (4) train and mentor students of diverse backgrounds for both the workforce and research careers.
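
As a concrete illustration of such a test, diagnosis, and recovery loop, the minimal Python sketch below models the accelerator as a weight matrix whose stored values drift over time, probes it with fixed test inputs, flags cells that have drifted too far, and refreshes them. The error model, thresholds, and function names are illustrative assumptions only and do not describe the framework developed in this project.

```python
# Toy illustration of a test -> diagnose -> recover loop (assumed drift model and
# thresholds; not the project's actual framework).
import numpy as np

rng = np.random.default_rng(1)
golden = rng.normal(size=(64, 10))      # reference (error-free) weight values
stored = golden.copy()                  # weights as held in the (noisy) NVM array
probe = rng.normal(size=(32, 64))       # fixed probe inputs acting as lightweight test vectors
golden_out = probe @ golden             # expected accelerator response to the probes

def test_step():
    """Test: measure worst-case deviation of the accelerator output on the probes."""
    return np.abs(probe @ stored - golden_out).max()

def diagnose_step(tolerance=0.2):
    """Diagnosis: flag individual cells whose stored value drifted beyond a tolerance."""
    return np.abs(stored - golden) > tolerance

def recover_step(faulty):
    """Recovery: rewrite (refresh) only the cells flagged as faulty."""
    stored[faulty] = golden[faulty]

for epoch in range(6):
    stored += rng.normal(scale=0.05, size=stored.shape)  # NVM errors accumulating over time
    if test_step() > 2.0:                                 # health check failed
        recover_step(diagnose_step())
    print(f"epoch {epoch}: max probe error {test_step():.2f}")
```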

This project will investigate, characterize, and mitigate errors that will affect the adoption of NVM-based neural network accelerators. While existing solutions focus on fixing errors observed at fabrication time, this project targets the NVM-specific errors that occur over the lifetime of the accelerator, not just at the time of manufacturing. The project will lead to four outcomes, namely, (1) measurement and characterization of the error resilience of neural networks with different topologies and data types, (2) cost-effective approaches for deploying neural networks onto NVM-based accelerators that exhibit new and diverse error patterns, without resorting to costly retraining, (3) methods for generating neural network inputs as test vectors, tuned to be sensitive to different levels of error accumulation and accuracy loss, that provide real-time accelerator health statistics, and (4) an algorithm- and device-level co-diagnosis procedure that identifies and protects the most critical and vulnerable components of the neural network and the accelerator.
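
The error-resilience characterization in outcome (1) can be pictured with a small fault-injection experiment: the Python sketch below quantizes the readout weights of a toy two-layer network to int8, flips stored bits at increasing rates as a simplified stand-in for NVM errors, and reports the resulting accuracy. The task, network size, and error model are assumptions made purely for illustration and are not the project's methodology.

```python
# Toy fault-injection experiment for error-resilience characterization (synthetic
# task, network size, and bit-flip error model are all illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class task: labels given by the sign of a fixed random projection.
X = rng.normal(size=(1000, 16)).astype(np.float32)
y = (X @ rng.normal(size=16) > 0).astype(np.int64)

# Tiny "pretrained" network: random hidden layer plus a least-squares readout.
W1 = rng.normal(scale=0.5, size=(16, 32)).astype(np.float32)
H = np.tanh(X @ W1)
W2, *_ = np.linalg.lstsq(H, np.eye(2)[y], rcond=None)

def quantize_int8(w):
    """Quantize weights to int8, as they might be stored in an NVM crossbar."""
    scale = np.abs(w).max() / 127.0
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

def flip_bits(q, rate, rng):
    """Flip each stored bit independently with probability `rate` (toy NVM error model)."""
    bits = np.unpackbits(q.view(np.uint8))
    mask = (rng.random(bits.shape) < rate).astype(np.uint8)
    return np.packbits(bits ^ mask).view(np.int8).reshape(q.shape)

def accuracy(readout):
    return (np.argmax(H @ readout, axis=1) == y).mean()

q2, s2 = quantize_int8(W2)
for rate in [0.0, 1e-4, 1e-3, 1e-2]:
    faulty = flip_bits(q2, rate, rng).astype(np.float32) * s2
    print(f"bit-error rate {rate:.0e}: accuracy {accuracy(faulty):.3f}")
```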

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start:
Project End:
Budget Start: 2019-10-01
Budget End: 2020-02-29
Support Year:
Fiscal Year: 2019
Total Cost: $235,000
Indirect Cost:
Name: Florida International University
Department:
Type:
DUNS #:
City: Miami
State: FL
Country: United States
Zip Code: 33199