The goal of this project is to fundamentally reinvent the design of the system, from hardware to application, using fast, novel inflight analytics to control and optimize large-scale heterogeneous computer systems to meet the performance and resiliency requirements of emerging applications such as data mining, artificial intelligence, and individualized medicine. Toward that goal, advanced machine-learning (ML) methods along with domain knowledge will be employed to support real-time system-state estimation and decision-making, including resource management, congestion/failure detection and mitigation, preemptive intrusion detection, and configuration management. Innovations across the system stack will be needed to achieve optimal results by taking full advantage of contextual information collected from multiple layers of the system and adapting rapidly to the deployment environment, workloads, and application requirements. ML-driven inflight analytics methods, developed in this effort, will be demonstrated on a heterogeneous “rack-scale” computing system, with the ultimate objective of scaling up the framework to a warehouse-scale computing system.
The project will be organized around the following research activities. (i) Work with noisy and incomplete telemetry data (e.g., hardware telemetry, OS-level logs, and application-level traces) available from monitors across the system stack to perform system-state estimation (e.g., resource utilization). Telemetry data are often noisy and inconsistent in terms of semantics, modalities, and time granularities, making systems only partially observable. Bayesian deep-learning models will be developed to accurately capture system states and cope with data noise and incompleteness. (ii) Design models and algorithms for practical inflight analytics that make decisions (e.g., on scheduling or failure mitigation) based on the estimated system state to enhance system performance, reliability, and security. Such a framework will consist of an ensemble of interdependent ML models based on partially observable Markov decision processes (POMDPs) augmented with domain knowledge (e.g., interconnect topology) and trained in real time. (iii) Synthesize hardware accelerators for fast, low-cost inflight analytics. Toward that end, a compiler and a runtime framework will be developed that take high-level declarative probabilistic programs (i.e., the POMDPs), automatically compile them onto accelerators, and plan their execution across heterogeneous hardware (FPGAs, ASICs, and CPUs/GPUs). (iv) Assess the trustworthiness of inflight analytics. For that, a trust-assessment framework will be created to evaluate resiliency to failures and attacks due to residual imperfections of heterogeneous components, input uncertainty, and the use of stochastic ML algorithms. While in the planning stage, this project will focus on the design of inflight analytics in the context of rack-scale systems. The methods and algorithms developed will be useful in helping smaller-scale sites with limited resources manage their systems more efficiently.
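As a purely illustrative aid, the idea of estimating a partially observable system state from noisy telemetry can be sketched with a discrete Bayes filter, the belief-update step that underlies a POMDP. This is a minimal sketch, not the project's method: the two states, the observation labels, every probability, and the function names below are hypothetical assumptions chosen for the example. Actions are omitted for brevity, and a simple lookup table stands in for the Bayesian deep-learning models described above.

```python
# Illustrative only: states, observations, and probabilities are assumptions,
# not values taken from the project.
STATES = ("healthy", "congested")

# TRANSITION[s][s2] = P(next state is s2 | current state is s)
TRANSITION = {
    "healthy": {"healthy": 0.9, "congested": 0.1},
    "congested": {"healthy": 0.3, "congested": 0.7},
}

# OBSERVATION[s][o] = P(telemetry reading o | true state s); readings are noisy.
OBSERVATION = {
    "healthy": {"low_latency": 0.8, "high_latency": 0.2},
    "congested": {"low_latency": 0.25, "high_latency": 0.75},
}


def update_belief(belief, observation):
    """One Bayes-filter step: predict with the transition model, then
    reweight by the likelihood of the new observation and normalize."""
    predicted = {
        s2: sum(belief[s] * TRANSITION[s][s2] for s in STATES)
        for s2 in STATES
    }
    unnormalized = {
        s: OBSERVATION[s][observation] * predicted[s] for s in STATES
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s: p / total for s, p in unnormalized.items()}


def estimate_state(observations, prior=None):
    """Fold a sequence of telemetry readings into a belief over STATES.
    Starts from a uniform prior unless one is supplied."""
    if prior is None:
        belief = {s: 1.0 / len(STATES) for s in STATES}
    else:
        belief = dict(prior)
    for obs in observations:
        belief = update_belief(belief, obs)
    return belief


if __name__ == "__main__":
    readings = ["high_latency", "high_latency", "low_latency", "high_latency"]
    print(estimate_state(readings))
```

A decision layer (e.g., a scheduler) would then act on the returned belief rather than on any single noisy reading, which is the sense in which decisions are "based on the estimated system state."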
Students involved in this project will have a rare opportunity to participate in the design of heterogeneous ML-driven systems with broad applicability. The integration of ML methods and algorithms into real systems can be attractive to a diverse range of individuals, including underrepresented minority students. The goal is to raise awareness of scientific and engineering challenges in the design and deployment of next-generation computing systems to support emerging applications.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.