CSR: Small: Monitoring for Error Detection in Today's High Throughput Applications

Abstract: Much of our critical infrastructure is formed by distributed systems with real-time requirements. Downtime of a system providing critical services in power systems, air traffic control, banking, or railway signaling could be catastrophic. Errors may come from individual software components, from interactions between multiple components, or from misconfiguration of these components. It is therefore imperative to build low-latency detection systems that can trigger the subsequent diagnosis and recovery phases, leading to systems that are robust to failures. A powerful approach for error detection is the stateful approach, in which the detection system builds up state related to the application by aggregating multiple messages. Detection rules are then based on this state, that is, on aggregated rather than instantaneous information. Though the merits of stateful detection are well accepted, it is difficult to scale stateful detection to a growing number of application components or a growing data rate, because of the processing load of tracking application state and matching rules against that state. In this project, we address this issue by designing a runtime monitoring system focused on high-throughput distributed applications. Our solution is based on intelligent sampling, probabilistic reasoning about the application state, and opportunistic monitoring of the heavy-duty rules. A successful solution will allow reliable operation of high-bandwidth distributed applications and those with a large number of consumers. We will also achieve broader impact through an innovative service-learning program at Purdue called EPICS and a new course.
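As a concrete illustration of the stateful approach (a minimal sketch, not the system proposed here; the class, sampling scheme, and threshold are all hypothetical), the following Python snippet aggregates sampled per-message latencies into a sliding window of state and applies its rule to the aggregate rather than to any single message:

    # A minimal sketch of stateful error detection; names, the sampling
    # scheme, and the threshold are hypothetical, not the proposed system.
    import random
    from collections import deque

    class StatefulDetector:
        def __init__(self, window=100, sample_rate=0.1, threshold_ms=250.0):
            self.state = deque(maxlen=window)  # aggregated application state
            self.sample_rate = sample_rate     # stand-in for intelligent sampling
            self.threshold_ms = threshold_ms   # rule defined over the aggregate

        def observe(self, latency_ms):
            # Sample messages so per-message processing cost stays bounded.
            if random.random() > self.sample_rate:
                return False
            self.state.append(latency_ms)
            if len(self.state) < self.state.maxlen:
                return False                   # not enough state built up yet
            mean = sum(self.state) / len(self.state)
            return mean > self.threshold_ms    # rule on aggregated information

    detector = StatefulDetector()
    normal = [random.gauss(200, 30) for _ in range(5000)]
    degraded = [random.gauss(400, 30) for _ in range(5000)]  # injected slowdown
    for i, latency_ms in enumerate(normal + degraded):
        if detector.observe(latency_ms):
            print(f"stateful rule fired at message {i}: sustained high latency")
            break

The sampling step is what keeps the per-message cost bounded as the data rate grows; the trade-off is added detection latency, which the aggregation window also influences.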

Project Report

Our society increasingly depends on large-scale distributed applications for critical operations, ranging from air traffic control and online financial processing to prediction of impending natural disasters and keeping educational computer labs running with little to no downtime. To reduce the downtime of this gamut of applications, it is desirable to automatically pinpoint the root cause of a failure and, whenever possible, to predict impending failures from symptoms observed in the system. The first objective lets either an automated system or a human operator quickly diagnose the problem and initiate a mitigation action. The second prevents the end user from ever being affected by the fault.

Faults can be diverse in their origin: hardware problems (e.g., a network fiber being disconnected), software problems (e.g., an inability of the software to handle a large number of concurrent users), or configuration problems (e.g., the software is configured only for high-speed links, while in practice it also experiences slow links). Problem localization and prediction must therefore account for this diversity. The granularity of localization also varies; it may be a single compute node (in a large-scale parallel application), an executing process, or a region of code.

In this project, we designed mechanisms to perform problem localization and failure prediction by monitoring and analyzing a wide variety of metrics from all layers of the system stack, for example, CPU utilization (system layer), frequency of garbage collection (middleware layer), and number of active threads (application layer). The intuition is that when the system is performing normally, the metrics exhibit some pattern, either individually or in groups. An example of the former is that the rate of I/O stays between certain thresholds; an example of the latter is that the rate of I/O is correlated with the rate of user requests. Our project's novel contribution was to develop techniques that automatically "learn" legitimate patterns among groups of metrics (the latter kind), monitor those patterns during execution, and flag any significant deviation from them. Typically there are multiple learned patterns, because a system can behave in one of several different manners depending on the workload it executes. From the deviation, our system identifies, in a probabilistic manner, the likely root cause of the problem. Our system, called Orion, uses these steps to find the abnormal window of time, the abnormal metrics, and the abnormal code regions where a fault manifests.

We evaluated Orion on two classes of distributed applications. Commercial applications: (i) client-server multi-tier applications in which the presentation, application processing, and data management are logically separate processes, as in the Java Enterprise Edition (Java EE) standard; (ii) applications built on the MapReduce programming model for processing large data sets, typically run as distributed computations on clusters of machines. High Performance Computing (HPC) applications: scientific and engineering applications that run on large clusters of machines, with parallel tasks that communicate through the Message Passing Interface (MPI).
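The pattern-learning step can be illustrated with a small sketch. The snippet below is a simplified stand-in for Orion's actual algorithm, not the tool itself: it takes pairwise Pearson correlation over sliding windows as the learned "pattern," builds a baseline from a fault-free run, and flags windows whose correlations drift beyond a tolerance; the function names, window size, and tolerance are all hypothetical.

    # An illustrative sketch of correlation-pattern learning (a simplification;
    # Orion's real statistics and search differ). Metrics are array columns.
    import numpy as np

    def learn_baseline(normal_run, window=60):
        # Mean pairwise correlation matrix over sliding windows of a
        # fault-free run: the learned "legitimate pattern".
        corrs = [np.corrcoef(normal_run[t:t + window].T)
                 for t in range(0, len(normal_run) - window + 1, window)]
        return np.mean(corrs, axis=0)

    def flag_anomalies(run, baseline, window=60, tol=0.4):
        # Yield (window_start, metric_pair) where a correlation drifts
        # from the baseline by more than the tolerance.
        for t in range(0, len(run) - window + 1, window):
            drift = np.abs(np.corrcoef(run[t:t + window].T) - baseline)
            i, j = np.unravel_index(np.argmax(drift), drift.shape)
            if drift[i, j] > tol:
                yield t, (i, j)  # abnormal window and most-deviant metric pair

    rng = np.random.default_rng(0)
    io = rng.normal(100, 5, 600)
    reqs = 0.5 * io + rng.normal(0, 1, 600)        # correlated with I/O rate
    baseline = learn_baseline(np.column_stack([io, reqs]))
    faulty = np.column_stack([io, reqs])
    faulty[300:, 1] = rng.normal(50, 5, 300)       # fault breaks the correlation
    for t, pair in flag_anomalies(faulty, baseline):
        print(f"abnormal window at t={t}, deviant metric pair {pair}")

In a deployment, the flagged window and deviant metric pair would then seed the probabilistic root-cause identification step described above.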
Intellectual Merit

The intellectual merit of the project derives from the machine learning-based algorithms used to correlate multiple metrics and determine normal and anomalous patterns, from designing these algorithms to scale to large numbers of machines and large data sets, and from making them operate in near real time so that end-user-visible failures can be avoided entirely or their duration reduced. For the parallel computing domain, we introduced a way of modeling process behavior statistically using semi-Markov models, which are analyzed with scalable clustering and nearest-neighbor methods to pinpoint the location of errors. Our tool, called AutomaDeD, runs at the scale of the largest supercomputing clusters, localizing problems online.

Broader Impact

The software and datasets from the project have been publicly released on GitHub. The fork of the project applied to HPC applications is used within Lawrence Livermore National Laboratory (LLNL) for diagnosing problems in their large-scale clusters and is being released as source code officially maintained by LLNL. The PI gave tutorials on the topic at IBM in Austin on two occasions. Four graduate students and two undergraduate students were involved in the design and experimentation. Among our collaborators, most significantly, Purdue's central IT organization, ITaP, adopted some of our techniques for its operations. We published the results of the research at top conferences in two domains: dependability (DSN, SRDS, Middleware, ISSRE) and supercomputing (Supercomputing, PACT).
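To make the semi-Markov modeling idea described under Intellectual Merit concrete, below is a heavily simplified, hypothetical sketch, not AutomaDeD itself: each process is summarized by its state-transition probabilities and mean dwell time per state (a semi-Markov-style profile), and the process whose profile is farthest from its nearest neighbor is flagged; the states, traces, and distance measure are all illustrative assumptions.

    # A heavily simplified, hypothetical sketch of the AutomaDeD idea: profile
    # each process with a semi-Markov-style model (transition probabilities
    # plus mean dwell time per state) and flag the nearest-neighbor outlier.
    import numpy as np

    STATES = ["compute", "mpi_send", "mpi_recv"]  # illustrative code-region states

    def build_profile(trace):
        # trace: list of (state_index, seconds_in_state) events for one process.
        n = len(STATES)
        trans = np.zeros((n, n)); dwell = np.zeros(n); visits = np.zeros(n)
        for (s, d), (s_next, _) in zip(trace, trace[1:]):
            trans[s, s_next] += 1
            dwell[s] += d
            visits[s] += 1
        trans /= np.maximum(trans.sum(axis=1, keepdims=True), 1)  # row-normalize
        dwell /= np.maximum(visits, 1)                            # mean dwell time
        return np.concatenate([trans.ravel(), dwell])

    def suspect_rank(profiles):
        # In an SPMD job most processes behave alike, so the process whose
        # profile is farthest from its nearest neighbor is the prime suspect.
        X = np.array(profiles)
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)
        return int(np.argmax(dist.min(axis=1)))

    healthy = [(0, 5.0), (1, 0.2), (2, 0.3)] * 50
    hung = [(0, 5.0), (1, 0.2), (2, 60.0)] * 50   # stuck waiting in mpi_recv
    profiles = [build_profile(healthy) for _ in range(7)] + [build_profile(hung)]
    print("suspect MPI rank:", suspect_rank(profiles))  # -> 7

The nearest-neighbor criterion reflects the intuition that in a parallel job most processes execute the same code, so the odd one out marks the likely fault location.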

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 0916337
Program Officer: M. Mimi McClure
Budget Start: 2009-09-01
Budget End: 2013-08-31
Fiscal Year: 2009
Total Cost: $275,000
Name: Purdue University
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907