This project addresses programming challenges posed by the new trend in multicore computing.

Multithreaded programs are difficult to write, test, and debug. They often contain numerous insidious concurrency errors, including data races, atomicity violations, and order violations, which we broadly define to be races. A good deal of prior research has focused on race detection. However, little progress has been made to help developers fix races because existing systems for fixing races work only with a small, fixed set of race patterns and, for the most part, do not work with simple order violations, a common type of concurrency errors.

The research objective of this project, LOOM: a Language and System for Bypassing and Diagnosing Concurrency Errors, is to create effective systems and technologies to help developers fix races. A preliminary study revealed a key challenge yet to be addressed on fixing races that is, how to help developers immediately protect deployed programs from known races. Even with the correct diagnosis of a race, fixing this race in a deployed program is complicated and time consuming. This delay leaves large vulnerability windows potentially compromising reliability and security.

To address these challenges, the LOOM project is creating an intuitive, expressive synchronization language and a system called LOOM for bypassing races in live programs. The language enables developers to write declarative, succinct execution filters to describe their synchronization intents on code. To fix races, LOOM installs these filters in live programs for immediate protection against races, until a software update is available and the program can be restarted.

The greatest impact of this project will be a new, effective language and system and novel technologies to improve the reliability of multithreaded program, benefiting business, government, and individuals.

Project Report

Two trends make multithreaded programs increasingly critical. The first is a hardware trend: the rise of multicore. For years, single-threaded code enjoyed automatic speedup as computer architects steadilymade single-coremultiprocessors faster. Recently, however, this "free lunch is over": power and wire-delay constraints have forced microprocessors into multicore designs, and adding more cores does not automatically speed up single-threaded code. The second is a software trend: the coming storm of cloud computing. More and more users get online, requesting ever richer and more powerful —and thus computation intensive— services. This massive computation is increasingly shifted from the "wimpy," battery-backed devices into the servers in the cloud. An extreme example is Google’s Chrome OS, which provides only a web-browser interface to users and outsources almost all computation to the cloud. To cope with this massive computation, virtually all services today employ threads to increase performance, and likely they will in the foreseeable future. Unfortunately, despite our increasing reliance on multithreaded programs, they remain difficult to write, test, and debug, much more difficult than sequential programs. This difficulty leads to numerous insidious concurrency errors in many widespread multithreaded programs. These errors include data races, atomicity violations, and order violations, which we broadly define to be races. These races may cause "mere" application crashes, or worse, silent data corruptions, or even worse, deaths and the Northeast blackout in 2003. Multithreaded programs are the most widespread parallel programs, yetmany luminaries in computing consider parallel programming one of the top challenges facing computer science. Quoting John Hennessy: "when we start talking about parallelism and ease of use of truly parallel computers, we’re talking about a problem that’s as hard as any that computer science has faced." To improve the reliability of multithreaded programs, a plethora of prior research has focused on race detection. However, detection of a race is only half of the battle. Developers must still diagnose the race reports to distill harmful races from false or benign ones. Moreover, developers must still fix the races. When it comes to race fix and diagnosis, developers are the bona fide solution. Unfortunately, little has been done to help developers fix and diagnose races. Existing race fix and diagnosis systems aim for full automation, an extremely challenging problem. To simplify the problem, existing systems unsurprisingly have to restrict themselves to a small, fixed set of race patterns, which cannot be extended by the rich domain knowledge of developers. Perhaps the most "outrageous" example is that, these tools do not work with order violations as simple as "statement e1 must happen before statement e2 but the code fails to enforce so," because we do not yet have good automated detectors for these order violations. The research objective of this proposal is to create effective systems and technologies to help developers fix and diagnose races. In our preliminary work, we have studied the fix and diagnosis process of real races and identified two key challenges yet to be addressed. First, how to help developers immediately protect deployed programs from known races? Even with the correct diagnosis of a race, fixing this race in a deployed program can still take long. This delay leaves large vulnerability windows, compromising reliability and potentially security. Second, how to help developers create the thread interleavings they want to test? When diagnosing a race or validating a "race fix," developers often knows what thread interleavings may trigger the race. Unfortunately, they have no effective ways of forcing these interleavings, and often resort to strenuous, slow, and error-prone manual methods. To fully achieve our goal, we have gone from "soup to nuts" in building a prototype system and evaluating it on real programs such as MySQL and Apache. Our research contributions will include (1) a thorough study on how developers fix and diagnose races, and working exploits to demonstrate the security risk of races; (2) the first language for developers to express their synchronization intents on code in declarative, succinct ways; (3) the first "live-workaround" system designed to quickly, safely, and efficiently bypass races in live programs; and (4) the first practical tool for checking errors in alias analysis which static race detectors depend upon. Our results show that the techniques we developed significantly improve software reliability and availability, thus benefiting every business, government, and individual.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Network Systems (CNS)
Type
Standard Grant (Standard)
Application #
1117805
Program Officer
Anita J. LaSalle
Project Start
Project End
Budget Start
2011-09-01
Budget End
2014-08-31
Support Year
Fiscal Year
2011
Total Cost
$266,000
Indirect Cost
Name
Columbia University
Department
Type
DUNS #
City
New York
State
NY
Country
United States
Zip Code
10027