As society increasingly relies on distributed applications, it is ever more important that these applications tolerate partial failures. However, many users and developers are unwilling to bear the resource and performance costs that building reliable applications incurs today. This research seeks to resolve the tension between the need for fault-tolerance and the costs traditionally associated with it. A new low-overhead approach to fault-tolerance, called lightweight fault-tolerance (LiFT), is investigated. The ideas at the core of LiFT are (1) to determine the information necessary to reproduce, during recovery, each non-deterministic event in a process execution, and (2) to guarantee that this information is available during recovery by efficiently replicating it in the volatile memory of a sufficient number of processes. LiFT has two main goals: (1) the development of low-overhead fault-tolerance techniques that encompass applications in which communication is through message-passing, shared memory, or any combination of the two; and (2) the development of recovery protocols that simultaneously guarantee good performance during failure-free executions, fast recovery, and fault-containment.