Project Summary
The characteristics of distributed applications that
need fault-tolerance are changing. In the past, fault-tolerance
was an important requirement of mission-critical applications with the
primary concerns being continuous availability and the
ability to tolerate arbitrary failures; the associated costs and
the overhead imposed by fault-tolerance techniques were secondary
concerns. In contrast, most emerging distributed applications are
not necessarily mission-critical and call for fault-tolerance techniques
that (1) impose minimal overhead on failure-free execution, (2) provide
fast crash recovery from common-case failure scenarios, (3) use
few dedicated resources, and (4) can be transparently integrated
with applications. To meet these requirements, the goal of
the lightweight fault-tolerance (LiFT) project is to develop a new
framework using rollback-based recovery protocols (such as message-logging
and checkpointing) for applications in which processes communicate
by messages, files, or a combination of the two.
Key Results:
Team Members: Phoebe Weidman, Ravishankar Chamarajnagar, Jeff Napper,
Stefano Masini, Lorenzo Alvisi, Harrick M. Vin (Alumnus: Sriram S. Rao)
See Also: Lorenzo Alvisi's research web page on
Lightweight Fault-Tolerance