Extensible Toolkit for Low-Overhead Fault-Tolerance
We have designed and implemented Egida, an object-oriented
toolkit that supports transparent
rollback recovery. Egida exports a simple specification language that can be
used to express arbitrary rollback recovery protocols.
From this specification, Egida automatically synthesizes an implementation of
the specified protocol by gluing together the
appropriate objects from an available library of "building blocks".
Egida is extensible and facilitates rapid implementation of
rollback recovery protocols with minimal programming effort. We have integrated
Egida with the MPICH implementation of the MPI standard. Existing MPI
applications can take advantage of Egida without any modifications:
fault tolerance is achieved transparently, and all that is needed is a simple
re-link of the MPI application with Egida.
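
The idea of synthesizing a protocol from a specification can be illustrated with a small sketch. This is not Egida's actual specification language or API, neither of which is shown here. The event names (`receive`, `timer`), the block names (`log_message`, `take_checkpoint`), and the `synthesize` function are all hypothetical, chosen only to show how a declarative specification might be mapped onto a library of building blocks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical building blocks. Each one reacts to an event by updating
# the protocol state. None of these names come from Egida itself.
def log_message(state: dict, payload: str) -> None:
    """Record a delivered message so it could be replayed after a failure."""
    state["log"].append(payload)

def take_checkpoint(state: dict, payload: str) -> None:
    """Save a snapshot of the current log as a stand-in for process state."""
    state["checkpoints"].append(list(state["log"]))

# The "available library" of building blocks, indexed by name.
LIBRARY: Dict[str, Callable[[dict, str], None]] = {
    "log_message": log_message,
    "take_checkpoint": take_checkpoint,
}

@dataclass
class Protocol:
    """A synthesized protocol: for each event, an ordered list of blocks."""
    handlers: Dict[str, List[Callable[[dict, str], None]]]
    state: dict = field(default_factory=lambda: {"log": [], "checkpoints": []})

    def on_event(self, event: str, payload: str = "") -> None:
        # Run every block bound to this event; unknown events are ignored.
        for block in self.handlers.get(event, []):
            block(self.state, payload)

def synthesize(spec: Dict[str, List[str]]) -> Protocol:
    """Glue together the library blocks named in the specification."""
    handlers = {}
    for event, names in spec.items():
        missing = [n for n in names if n not in LIBRARY]
        if missing:
            raise KeyError(f"unknown building blocks: {missing}")
        handlers[event] = [LIBRARY[n] for n in names]
    return Protocol(handlers)

# A hypothetical specification: log on every receive, checkpoint on a timer.
SPEC = {
    "receive": ["log_message"],
    "timer": ["take_checkpoint"],
}
```

In this sketch, changing the protocol means editing `SPEC` rather than the handler code, and extending the toolkit means adding an entry to `LIBRARY`. That separation is the property the description above attributes to Egida, though the real toolkit's mechanisms may differ substantially.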
Representative Publication:
- S. Rao, L. Alvisi, and H. M. Vin, "Egida: An Extensible Toolkit
  for Low-overhead Fault-Tolerance," in Proceedings of the IEEE
  International Conference on Fault-Tolerant Computing (FTCS),
  pp. 48-55, June 1999.