Lightweight Fault-Tolerance
As distributed computing
becomes commonplace, and many more applications are faced with the
current costs of high availability, there is a fresh need for
recovery-based techniques that combine high performance during
failure-free executions with fast recovery. However, although the
literature contains approximately 300 papers in this area, rollback
recovery is seldom used in practice to build reliable distributed
applications. The Lightweight Fault-Tolerance (LiFT) project
focuses on changing this state of affairs with an approach that blends
algorithmic work, systems building, and empirical analysis.
Highlights
- Causal Logging Protocols
We have developed the first formal specification of
the consistency condition common to all rollback recovery protocols,
and we have derived from it Causal Logging, a novel
technique that eliminates the traditional performance tradeoffs
between pessimistic and optimistic protocols. Causal logging protocols
perform as well as optimistic protocols during failure-free
executions, but, like pessimistic protocols, never roll back correct
processes during crash recovery.
- The Egida Toolkit
Research in rollback recovery has long suffered from
a flourishing of algorithmic results that are rarely supported by
careful experimental evaluation of their practical significance. As a
result, the performance of these algorithms in practice is not well
understood, and little attention has been given to simplifying the
difficult task of integrating rollback recovery protocols with
applications. To address these issues, we have developed the
Egida toolkit. Egida's design addresses for the first time
the fundamental problem of characterizing the set of functionalities
that are at the core of all message-logging protocols. This
characterization, based on a framework for handling non-determinism in
a process execution, gives Egida the expressiveness to encompass the
diversity of rollback recovery protocols. A protocol is specified
using a simple, high-level language; the protocol's
implementation is synthesized from this specification by gluing
together appropriate modules from a library. As a result, Egida allows
the implementation of arbitrary rollback recovery protocols with
minimal programming effort.
- Understanding the Cost of Recovery
Using Egida, we have performed the first study of the recovery
performance of message-logging protocols. This study has revealed that
no existing protocol can simultaneously guarantee low overhead during
failure-free execution, fast recovery, and fault-containment, leaving
applications to face a complex tradeoff. To eliminate this tradeoff,
we have developed a new class of protocols that never roll back
correct processes, recover quickly (within 2% of the protocol with the
best recovery time) and impose little overhead during failure-free
execution (within 2% of the protocol with the best failure-free time).
- An Analysis of Communication-Induced Checkpointing
- Efficient Support of File I/O
Traditional rollback recovery techniques treat the file
system as part of the "outside world". As a result, processes may be
forced to execute a blocking output commit protocol whenever
they interact with the file system.
We have derived a new protocol that integrates records and
efficiently replicates the information necessary to reproduce file I/O
operations during recovery. Our simulation studies show that this
approach eliminates all synchronous logging to stable storage, thereby
reducing the cost of performing file I/O dramatically.
To confirm the simulation results, we are building a new
file system based on these protocols.
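The idea behind causal logging, described under Causal Logging Protocols above, can be illustrated with a minimal sketch. It is not the project's implementation: the class and function names (Determinant, Message, Process, determinants_for) are hypothetical, and the sketch assumes the simplest possible variant, in which a process piggybacks every determinant it holds on every outgoing message and never prunes its log. A determinant here records the order in which a process delivered a message, which is the non-deterministic information needed to replay that process after a crash.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Outcome of one non-deterministic event: a message delivery."""
    sender: str
    send_seq: int   # sender's send sequence number
    receiver: str
    recv_seq: int   # position in the receiver's delivery order


@dataclass
class Message:
    sender: str
    send_seq: int
    payload: str
    piggyback: frozenset  # determinants carried along with the message


@dataclass
class Process:
    name: str
    send_seq: int = 0
    recv_seq: int = 0
    det_log: set = field(default_factory=set)  # volatile determinant log

    def send(self, payload):
        # No write to stable storage: the determinants travel with the message.
        self.send_seq += 1
        return Message(self.name, self.send_seq, payload, frozenset(self.det_log))

    def deliver(self, msg):
        # Adopt the sender's determinants, then record our own delivery.
        self.recv_seq += 1
        self.det_log |= msg.piggyback
        self.det_log.add(
            Determinant(msg.sender, msg.send_seq, self.name, self.recv_seq)
        )


def determinants_for(crashed, survivors):
    """Collect, in delivery order, what survivors know about a crashed process."""
    found = set()
    for proc in survivors:
        found |= {d for d in proc.det_log if d.receiver == crashed}
    return sorted(found, key=lambda d: d.recv_seq)
```

In this sketch, if p sends to q and q then sends to r, r ends up holding the determinant of q's delivery. Any process whose state depends on that delivery therefore carries the information needed to replay it, which is why q can be recovered without rolling back r. A practical protocol would also stop piggybacking a determinant once enough copies exist; that bookkeeping is omitted here.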
Current Focus
- A Fault-Tolerant JVM
- Fault-Tolerance and Security. While the need
for protecting applications from security attacks is universally
recognized, little attention has been given to the problem of securing
the software that applications rely upon for fault-tolerance. This
problem is especially acute for rollback recovery protocols, in which
a malicious party who alters the information used during recovery can
affect the state to which a faulty process is restored and introduce a
Trojan horse. Moreover, denial-of-service attacks can force a process to
fail. We plan to exploit Egida's extensibility to design and
evaluate new secure rollback recovery protocols.
- Support for self-tuning fault-tolerance
protocols. Currently there are no simple guidelines that can help
even the experts in choosing the most efficient protocol for a given
application in a given execution environment. We plan to leverage the
simplicity with which different protocols can be implemented within
Egida to develop the understanding necessary to articulate these
guidelines. Our goal is to use these insights to allow Egida to monitor
the execution environment and the application behavior and to select
automatically the fault-tolerance protocol that best suits them.
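The ease of switching protocols that this plan relies on follows from the way Egida synthesizes a protocol from a specification by gluing together library modules. The sketch below illustrates that general idea only. It does not reproduce Egida's specification language or its module library: the module names, the dictionary-based specification format, and the synthesize function are all assumptions made for illustration.

```python
from typing import Callable, Dict

Module = Callable[[str], str]

# Hypothetical module library. Each module describes the action taken
# for one kind of event; a real module would perform the action.
LIBRARY: Dict[str, Module] = {
    "log_determinant_synchronously":
        lambda m: f"write determinant of {m} to stable storage before delivering",
    "keep_determinant_in_memory":
        lambda m: f"record determinant of {m} in volatile memory",
    "piggyback_determinants":
        lambda m: f"attach unlogged determinants to {m}",
    "send_unmodified":
        lambda m: f"send {m} as is",
}

# Hypothetical specifications: one module name per event kind.
PESSIMISTIC = {"deliver": "log_determinant_synchronously",
               "send": "send_unmodified"}
CAUSAL = {"deliver": "keep_determinant_in_memory",
          "send": "piggyback_determinants"}


def synthesize(spec: Dict[str, str]) -> Dict[str, Module]:
    """Build a protocol by looking up the module named for each event."""
    missing = [name for name in spec.values() if name not in LIBRARY]
    if missing:
        raise ValueError(f"unknown modules: {missing}")
    return {event: LIBRARY[name] for event, name in spec.items()}
```

Under these assumptions, moving from a pessimistic to a causal protocol means changing two entries in a specification rather than rewriting protocol code, which is the property a self-tuning system would need in order to switch protocols at run time.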