As distributed computing becomes commonplace and many more applications face the current costs of high availability, there is a fresh need for recovery-based techniques that combine high performance during failure-free executions with fast recovery. Yet although the literature contains approximately 300 papers in this area, rollback recovery is seldom used in practice to build reliable distributed applications. The Lightweight Fault-Tolerance (LiFT) project has focused on changing this state of affairs with an approach that blends algorithmic work, systems building, and empirical analysis.
Although the details of our solution are specific to TCP, the architecture that we propose is general enough to apply, in principle, to other connection-oriented network protocols.