Byzantine Fault-Tolerance for
Distributed Storage Systems

Data replication is a well-known means of protecting against data unavailability or corruption in the face of data server failures. When servers can suffer Byzantine (i.e., arbitrary) failures, the foremost approach for protecting data is via state machine replication, in which every correct server receives and processes every request in the same order, thereby producing the same output for each request. If the client then accepts a value returned by at least t+1 servers, then up to t arbitrary server failures can be masked. Numerous systems have been built to support this approach (e.g. Mike Reiter's Rampart).

To improve the efficiency and availability of data access while still protecting the integrity of replicated data, the use of quorum systems has been proposed. Quorum systems are a family of protocols that allow reads and updates of replicated data to be performed at only a subset (quorum) of the servers. In a t-masking quorum system, the quorums of servers are defined such that any two quorums intersect in at least 2t+1 servers. In a system with a maximum of t faulty servers, if each read and write operation is performed at a quorum, then the quorum used in a read operation will intersect the quorum used in the last preceding write operation in at least t+1 correct servers. With appropriate read and write protocols, this intersection condition ensures that the client is able to identify the correct, up-to-date data.

Unfortunately, however, there are obstacles to the use of masking quorum systems as an economical alternative for Byzantine fault tolerance. For one thing, they remain expensive in that tolerating multiple faults requires large quorums. Furthermore, the existing read and write protocols for these systems do not provide atomicity of operations in the general case.
The purpose of this project is to address these obstacles.

Current Focus

Under construction (but see recent papers below)

Past Highlights

Achieving Atomic Semantics. We have recently shown how the problem of atomic read and write operations on a masking quorum system can be reduced to that of "regular" operations (in which reads are serializable with respect to writes, but not necessarily with respect to each other) and developed a partial solution to the latter problem.
Managing Quorum Membership. We are developing non-blocking protocols for adding new servers to a quorum system and removing those known to be faulty. In addition to enhancing efficiency, such protocols will reduce the system's dependence on pre-defined tolerance thresholds, thus improving its suitability for tolerating non-random faults such as security breaches.
Dynamic Quorum Systems. We have designed protocols to change the size and/or structure of the quorums so that the system's degree of fault tolerance ("threshold") can be adjusted between a set minimum and maximum without blocking ordinary read and write operations. This allows smaller quorums to be used when the number of faults is believed to be low.
Fault Detection. We have designed statistical methods for monitoring the approximate number of faults in certain types of masking quorum systems, as well as specifically identifying some of the faulty servers.

Project Members

Collaborators

Dahlia Mahlki (Hebrew University)
Mike Reiter (Carnegie Mellon University)

Alumni

Lyn Pierce (Ph.D)

Relevant Publications

Minimal Byzantine Storage (with J.P. Martin and M. Dahlin). In Proceedings of the 16th International Symposium on Distributed Computing (DISC 2002), Toulouse, France, October 2002, pp. 311-326.

Small Byzantine Quorum Systems (with J.P. Martin and M. Dahlin). In Proceedings of the International Conference on Dependable Systems and Networks (DSN 2002 and FTCS 32), DCC Symposium, Washington, DC, June 2002, pp. 374-383. See also the extended technical report.

A Framework for Semantic Reasoning about Byzantine Quorum Systems. (with E. Pierce) In Brief Announcements, Proceedings of Symposium on Principles of Distributed Computing, August 2001.

Dynamic Byzantyne Quorum Systems. (with D. Mahlki, L. Pierce, M. Reiter, and R. Wright) In Proceedings of the International Conference on Dependable Systems and Networks (FTCS 30 and DCCA 8), New York, NY, June 2000, pp. 283-292.

Probabilistic Techniques for Fault Detection in Byzantine Quorum Systems. (with D. Mahlki, L. Pierce, and M. Reiter). Proceedings of the Seventh IFIP International Working Conference on Dependable Computing for Critical Applications (DCCA-7) San Jose, CA, January 1999, pp. 379-394.

Byzantine Fault-Tolerance for Distributed Storage Systems