Byzantine Fault-Tolerance for
Distributed Storage Systems
Data replication is a well-known means of protecting against data
unavailability or corruption in the face of data server failures.
When servers can suffer Byzantine (i.e., arbitrary) failures, the
foremost approach for protecting data is via state machine
replication, in which every correct server receives and processes
every request in the same order, thereby producing the same output for
each request. If the client then accepts a value returned by at least
t+1 servers, then up to t arbitrary server failures
can be masked. Numerous systems have been built to support this
approach (e.g. Mike Reiter's Rampart).
To improve the efficiency and availability of data access while still
protecting the integrity of replicated data, the use of quorum
systems has been proposed. Quorum systems are a family of protocols
that allow reads and updates of replicated data to be performed at
only a subset (quorum) of the servers. In a t-masking quorum
system, the quorums of servers are defined such that any two quorums
intersect in at least 2t+1 servers. In a system
with a maximum of t faulty servers, if each read and write operation
is performed at a quorum, then the quorum used in a read operation
will intersect the quorum used in the last preceding write operation
in at least t+1 correct servers. With appropriate read and write
protocols, this intersection condition ensures that the client is able
to identify the correct, up-to-date data.
Unfortunately, however, there are obstacles to the use of masking quorum
systems as an economical alternative for Byzantine fault tolerance.
For one thing, they remain expensive in that tolerating multiple faults
requires large quorums. Furthermore, the existing read and write protocols
for these systems do not provide atomicity of operations in the
general case.
The purpose of this project is to address these obstacles.
Current Focus
Under construction (but see recent papers below)
Past Highlights
- Achieving Atomic Semantics. We have
recently shown how the problem of atomic read and write operations
on a masking quorum system can be reduced to that of "regular"
operations (in which reads are serializable with respect to writes,
but not necessarily with respect to each other) and developed a
partial solution to the latter problem.
- Managing Quorum Membership. We are
developing non-blocking protocols for adding new servers to a
quorum system and removing those known to be faulty. In addition
to enhancing efficiency, such protocols will reduce the system's
dependence on pre-defined tolerance thresholds, thus improving its
suitability for tolerating non-random faults such as security
breaches.
- Dynamic Quorum Systems.
We have designed protocols to change the size and/or structure
of the quorums so that the system's degree of fault tolerance
("threshold") can be adjusted between a set minimum and maximum
without blocking ordinary read and write operations. This allows
smaller quorums to be used when the number of faults is believed
to be low.
- Fault Detection. We have designed
statistical methods for monitoring the approximate number of faults
in certain types of masking quorum systems, as well as specifically
identifying some of the faulty servers.
Sponsors
Project Members
Collaborators
- Dahlia Mahlki (Hebrew University)
- Mike Reiter (Carnegie Mellon University)
Alumni
Relevant Publications