The goal of this assignment is to introduce you to consensus algorithms in general and Paxos in particular. You will implement the Paxos replication algorithm within the context of a distributed systems simulator.
The provided code implements a simulator for the environment, including servers, clients, and the network. The simulator framework can be found in class dssim_t. Network code can be found in class Net. Servers and clients are represented as classes paxserver and paxclient. Check their definitions in paxclient.h and paxserver.h. Some basic types are defined in paxtypes.h.
Clients generate work, and servers respond to messages. The message
handlers have the same names as the corresponding message types.
paxmsg.h has the definitions of the message types. Implementations of the client's handlers are provided for your reference. You are to implement the servers' handlers in paxos_exec.cpp; this does NOT include view change.
The Paxos log is implemented in class Paxlog. You can treat it as persistent; in general, don't worry about persistence for this lab. Paxlog::trim_front can be used to erase the executed entries. To determine whether an entry is the next to execute, use Paxlog::next_to_exec.
A server maintains a paxobj to represent the underlying state machine. A request is a function that takes the server's paxobj, changes its state, and returns a string. A server logs the requests and executes log entries when appropriate using paxserver::paxop_on_paxobj, which also records the last result for each client asynchronously.
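To make that shape concrete, here is a small, self-contained sketch of a request written as a callable object. The struct and names below are invented for illustration (the lab's real state-machine class is paxobj); only the pattern matters: take the object, mutate it, return a string.

    #include <functional>
    #include <string>

    // Illustrative stand-in for the replicated state machine; the lab's real
    // class is paxobj.
    struct toy_state {
        int counter = 0;
    };

    // A request in the sense described above: a function that takes the state
    // machine, changes its state, and returns a string result.
    using request_t = std::function<std::string(toy_state*)>;

    int main() {
        request_t increment = [](toy_state* st) {
            st->counter += 1;
            return "counter=" + std::to_string(st->counter);
        };

        toy_state st;
        std::string result = increment(&st);   // result == "counter=1"
        (void)result;
    }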
To compile, run make pax. Use ./pax -h to see how to configure it.
For this lab, the only file you may modify is paxos_exec.cpp. During our testing, ALL OTHER files will be overwritten, so any changes you make to them will be lost. Please modify only paxos_exec.cpp and check yourself before you hand in the lab.
View change is not implemented in this lab, but the servers need to form an initial view when the simulation starts. We implemented a "fake initial view change" for this purpose, and you need to specify the -f option.
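For instance, a minimal run might look like the following (confirm the exact flag syntax with ./pax -h):

    make pax
    ./pax -f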
This lab is based on the paper "Paxos made practical" by David Mazieres. Please read that paper carefully. Here we note several deviations.
execute_res is defined in the paper as a union message type, but in this lab we explicitly use two message types, execute_success and execute_fail. In replicate_arg there is a committed field, which specifies a viewstamp below which the server has executed all requests and sent their results back to clients. Note that requests need to be executed in successive viewstamp order: in the same view, a request with timestamp ts can only be executed after the request with timestamp ts-1 has been executed (see viewstamp_t::sucessor). We use accept_arg to propagate the vc_state.latest_seen viewstamp when the primary has no unexecuted entries in the Paxos log after receiving a replicate_res. So accept messages are sent quite infrequently, but they do allow all replicas to end the simulation with the same state. Backups execute all requests less than or equal to the committed field in replicate_arg.
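As a self-contained illustration of the committed-field rule, here is a toy viewstamp comparison. The struct below is a stand-in invented for this sketch (the lab's real type is viewstamp_t in paxtypes.h), ordering viewstamps by view first and timestamp second, as in the paper.

    // Toy viewstamp for illustration only; the lab defines viewstamp_t.
    struct vs {
        unsigned vid;   // view id
        unsigned ts;    // timestamp within the view
    };

    // Viewstamps compare first by view, then by timestamp.
    bool vs_leq(const vs& a, const vs& b) {
        return (a.vid < b.vid) || (a.vid == b.vid && a.ts <= b.ts);
    }

    // A backup may execute a logged request once its viewstamp is less than
    // or equal to the committed viewstamp carried in replicate_arg.
    bool backup_may_execute(const vs& entry, const vs& committed) {
        return vs_leq(entry, committed);
    }

    int main() {
        vs committed{1, 5};
        bool ok = backup_may_execute(vs{1, 3}, committed);   // true
        (void)ok;
    }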
The quite ugly iterator interface to Paxlog (Paxlog::begin and Paxlog::end) is provided for you to traverse the log to determine which entries can be executed.
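To show the intended pattern of the execution loop in one self-contained piece, here is a toy version. Every type and member below is a stand-in invented for illustration; in the lab you would instead iterate with Paxlog::begin and Paxlog::end, test entries with Paxlog::next_to_exec, apply them with paxserver::paxop_on_paxobj, and erase executed entries with Paxlog::trim_front.

    #include <algorithm>
    #include <vector>

    // Toy log entry and log; the lab's real class is Paxlog.
    struct toy_entry {
        unsigned ts;      // timestamp within the current view
        bool executed;
    };

    struct toy_log {
        std::vector<toy_entry> entries;
        unsigned last_executed_ts = 0;

        // Analogue of Paxlog::next_to_exec: is this entry the next one in
        // timestamp order?
        bool next_to_exec(const toy_entry& e) const {
            return !e.executed && e.ts == last_executed_ts + 1;
        }

        // Analogue of Paxlog::trim_front: drop entries that have been executed.
        void trim_front() {
            entries.erase(std::remove_if(entries.begin(), entries.end(),
                                         [](const toy_entry& e) { return e.executed; }),
                          entries.end());
        }
    };

    // Analogue of the server's execute step: run every entry that has become
    // eligible, in order, then trim the log so it can eventually empty.
    void try_execute(toy_log& log) {
        bool progressed = true;
        while (progressed) {
            progressed = false;
            for (toy_entry& e : log.entries) {
                if (log.next_to_exec(e)) {
                    // The real server would call paxop_on_paxobj here to apply
                    // the request to the paxobj and record the client's result.
                    e.executed = true;
                    log.last_executed_ts = e.ts;
                    progressed = true;
                }
            }
        }
        log.trim_front();
    }

    int main() {
        toy_log log;
        log.entries = { {2, false}, {1, false}, {3, false} };   // arrival order need not be FIFO
        try_execute(log);   // executes 1, 2, 3 and trims them; the log is now empty
    }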
There are two basic notions of time: one is client work, and the other is the "ticks" counted off by the simulated system. On each tick, the simulator allows each node to check for messages and run any relevant handlers. On each tick, a node gets to receive a few messages, because in real life message-processing latency is much shorter than network delay.
All events in the simulation are controlled by a central random
number generator. By changing the seed to this generator, you can
change the order of events in your simulation, which can be quite
valuable for debugging. See the --seed option.
There are three things that keep the simulation running. First, clients still have work to do (see the -c and -r options to pax). Second, we don't want to end the simulation while a node has a message in its input queue; it should process that message. Finally, in order not to "strand" any replicas who might not be in the current view, the simulator continues the simulation until all replicas have emptied their log.
Life in a distributed system is tense. You are constantly worried that you have lost touch with your brethren, or perhaps it only looks like that because they have lost touch with you.
Because of the FLP impossibility result, liveness in distributed systems can be compromised. The simulator has a system of messages and thresholds for timeouts. Given any setting for these parameters, you can force the system into a "tough" state, where it will do something like initiate a view change or possibly just stop responding. Don't worry, you aren't responsible for making your system work across every possible parameter setting (which has been proved impossible). However, in general, if the system initiates a view change, that can mean that one of your nodes isn't attending to its work. For example, if the primary never forwards replicate requests, the replicas will initiate a view change. You don't need to handle the view change, but its existence might indicate a problem with your code.
Another sensitive set of parameters is the client timeout and the number of clients. If you increase the number of clients enormously while keeping the same client timeout, the servers will get overwhelmed and the clients will start resending requests that are currently in process. This will lead to system overload, with the simulator's queues filling up and the simulator becoming unresponsive while occupying more and more memory. There is no way to completely eliminate these problems in a distributed system.
Message reordering is controlled by the --delay flag and the --shuffle flag. Both take a value from 0 to 100, interpreted as the percentage of timesteps on which a delay or shuffle can occur. You should probably start your work by specifying --delay 0 and --shuffle 0, which will guarantee in-order delivery. A delay means that even though there is a message at the head of a node's queue, the node might not see that message in the current tick. A shuffle means all messages in a node's incoming queue will be randomly permuted.
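For example, when first bringing up your handlers you might start with a run like the following (the seed value is arbitrary, and the exact flag syntax should be confirmed with ./pax -h):

    ./pax -f --delay 0 --shuffle 0 --seed 1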
Just to be totally clear: to be correct, your code should NOT assume FIFO delivery. We only suggest turning on FIFO delivery as a way to make progress when you start your implementation. Of course, feel free to use the defaults, which are a 10% probability of delay and 20% for shuffle.
High values of delay make it probable that, for a given tick, a node processes zero messages. If this continues long enough, you will hit a timeout, so don't bother with very large values of delay. We think you can stress all interesting interleavings with shuffle == 100 and delay at, for example, 80. Don't start messing with heartbeat messages unless you want to enter a world of pain.
Hand in your code and write a few paragraphs about how you handle the important cases for your replication protocol. Also describe the most interesting bug you found in your own code.
Please report how much time you spent on the lab.