The goal of this assignment is to introduce you to consensus algorithms in general and Paxos in particular. You will implement the Paxos replication algorithm within the context of a distributed systems simulator.
The provided code implements a simulator for the environment, including servers, clients, and the network. The simulator framework can be found in class dssim_t. Network code can be found in class Net. Servers and clients are represented as classes paxserver and paxclient. Check their definitions in paxclient.h and paxserver.h. Some basic types are defined in paxtypes.h.
Clients generate work, and servers respond to messages. The message
handlers have the same names as the corresponding message types.
paxmsg.h has the definitions of the message types. Implementations of the client's handlers are provided for your reference. You are to implement the servers' handlers in paxos_exec.cpp; this does NOT include view change.
The Paxos log is implemented in class Paxlog. You can treat it as persistent; in general, don't worry about persistence for this lab. Paxlog::trim_front can be used to erase the executed entries. To determine whether an entry is the next to execute, use Paxlog::next_to_exec.
A server maintains a paxobj to represent the underlying state machine. A request is a function that takes the server's paxobj, changes its state, and returns a string. A server logs the requests and executes log entries when appropriate using paxserver::paxop_on_paxobj, which also records the last result for each client asynchronously.
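To make that shape concrete, here is a small, self-contained sketch of a request written as a callable object. The struct and names below are invented for illustration (the lab's real state-machine class is paxobj); only the pattern matters: take the object, mutate it, return a string.

    #include <functional>
    #include <string>

    // Illustrative stand-in for the replicated state machine; the lab's real
    // class is paxobj.
    struct toy_state {
        int counter = 0;
    };

    // A request in the sense described above: a function that takes the state
    // machine, changes its state, and returns a string result.
    using request_t = std::function<std::string(toy_state*)>;

    int main() {
        request_t increment = [](toy_state* st) {
            st->counter += 1;
            return "counter=" + std::to_string(st->counter);
        };

        toy_state st;
        std::string result = increment(&st);   // result == "counter=1"
        (void)result;
    }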
To compile, run make pax. Use ./pax -h to see how to configure it.
For this lab, the only file you may modify is paxos_exec.cpp. During our testing, ALL OTHER files will be overwritten, so any changes you make to them will be lost. Please modify only paxos_exec.cpp and check yourself before you hand in the lab.
View change is not implemented in this lab, but the servers need to form an initial view when the simulation starts. We implemented a "fake initial view change" for this purpose, and you need to specify the -f option.
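For instance, a minimal run might look like the following (confirm the exact flag syntax with ./pax -h):

    make pax
    ./pax -f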
This lab is based on the paper "Paxos made practical" by David Mazieres. Please read that paper carefully. Here we note several deviations.
execute_res is defined in the paper as a union message type, but in this lab we explicitly use two message types, execute_success and execute_fail. In replicate_arg there is a committed field, which specifies a viewstamp below which the server has executed all requests and sent their results back to clients. Note that requests need to be executed in successive viewstamp order: in the same view, a request with timestamp ts can only be executed after the request with timestamp ts-1 has been executed (see viewstamp_t::sucessor). We use accept_arg to propagate the vc_state.latest_seen viewstamp when the primary has no unexecuted entries in the Paxos log after receiving a replicate_res. So accept messages are sent quite infrequently, but they do allow all replicas to end the simulation with the same state. Backups execute all requests less than or equal to the committed field in replicate_arg.
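As a self-contained illustration of the committed-field rule, here is a toy viewstamp comparison. The struct below is a stand-in invented for this sketch (the lab's real type is viewstamp_t in paxtypes.h), ordering viewstamps by view first and timestamp second, as in the paper.

    // Toy viewstamp for illustration only; the lab defines viewstamp_t.
    struct vs {
        unsigned vid;   // view id
        unsigned ts;    // timestamp within the view
    };

    // Viewstamps compare first by view, then by timestamp.
    bool vs_leq(const vs& a, const vs& b) {
        return (a.vid < b.vid) || (a.vid == b.vid && a.ts <= b.ts);
    }

    // A backup may execute a logged request once its viewstamp is less than
    // or equal to the committed viewstamp carried in replicate_arg.
    bool backup_may_execute(const vs& entry, const vs& committed) {
        return vs_leq(entry, committed);
    }

    int main() {
        vs committed{1, 5};
        bool ok = backup_may_execute(vs{1, 3}, committed);   // true
        (void)ok;
    }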
The quite ugly iterator interface to Paxlog (Paxlog::begin and Paxlog::end) is provided for you to traverse the log to determine which entries can be executed.
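To show the intended pattern of the execution loop in one self-contained piece, here is a toy version. Every type and member below is a stand-in invented for illustration; in the lab you would instead iterate with Paxlog::begin and Paxlog::end, test entries with Paxlog::next_to_exec, apply them with paxserver::paxop_on_paxobj, and erase executed entries with Paxlog::trim_front.

    #include <algorithm>
    #include <vector>

    // Toy log entry and log; the lab's real class is Paxlog.
    struct toy_entry {
        unsigned ts;      // timestamp within the current view
        bool executed;
    };

    struct toy_log {
        std::vector<toy_entry> entries;
        unsigned last_executed_ts = 0;

        // Analogue of Paxlog::next_to_exec: is this entry the next one in
        // timestamp order?
        bool next_to_exec(const toy_entry& e) const {
            return !e.executed && e.ts == last_executed_ts + 1;
        }

        // Analogue of Paxlog::trim_front: drop entries that have been executed.
        void trim_front() {
            entries.erase(std::remove_if(entries.begin(), entries.end(),
                                         [](const toy_entry& e) { return e.executed; }),
                          entries.end());
        }
    };

    // Analogue of the server's execute step: run every entry that has become
    // eligible, in order, then trim the log so it can eventually empty.
    void try_execute(toy_log& log) {
        bool progressed = true;
        while (progressed) {
            progressed = false;
            for (toy_entry& e : log.entries) {
                if (log.next_to_exec(e)) {
                    // The real server would call paxop_on_paxobj here to apply
                    // the request to the paxobj and record the client's result.
                    e.executed = true;
                    log.last_executed_ts = e.ts;
                    progressed = true;
                }
            }
        }
        log.trim_front();
    }

    int main() {
        toy_log log;
        log.entries = { {2, false}, {1, false}, {3, false} };   // arrival order need not be FIFO
        try_execute(log);   // executes 1, 2, 3 and trims them; the log is now empty
    }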
There are two basic notions of time: one is client work, and the other is the "ticks" counted off by the simulated system. On each tick, the simulator allows each node to check for messages and run any relevant handlers. On each tick, a node gets to receive a few messages, because in real life message-processing latency is much shorter than network delay.
All events in the simulation are controlled by a central random
number generator. By changing the seed to this generator, you can
change the order of events in your simulation, which can be quite
valuable for debugging. See the --seed option.
There are three things that keep the simulation running. First, clients still have work to do (see the -c and -r options to pax). Second, we don't want to end the simulation while a node has a message in its input queue; it should process that message. Finally, in order not to "strand" any replicas who might not be in the current view, the simulator continues the simulation until all replicas have emptied their log.
Life in a distributed system is tense. You are constantly worried that you have lost touch with your brethren, or perhaps it only looks like that because they have lost touch with you.
Because of the FLP impossibility result, liveness in distributed systems can be compromised. The simulator has a system of messages and thresholds for timeouts. Given any setting for these parameters, you can force the system into a "tough" state, where it will do something like initiate a view change or possibly just stop responding. Don't worry, you aren't responsible for making your system work across every possible parameter setting (which has been proved impossible). However, in general, if the system initiates a view change, that can mean that one of your nodes isn't attending to its work. For example, if the primary never forwards replicate requests, the replicas will initiate a view change. You don't need to handle the view change, but its existence might indicate a problem with your code.
Another sensitive set of parameters is the client timeout and the number of clients. If you increase the number of clients enormously while keeping the same client timeout, the servers will get overwhelmed and the clients will start resending requests that are currently in process. This will lead to system overload, with the simulator's queues filling up and the simulator becoming unresponsive while occupying more and more memory. There is no way to completely eliminate these problems in a distributed system.
Message reordering is controlled by the --delay flag and the --shuffle flag. Both take a value from 0 to 100, interpreted as the percentage of timesteps on which a delay or shuffle can occur. You should probably start your work by specifying --delay 0 and --shuffle 0, which will guarantee in-order delivery. A delay means that even though there is a message at the head of a node's queue, the node might not see that message in the current tick. A shuffle means all messages in a node's incoming queue will be randomly permuted.
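For example, when first bringing up your handlers you might start with a run like the following (the seed value is arbitrary, and the exact flag syntax should be confirmed with ./pax -h):

    ./pax -f --delay 0 --shuffle 0 --seed 1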
Just to be totally clear: to be correct, your code should NOT assume FIFO delivery. We only suggest turning on FIFO delivery as a way to make progress when you start your implementation. Of course, feel free to use the defaults, which are a 10% probability of delay and 20% for shuffle.
High values of delay make it probable that, for a given tick, a node processes zero messages. If this continues long enough, you will hit a timeout, so don't bother with very large values of delay. We think you can stress all interesting interleavings with shuffle == 100 and delay at, for example, 80. Don't start messing with heartbeat messages unless you want to enter a world of pain.
Hand in your code and write a few paragraphs about how you handle the important cases for your replication protocol. Also describe the most interesting bug you found in your own code.
Please report how much time you spent on the lab.