I finally got done reading the Raft paper. I will be the first to admit that I’m a bit off schedule.
I was introduced to Paxos during my undergraduate studies at UIUC. Three times in fact; the first during the distributed systems class, the second during the cloud computing class, and the third during the advanced distributed systems class. Each time I had the same thought while (re)learning the algorithm: Man, this is pretty hard. It always took me a couple of times of reading the slides (for the course) or the (simplified) paper before I understood the algorithm. And I would forget it in about a month. Not the entire algorithm, but bits and pieces. Important bits and pieces unfortunately.
With Raft I’m hoping that will not be the case. It took me 2 reads of the paper to understand pretty much most of the algorithm. It has relatively few moving pieces (only 3 types RPCs!) which makes it pretty easy to understand. The paper does a great job of explaining the algorithm by first explaining how the system would function without safety constraints, and then modifying the RPCs and algorithm to deal with safety. Even with the safety constraint added the algorithm is quite intuitive. This simplicity of the algorithm leads to good performance as well: in the steady state (no crashes, and assuming the client communicates directly with the leader) the leader only has to wait for a response from a majority of followers in order to respond to a client request. My favorite part of the algorithm is leader election. Using randomized election timeouts is a simple yet effective idea.
I would recommend anyone interested in distributed systems read the Raft paper and watch this video as well.