I read an interesting article the other day that provided a great explanation, with sweet visualizations none the less, of the Paxos consensus algorithm. One of the papers mentioned in the “More Resources” section was Paxos Made Live. This paper has been on my radar for some time now and seeing it mentioned here inspired me to go ahead and read it.
I think this is one of my favorite papers. What I really liked about it is that it details the problems of translating an algorithm from an abstract theoretical concept into an actual living, breathing system.
Sections 5 and 6 of the paper are ridiculously good. Even something simple like taking a snapshot of a data structure has many subtleties associated with it when paired with a replicated log and I quite enjoyed reading about the snapshot mechanism the team had engineered. The idea of using a state machine language to implement their algorithm was excellent as well. One open source tool that comes to mind that allows you to do this now is Apache Helix.
Distributed systems are extremely hard to test. This is why the section on testing was particularly enlightening for me. I think having explicit hooks in source code to inject failures is quite a powerful idea. I particularly loved one line from this section – “By their very nature, fault-tolerant systems try to mask problems.”
I highly recommend reading this paper.