Jet lag, a book, research papers, and a dog

I reached Vadodara on the 23rd and was hit with jet lag that lasted about three days.

I’d bought Mira Jacob’s The Sleepwalker’s Guide to Dancing with the hope that it would last me a majority of my visit to India. I read ~100 pages of the book while in the U.S. On my first night in Vadodara I woke up around 3.30am (because of the afore mentioned jet lag) and finished the remaining ~400 pages in four hours. This is an incredible book, and I can see why it was a nominee for Goodreads’ Best Fiction books of 2014. Simply put, I loved it. And yes, I did appreciate the irony of the title given my battle with temporary-time-difference-induced sleeplessness.

With my reading plans in shambles (OK not completely in shambles since I am not yet done with Borges’ Labyrinths which is also amazing) I decided to work on one of my 2014 goals and read some research papers.

Here is what I read:

    1. Scaling Memcache at Facebook – This paper talks about how Facebook leveraged memcached to build a distributed key-value store. I thought the usage of UDP for making get requests and flow control mechanisms to combat incast was particularly interesting. Overall this is a very good paper and I would highly recommend it to anyone interested in distributed systems and system architecture. My favorite line in the paper? “Simplicity is vital.” (section 9).
    1. MultiQueues: Simpler, Faster, and Better Relaxed Concurrent Priority Queues – this paper introduces MultiQueues, a concurrent priority queue with relaxed semantics, i.e. you are not always guaranteed to get a globally minimal item from the data structure. Most of the papers I read are systems papers, so it was refreshing to read a paper that dealt more with data structures. MultiQueues are conceptually quite simple and their performance (as shown in section 6 of the paper) is impressive.
    1. You’re Doing It Wrong – this article introduces a B-heap, which is a VM page-friendly implementation of a binary heap. Kamp writes really well, and this article is a joy to read. My main takeaway from this article was the reminder that one should also consider I/O and memory access patterns while analyzing algorithms. This idea was introduced to me in CS-232 at UIUC and it is something that I always try to keep in mind while looking at an algorithm or trying to improve performance.

Oh, and here is the dog!


On time and art

I was looking at one of my favorite works of art recently and realized that it reminds me a lot of the concept of time in distributed systems. The melting clocks are a great illustration for the fact that the notion of time as we understand it makes little sense in a distributed system where each node has its own physical clock that runs independently of the other clocks. In a single node system one can look at timestamps to figure out the ordering of events. However, this concept breaks down for a multi-node system involving multiple clocks due to problems like clock skew. While several types of logical clocks have been created to solve this problem and help come up with causal ordering between events I think it is fascinating that time, something we pretty much take for granted everyday, is something you cannot rely on anymore in a distributed system. This was one of the first problems I was exposed to while studying distributed systems and understanding how one can solve it was extremely intellectually satisfying.

Aside – in the past few years I’ve discovered that I’m a fan of surreal art. Another artist I quite like is Rene Magritte, with The Son of Man, The Human Condition, and The Treachery of Images being my favorite works by him.

On Zab

Apache ZooKeeper has become an indispensable component for many distributed systems: Apache Hadoop, Apache Mesos, Apache Kafka, Apache HBase, etc. all use ZooKeeper in some form or the other. I’ve written code that interacts with ZooKeeper and I’m a big fan of the simple APIs and powerful guarantees it provides.

I had a vague idea of the broadcast protocol that powers ZooKeeper but wasn’t awake of the details. So this weekend I decided to read a short paper that gives an overview of Zab (a more detailed description of Zab can be found in “Zab: High-performance broadcast for primary-backup systems” by Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini).

The title of the paper is extremely accurate – Zab is a very simple protocol that is intuitive and easy to understand. The paper does a great job of explaining the core concepts of the algorithm to the reader. I particularly liked section 3, which includes a comparison between Zab and Paxos. Section 4 is probably the most important section of the paper and is very well written. The figures illustrating the two main failure scenarios are a nice touch.

Next step – read the detailed Zab paper.

On boats and consensus

I finally got done reading the Raft paper. I will be the first to admit that I’m a bit off schedule.

I was introduced to Paxos during my undergraduate studies at UIUC. Three times in fact; the first during the distributed systems class, the second during the cloud computing class, and the third during the advanced distributed systems class. Each time I had the same thought while (re)learning the algorithm: Man, this is pretty hard. It always took me a couple of times of reading the slides (for the course) or the (simplified) paper before I understood the algorithm. And I would forget it in about a month. Not the entire algorithm, but bits and pieces. Important bits and pieces unfortunately.

With Raft I’m hoping that will not be the case. It took me 2 reads of the paper to understand pretty much most of the algorithm. It has relatively few moving pieces (only 3 types RPCs!) which makes it pretty easy to understand. The paper does a great job of explaining the algorithm by first explaining how the system would function without safety constraints, and then modifying the RPCs and algorithm to deal with safety. Even with the safety constraint added the algorithm is quite intuitive. This simplicity of the algorithm leads to good performance as well: in the steady state (no crashes, and assuming the client communicates directly with the leader) the leader only has to wait for a response from a majority of followers in order to respond to a client request. My favorite part of the algorithm is leader election. Using randomized election timeouts is a simple yet effective idea.

I would recommend anyone interested in distributed systems read the Raft paper and watch this video as well.


“Paper” Review: Distributed systems for fun and profit

Book link: Distributed systems for fun and profit

OK, I will admit it, this is not a paper. I think the best description for it is a mini-textbook on distributed systems.

Over the course of five chapters this book goes over most of the major concepts in distributed systems, including synchronous v.s. asynchronous systems, vector clocks, the CAP theorem, detecting failures, and consensus algorithms. While the book doesn’t go in depth into any one of these topics, it helps lay a fantastic foundation on which you can build upon by reading books and papers (more on this in a bit) on distributed systems and related concepts. If your goal is to understand at a high level what systems like Kafka, Cassandra, or MongoDB do and why they are designed the way they are this book is for you. It will make terms like “eventual consistency” and “quorums” make sense when you read documentation for these systems and will help you design your applications better as well.

I think my favorite aspect about this book is that it introduces readers to newer distributed systems concepts like the CALM theorem, RAFT, CRDTs etc.

Another thing that I loved was that each chapter (and the Appendix) also has links to papers and other resources that go more in depth for each concept presented.

Overall, I would highly recommend reading this. If you’re new to distributed systems this book will help you get started. If you live and breathe distributed systems then this is a good resource to have on hand to brush up/look up topics quickly. And it has such pretty diagrams!

Running Fabric tasks in parallel based on roles

We’ve been using Fabric to set up and build Gelato on AWS. Each time I use it I’m left with this sense of awe at how amazing it is. Going from having to manually SSH into each machine to do anything to have Fabric build your code on 15 machines in parallel is indescribable.

One thing that we were having trouble with was having Fabric run a task on specific host roles in parallel. To run tasks in parallel you use the @parallel decorator, while to run tasks on hosts by roles you use the @roles decorator. If you want to run tasks in parallel on specific hosts you have to be careful of the order in which you apply these decorators. Here is what worked for us:

P.S. make sure you set the correct Bubble Size if you have a large number of hosts!