On Zab

Apache ZooKeeper has become an indispensable component for many distributed systems: Apache Hadoop, Apache Mesos, Apache Kafka, Apache HBase, etc. all use ZooKeeper in some form or another. I’ve written code that interacts with ZooKeeper and I’m a big fan of the simple APIs and powerful guarantees it provides.

I had a vague idea of the broadcast protocol that powers ZooKeeper but wasn’t aware of the details. So this weekend I decided to read a short paper that gives an overview of Zab (a more detailed description of Zab can be found in “Zab: High-performance broadcast for primary-backup systems” by Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini).

The title of the paper is extremely accurate – Zab is a very simple protocol that is intuitive and easy to understand. The paper does a great job of explaining the core concepts of the algorithm to the reader. I particularly liked section 3, which includes a comparison between Zab and Paxos. Section 4 is probably the most important section of the paper and is very well written. The figures illustrating the two main failure scenarios are a nice touch.

Next step – read the detailed Zab paper.

On boats and consensus

I finally got done reading the Raft paper. I will be the first to admit that I’m a bit off schedule.

I was introduced to Paxos during my undergraduate studies at UIUC. Three times, in fact: the first during the distributed systems class, the second during the cloud computing class, and the third during the advanced distributed systems class. Each time I had the same thought while (re)learning the algorithm: Man, this is pretty hard. It always took me a couple of readings of the slides (for the course) or the (simplified) paper before I understood the algorithm. And I would forget it in about a month. Not the entire algorithm, but bits and pieces. Important bits and pieces, unfortunately.

With Raft I’m hoping that will not be the case. It took me two reads of the paper to understand most of the algorithm. It has relatively few moving pieces (only 3 types of RPCs!), which makes it pretty easy to understand. The paper does a great job of explaining the algorithm by first explaining how the system would function without safety constraints, and then modifying the RPCs and algorithm to deal with safety. Even with the safety constraints added, the algorithm is quite intuitive. This simplicity leads to good performance as well: in the steady state (no crashes, and assuming the client communicates directly with the leader) the leader only has to wait for a response from a majority of followers in order to respond to a client request. My favorite part of the algorithm is leader election. Using randomized election timeouts is a simple yet effective idea.
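To get a feel for why randomized timeouts work, here is a minimal sketch (my own illustration, not code from the paper) of how a follower might pick its election timeout; the 150–300 ms range is the one the paper suggests:

import random

ELECTION_TIMEOUT_MIN_MS = 150  # range suggested in the Raft paper
ELECTION_TIMEOUT_MAX_MS = 300

def new_election_timeout():
    # Each follower draws its own random timeout. After a leader failure,
    # one follower usually times out first, becomes a candidate, and wins
    # the election before the others' timeouts even fire, which avoids
    # repeated split votes.
    return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)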

I would recommend anyone interested in distributed systems read the Raft paper and watch this video as well.


“Paper” Review: Distributed systems for fun and profit

Book link: Distributed systems for fun and profit

OK, I will admit it, this is not a paper. I think the best description for it is a mini-textbook on distributed systems.

Over the course of five chapters this book goes over most of the major concepts in distributed systems, including synchronous vs. asynchronous systems, vector clocks, the CAP theorem, detecting failures, and consensus algorithms. While the book doesn’t go in depth into any one of these topics, it lays a fantastic foundation that you can build on by reading books and papers (more on this in a bit) on distributed systems and related concepts. If your goal is to understand at a high level what systems like Kafka, Cassandra, or MongoDB do and why they are designed the way they are, this book is for you. It will make terms like “eventual consistency” and “quorums” make sense when you read the documentation for these systems, and will help you design your applications better as well.
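To give one example of the kind of intuition it builds: the “quorums” idea boils down to a simple overlap condition. If every read touches r replicas and every write is acknowledged by w replicas out of n total, then r + w > n guarantees that any read quorum intersects any write quorum, so a read sees the latest acknowledged write. A toy check (my own illustration, not from the book):

def quorums_overlap(n, r, w):
    # n = total replicas, r = replicas contacted per read,
    # w = replicas acknowledged per write. If r + w > n, every read
    # quorum shares at least one replica with every write quorum.
    return r + w > n

print(quorums_overlap(n=3, r=2, w=2))  # True: reads see the latest write
print(quorums_overlap(n=3, r=1, w=1))  # False: reads may be stale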

I think my favorite aspect of this book is that it introduces readers to newer distributed systems concepts like the CALM theorem, Raft, CRDTs, etc.

Another thing that I loved was that each chapter (and the Appendix) also has links to papers and other resources that go into more depth on each concept presented.

Overall, I would highly recommend reading this. If you’re new to distributed systems this book will help you get started. If you live and breathe distributed systems then this is a good resource to have on hand to brush up/look up topics quickly. And it has such pretty diagrams!

Running Fabric tasks in parallel based on roles

We’ve been using Fabric to set up and build Gelato on AWS. Each time I use it I’m left with a sense of awe at how amazing it is. Going from manually SSHing into each machine to do anything, to having Fabric build your code on 15 machines in parallel, is indescribable.

One thing that we were having trouble with was having Fabric run a task on specific host roles in parallel. To run tasks in parallel you use the @parallel decorator, while to run tasks on hosts by roles you use the @roles decorator. If you want to run tasks in parallel on specific hosts you have to be careful of the order in which you apply these decorators. Here is what worked for us:


from fabric.api import env, parallel, roles, run, task

env.roledefs = {
    "service_A": ["hostA1", "hostA2"],  # … more hosts
    "service_B": ["hostB1", "hostB2"],  # … more hosts
    "service_C": ["hostC1", "hostC2"],  # … more hosts
}

@task
@parallel
@roles("service_A", "service_B", "service_C")
def build():
    # The actual build steps go here; the command below is a
    # hypothetical placeholder for whatever your build runs.
    run("make build")

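With this in place, running "fab build" executes the task in parallel across all the hosts in the three roles.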

P.S. make sure you set the correct pool size if you have a large number of hosts!
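Fabric lets you cap the number of hosts it connects to at once by passing pool_size to the decorator (a minimal sketch; the cap of 10 is arbitrary):

@task
@parallel(pool_size=10)  # connect to at most 10 hosts at a time
@roles("service_A", "service_B", "service_C")
def build():
    run("make build")  # hypothetical placeholder, as above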

Gelato Tech Stack

For our Advanced Distributed Systems (CS 525) final project Onur and I are working on a system we’ve named Gelato. I will have more details about it in a month when we (hopefully) open source our code. Our tech stack for Gelato looks something like this:

We were deciding between Cassandra and HBase, and went with HBase because it has a native Java API and is pretty easy to use on AWS thanks to AWS EMR.

The languages we are using are:

  • Java for the core Gelato system
  • Python to gather performance metrics

Gelato is pretty much the most complex system I’ve built during college and I’m really excited to see how it finally turns out.