distributed systems – Karan Parikh

Domum

(domum means home in Latin)

High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads is a fantastic paper. Its primary focus is how to build distributed systems that are both highly available and strongly consistent. This is achieved by building multi-homed systems. As the paper describes them —

Such systems run hot in multiple datacenters all the time, and adaptively move load between datacenters, with the ability to handle outages of any scale completely transparently. [1]

While the paper mostly addresses building multi-homed systems in the context of distributed stream processing systems, the concepts and ideas are general enough that they can be applied to any large scale distributed software system with some modifications.

Before designing a distributed system that is resilient to failures it is paramount to understand what a failure even means in the context of software systems. Section 3 of the paper talks about common failure scenarios and highlights an important fact — partial failures are common, and “are harder to detect, diagnose, and recover from” [1] (compared to total failures). An important takeaway from this section is that when designing a new system (or trying to improve an old/current system) one should always think about what partial failures can occur, and how the system can/would react to it.

The next section motivates the need for multi-homed systems by first talking about singly-homed and failover-based systems. While singly-homed and failover-based systems are common, one typically does not run into multi-homed systems unless one operates at Google-scale (or close to). Building multi-homed systems is hard. But they offer significant benefits over singly-homed and failover-based systems in the face of (partial or total) failure. Google leverages its existing infrastructure, in particular Spanner, to build multi-homed systems with high availability.

Section 5 is the most interesting portion of the paper and talks about the challenges inherent in building multi-homed system. My main takeaway from this section is that it is virtually impossible to build a multi-homed distributed system without a system like Spanner (which is itself a multi-homed system) serving as the foundation — many of Spanner’s features, like global synchronous replication, reads at a particular version, etc. are used to solve the challenges mentioned in this section.

The paper ends with the description of three multi-homed systems at Google: F1/Spanner, Photon, and Mesa. I highly recommend reading the papers for each of these systems as well, as they have a lot more details about how these complex systems were built.

References
[1] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44686.pdf

Papir

(This post is a summary of two papers I have recently read. Papir is the Norwegian word for paper)

Real-Time Twitter Recommendation: Online Motif Detection in Large Dynamic Graphs is a paper that was presented at VLDB 2016. It combines two of my favorite topics, distributed systems and graph theory, into a short (2 pages!) paper. It presents a simplified version of the algorithm that Twitter uses to detect motifs in real-time in a user’s social graph, which is then used to generate recommendations for the user. One thing I liked about this paper is that it presents naive solutions to the problem at hand before diving into the elegant solution that Twitter uses. The paper then presents their solution to the problem, and explains how it works at Twitter scale by graph partitioning, pruning, and offline data structure generation.

Design patterns for container-based distributed systems is a paper by Google that talks about software design patterns that are emerging from software systems that are built around containers. Software like Docker and CoreOS has made working with containers easier, and more and more companies are moving towards a container based ecosystem. Google was one of the first companies to use containers, and this paper contains design and architecture patterns that they have observed in their container based systems. The design patterns presented are grouped under three main categories of which I enjoyed reading about “Multi-node application patterns” the most. This sections deals with design patterns in distributed systems, where each node holds multiple related containers (called “pods” in the paper). It was interesting to read about how distributed system problems like leader election, scatter-gather, etc. can be dealt with language agnostic containers rather than by language specific libraries. I loved this line from the end of the paper, which made me think of containers in an entirely new light:

In all cases, containers provide many of the same benefits as objects in object-oriented systems, such as making it easy to divide implementation among multiple teams and to reuse components in new contexts. In addition, they provide some benefits unique to distributed systems, such as enabling components to be upgraded independently, to be written in a mixture of languages, and for the system a whole to degrade gracefully.

Quality

I loved Fred’s post on the Zen of Erlang. I decided to check out his blog on the bus ride back from work today and read a few of his other posts. Two posts stood out to me.

Lessons Learned while Working on Large-Scale Server Software is, in my mind, required reading for any software engineer working on backend and infrastructure systems. Knowledge of lot of the concepts mentioned in this post (like the CAP Theorem, or the Fallacies of distributed computing) is essential in developing robust software systems. Fred’s style of writing is lots of fun to read, and I really his views on computer networks in this post —

There’s nothing more dangerous than someone going for a stroll over the network without knowing what it entails.

The network owes you nothing, and it doesn’t care about your feelings. It doesn’t deserve your trust.

The network is a necessary evil, not a place to expand to for fun.

The second post that stood out to me on how Queues Don’t Fix Overload. He explains in simple terms why queues (when used incorrectly) seem to solve your scaling problems in the short run while introducing a whole new class of problems of their own. As mentioned in the post, identifying bottlenecks in your system and scaling and fixing those is the correct way to deal with system overload.

Research Paper: “AsterixDB: A Scalable, Open Source BDMS”

(AsterixDB was one of the systems mentioned in the “Red Book” that piqued my interest)

AsterixDB: A Scalable, Open Source BDMS gives the reader an overview of the AsterixDB system. AsterixDB is an impressive “big data management system” (BDMS) with several interesting features including a flexible data model, a powerful query language, data ingestion capabilities and distributed query execution. Two features that stood out to me were the ability to describe custom index types (B+-tree, R-tree, etc.) on your data, and the ability to query data that “lives” outside the system.

A majority of the paper is on the data definition and manipulation layer. The authors use an example of a social networking website to illustrate the power of AsterixDB’s data model and query language. Most of this section consists of code snippets (to define, load, and query the data) followed by an explanation of what exactly that snippet of code does, and what happens under the hood when that snippet is run. These code snippets make this section of the paper very easy to read and understand.

The data storage, indexing, and query execution components are described in the System Architecture section of the paper. These subsystems have separate papers ([1] and [2]) devoted to them; in this paper we are just given a brief overview of how they function and what their key features are. One piece of information that stood out to me in this section was the software layer described that grants any index data structure LSM update semantics. I thought this was a very novel idea to help speed up data ingestion and index building, while at the same time having the benefit of diverse index data structures based on the type of data being stored and indexed. The secondary index design is also interesting.

I really enjoyed reading this paper. I’ve added [1] and [2] to my “research papers to read next” list, and hope to get to it very soon.

[1] S. Alsubaiee, A. Behm, V. Borkar, Z. Heilbron, Y.-S. Kim, M. Carey, M. Dressler, and C. Li. Storage Management in AsterixDB. Proc. VLDB Endow., 7(10), June 2014.

[2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A Flexible and Extensible Foundation for Data-intensive Computing. ICDE, 0:1151–1162, 2011.

Paper Review: Paxos Explained from Scratch

“Tutorial Summary: Paxos Explained from Scratch” is an extremely unique and interesting paper. As evident from the title, the paper attempts to explain the Paxos algorithm to the reader. What makes this paper great is that it builds up the Paxos algorithm step-by-step.

The Paxos algorithm is explained in the context of building a replicated state machine. The authors begin with a simple algorithm for consensus. By injecting failures in this simple algorithm we eventually derive the Paxos algorithm in a very natural fashion.

This is the first time I’ve read a bottom-up explanation of Paxos and I thought it was quite easy to understand. Each algorithm they present (building up to the Paxos algorithm) is also accompanied by a pictorial explanation which made concepts even more clear.

Overall, I loved this paper. If you’re looking to refresh your knowledge on the Paxos algorithm I would recommend reading this paper, followed by Paxos Made Simple.

AOSA: The NoSQL Ecosystem Review

(this is a review of the chapter on The NoSQL Ecosystem in the Architecture of Open Source Applications)

Unlike the other chapters in the book (and as stated in the introduction to the chapter), this portion of the book doesn’t dive deep into the internals of one particular project. Rather, it gives readers an overview of the various algorithms and concepts that serve as the building blocks for NoSQL systems like Voldemort, Cassandra, HBase, etc. I think it does a great job at explaining what is out there once one moves away from the traditional relational model and SQL world. It also references several seminal papers, like the Google BigTable and Amazon Dynamo papers, which I urge people to read if they are interested in understanding more about the topics covered in this chapter.

Speaking as someone who has read numerous papers on distributed and NoSQL systems (as well as studied them in several courses at UIUC) I feel like I didn’t gain a whole lot from reading this chapter. It was still a very enjoyable read and I really liked the sections that talked about fsync, read repair, hinted handoff, and anti-entropy. The section on the differences between range-based and hash-based partitioning was excellent as well. One thing I particularly liked was the author’s use of examples to explain concepts like the relational model, range-based partitioning, hash rings, etc.

If you have zero or very little background in NoSQL systems I would highly recommend reading this chapter.

Paper Review – Paxos Made Live

I read an interesting article the other day that provided a great explanation, with sweet visualizations none the less, of the Paxos consensus algorithm. One of the papers mentioned in the “More Resources” section was Paxos Made Live. This paper has been on my radar for some time now and seeing it mentioned here inspired me to go ahead and read it.

I think this is one of my favorite papers. What I really liked about it is that it details the problems of translating an algorithm from an abstract theoretical concept into an actual living, breathing system.

Sections 5 and 6 of the paper are ridiculously good. Even something simple like taking a snapshot of a data structure has many subtleties associated with it when paired with a replicated log and I quite enjoyed reading about the snapshot mechanism the team had engineered. The idea of using a state machine language to implement their algorithm was excellent as well. One open source tool that comes to mind that allows you to do this now is Apache Helix.

Distributed systems are extremely hard to test. This is why the section on testing was particularly enlightening for me. I think having explicit hooks in source code to inject failures is quite a powerful idea. I particularly loved one line from this section – “By their very nature, fault-tolerant systems try to mask problems.”

I highly recommend reading this paper.

Jet lag, a book, research papers, and a dog

I reached Vadodara on the 23rd and was hit with jet lag that lasted about three days.

I’d bought Mira Jacob’s The Sleepwalker’s Guide to Dancing with the hope that it would last me a majority of my visit to India. I read ~100 pages of the book while in the U.S. On my first night in Vadodara I woke up around 3.30am (because of the afore mentioned jet lag) and finished the remaining ~400 pages in four hours. This is an incredible book, and I can see why it was a nominee for Goodreads’ Best Fiction books of 2014. Simply put, I loved it. And yes, I did appreciate the irony of the title given my battle with temporary-time-difference-induced sleeplessness.

With my reading plans in shambles (OK not completely in shambles since I am not yet done with Borges’ Labyrinths which is also amazing) I decided to work on one of my 2014 goals and read some research papers.

Here is what I read:

Scaling Memcache at Facebook – This paper talks about how Facebook leveraged memcached to build a distributed key-value store. I thought the usage of UDP for making get requests and flow control mechanisms to combat incast was particularly interesting. Overall this is a very good paper and I would highly recommend it to anyone interested in distributed systems and system architecture. My favorite line in the paper? “Simplicity is vital.” (section 9).

MultiQueues: Simpler, Faster, and Better Relaxed Concurrent Priority Queues – this paper introduces MultiQueues, a concurrent priority queue with relaxed semantics, i.e. you are not always guaranteed to get a globally minimal item from the data structure. Most of the papers I read are systems papers, so it was refreshing to read a paper that dealt more with data structures. MultiQueues are conceptually quite simple and their performance (as shown in section 6 of the paper) is impressive.

You’re Doing It Wrong – this article introduces a B-heap, which is a VM page-friendly implementation of a binary heap. Kamp writes really well, and this article is a joy to read. My main takeaway from this article was the reminder that one should also consider I/O and memory access patterns while analyzing algorithms. This idea was introduced to me in CS-232 at UIUC and it is something that I always try to keep in mind while looking at an algorithm or trying to improve performance.

Oh, and here is the dog!

On time and art

I was looking at one of my favorite works of art recently and realized that it reminds me a lot of the concept of time in distributed systems. The melting clocks are a great illustration for the fact that the notion of time as we understand it makes little sense in a distributed system where each node has its own physical clock that runs independently of the other clocks. In a single node system one can look at timestamps to figure out the ordering of events. However, this concept breaks down for a multi-node system involving multiple clocks due to problems like clock skew. While several types of logical clocks have been created to solve this problem and help come up with causal ordering between events I think it is fascinating that time, something we pretty much take for granted everyday, is something you cannot rely on anymore in a distributed system. This was one of the first problems I was exposed to while studying distributed systems and understanding how one can solve it was extremely intellectually satisfying.

Aside – in the past few years I’ve discovered that I’m a fan of surreal art. Another artist I quite like is Rene Magritte, with The Son of Man, The Human Condition, and The Treachery of Images being my favorite works by him.

“The Future of Software Development”

This talk by Chris Granger is superb. I’m definitely going to spend more time learning about CRDTs. My favorite tidbit from the talk (around the 8:55 mark) – Apple’s auto layout in iOS uses a linear inequality constraint solver.