Sea

(summaries of and key takeaways from two papers I read in December)

Paper: Three States and a Plan: The A.I. of F.E.A.R. (this was the first game design paper I’ve read and it was pretty awesome, combining two of my Computer Science interests: graph theory and A.I.)

  • Enemy A.I. in F.E.A.R. = FSM to express states + A* to plan the sequence of actions to reach a goal state.
  • Separating goals from how the goals can be achieved (i.e. actions) leads to less complex code, code reusability, and facilitates code composition to build more complex systems.
  • The planning system in F.E.A.R is called Goal-Oriented Action Planning and is based on STRIPS with several modifications.
  • A* is used to find the sequence of actions with the least cost to reach a goal state. A* is used on a graph in which the nodes are states of the world and the edges are actions that cause the world to change from one state to another.
  • Effects and preconditions for actions are represented both as a fixed-size array capturing the state of the world and as procedural functions.
  • Squad behavior is implemented by periodically clustering A.I. that are in close physical proximity and issuing squad orders. These orders are simply goals that the A.I. prioritizes (according to its current goals) and satisfies if appropriate.
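
To make the "A* over world states" idea concrete, here is a minimal sketch. The action names, costs, and world-state facts are all made up for illustration, and it searches forward from the start state for simplicity, so treat it as a toy and not as how F.E.A.R. actually implements its planner.

```python
import heapq
from itertools import count

# Hypothetical actions: name -> (cost, preconditions, effects).
ACTIONS = {
    "draw_weapon": (1, {"armed": False}, {"armed": True}),
    "load_weapon": (1, {"armed": True, "loaded": False}, {"loaded": True}),
    "attack": (2, {"armed": True, "loaded": True}, {"target_dead": True}),
    "melee": (6, {}, {"target_dead": True}),
}


def satisfies(state, conditions):
    """True if every required fact holds in the given world state."""
    return all(state.get(key) == value for key, value in conditions.items())


def plan(start, goal):
    """A* where nodes are world states and edges are actions."""

    def heuristic(state):
        # Number of unsatisfied goal facts. Admissible in this toy because
        # every action costs at least 1 and fixes at most one goal fact.
        return sum(1 for key, value in goal.items() if state.get(key) != value)

    tie = count()  # tie-breaker so the heap never compares dicts
    frontier = [(heuristic(start), next(tie), 0, start, [])]
    best_cost = {}
    while frontier:
        _, _, cost_so_far, state, path = heapq.heappop(frontier)
        if satisfies(state, goal):
            return cost_so_far, path
        key = frozenset(state.items())
        if key in best_cost and best_cost[key] <= cost_so_far:
            continue  # already expanded this state at least as cheaply
        best_cost[key] = cost_so_far
        for name, (cost, preconditions, effects) in ACTIONS.items():
            if satisfies(state, preconditions):
                next_state = {**state, **effects}
                new_cost = cost_so_far + cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(next_state), next(tie),
                     new_cost, next_state, path + [name]),
                )
    return None  # no sequence of actions reaches the goal


START = {"armed": False, "loaded": False, "target_dead": False}
GOAL = {"target_dead": True}
print(plan(START, GOAL))
```

The goal only says what should be true; the planner decides which actions get there. That is the separation of goals from actions the paper argues for.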

Paper: Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services

  • Kraken is a system that load tests production systems (data centers or services) at Facebook by diverting live user traffic to the systems under test, and monitoring metrics like p99 latency and 5xx error rates to determine if traffic to the system under test should be increased or decreased, and by what amount.
  • Real user traffic is the best representative of load to your system. By using real user traffic to test production systems you don’t have to worry about capturing complex system dependencies and interactions that arise out of a SOA.
  • Kraken diverts traffic by modifying edge weights (from POPs to data centers), cluster weights (from web frontend cluster load balancers to the web frontend clusters), and server weights (from service load balancers to individual servers that make up the service).
  • Kraken reads test input and updates configuration files that are read by Proxygen to implement the edge and cluster weighting. Kraken then reads system metrics from Gorilla to dynamically determine how to adjust the edge and cluster weights based on how the system under test is performing.
  • Kraken tests allow Facebook to measure a server’s, cluster’s, and region’s capacity.
  • Kraken helps increase system utilization by exposing bottlenecks. By analyzing system metrics and how they change under different levels of load, Facebook was able to fix problems in their system. One of the issues identified in a system was poor load balancing, for which pick-2 load balancing was used as a solution.
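
The feedback loop described above can be sketched roughly as follows. The thresholds, step size, and halving back-off are assumptions I picked for illustration; the paper's actual control logic is more involved, and `next_weight` and `run_test` are hypothetical names.

```python
def next_weight(weight_pct, p99_ms, error_rate,
                p99_limit_ms=500, error_limit=0.01, step_pct=5):
    """Return the next traffic weight (0-100) for the system under test.

    Assumed policy: add a small step while the system looks healthy,
    and halve the weight as soon as either metric crosses its limit.
    """
    healthy = p99_ms <= p99_limit_ms and error_rate <= error_limit
    if healthy:
        return min(100, weight_pct + step_pct)
    return max(0, weight_pct // 2)


def run_test(read_metrics, weight_pct=10, rounds=5):
    """Drive the loop: read metrics at the current weight, then adjust."""
    history = [weight_pct]
    for _ in range(rounds):
        p99_ms, error_rate = read_metrics(weight_pct)
        weight_pct = next_weight(weight_pct, p99_ms, error_rate)
        history.append(weight_pct)
    return history


def fake_metrics(weight_pct):
    """Stand-in for a metrics store: latency blows up above 20% weight."""
    return (900, 0.0) if weight_pct > 20 else (200, 0.0)


print(run_test(fake_metrics))
```

With the fake metrics above, the weight climbs to just past the point where latency degrades and then backs off, which is roughly how a test like this would locate the capacity limit.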

Void

(summaries of and key takeaways from two papers I read last month)

Paper: SLIK: Scalable Low-Latency Indexes for a Key-Value Store

  • Building a low latency, consistent, and scalable secondary index for a NoSQL distributed store is hard.
  • Partitioning your secondary index independently of your data (i.e. not co-locating your secondary index with the data) is key for high performance.
  • SLIK returns consistent data without the need for transactions at write time by using what they term an “ordered write approach”. The SLIK client library shields applications from consistency checking by primary key hashes at read time.
  • I’ve used rule-based programming languages like Prolog before, but I did not know that rule-based programming can be used for non-AI tasks such as issuing concurrent, pipelined RPC requests, as SLIK does in its client API implementation.
  • SLIK reuses its underlying system’s (i.e. RAMCloud’s) storage system to store a representation of the secondary indexes SLIK builds for fast recovery in the face of failure.
  • Measure n times (where n >= 2) cut once: SLIK keeps its design simple by not implementing a garbage collection mechanism to handle invalid secondary index entries. The paper explains how the space savings gained by a garbage collector in their system are negligible.
  • By performing expensive operations like index creation in the background without locking the entire table SLIK ensures that performance never suffers.
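
Here is how I picture the ordered write approach and the read-time check, as a single-process toy. I'm assuming the order is "add new index entry, write object, remove old index entry", and I use primary keys directly where SLIK uses primary key hashes. `TinyStore` is a hypothetical name, not part of SLIK or RAMCloud.

```python
class TinyStore:
    """Toy table with one secondary index kept separately from the data."""

    def __init__(self, field="name"):
        self.field = field
        self.table = {}     # primary key -> record
        self.index = set()  # (secondary value, primary key) entries

    def write(self, primary_key, record):
        old = self.table.get(primary_key)
        # 1. Insert the new index entry first, so the index never
        #    lacks an entry for an object that is visible in the table.
        self.index.add((record[self.field], primary_key))
        # 2. Write the object itself.
        self.table[primary_key] = record
        # 3. Only then remove the now-stale index entry, if any.
        if old is not None and old[self.field] != record[self.field]:
            self.index.discard((old[self.field], primary_key))

    def lookup(self, value):
        """Index lookup plus the consistency check a client library would do."""
        candidates = [pk for (indexed, pk) in self.index if indexed == value]
        # Drop candidates whose current record no longer matches the key.
        return sorted(
            pk for pk in candidates
            if pk in self.table and self.table[pk][self.field] == value
        )


store = TinyStore()
store.write(1, {"name": "alice"})
store.write(2, {"name": "bob"})
store.write(1, {"name": "carol"})
store.index.add(("bob", 1))  # simulate a stale entry left by a crash
print(store.lookup("bob"))
```

Because stale index entries are filtered out at read time, a reader never sees an object under a key it no longer has, even though no transaction spans the index and the table.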

Paper: Caching Doesn’t Improve Mobile Web Performance (Much)

  • Measure n times (where n >= 2) cut once: a 10% increase in cache hit rate in Flywheel only led to a 1-2% reduction in mobile page load times. This is because of the inherent limitations in web page design and cell phone device hardware (as revealed and evaluated in this paper). A systematic evaluation of the problem (i.e. quantifying the gains of caching in mobile web performance) might have saved engineering effort in improving cache hit rate.
  • I was surprised that page load time was used as an evaluation criterion for cache performance, when above-the-fold load time seems like a more appropriate metric. As revealed in section 3.3 of the paper, this is because above-the-fold load time is harder to measure.
  • The load time for the critical path of a web page determines its overall page load time, and if the elements along the path are not cacheable, then more caching will have zero benefit to page load time. As shown by the experiments detailed in the paper, the amount of data on the critical path that can be cached is much smaller than the amount of overall data that can be cached for most mobile web pages.
  • The bottleneck for mobile web performance is the slow CPUs on mobile devices. Since the computational complexity involved with rendering the page is so high, caching does not give us the page load time reductions we expect on mobile devices.
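
The critical path argument is easy to see with a toy dependency graph. The resources, durations, and dependencies below are invented, and I model a cache hit as making that resource take zero time, which is a simplification.

```python
from functools import lru_cache


def page_load_time(durations, deps):
    """Length of the longest (critical) path through a dependency DAG.

    durations: resource -> time in ms to fetch or process it
    deps: resource -> resources that must finish before it can start
    Assumes the graph has no cycles.
    """

    @lru_cache(maxsize=None)
    def finish(node):
        start = max((finish(dep) for dep in deps.get(node, ())), default=0)
        return start + durations[node]

    return max(finish(node) for node in durations)


def with_cached(durations, resource):
    """Copy of durations where one resource is served from cache (0 ms)."""
    return {**durations, resource: 0}


# Hypothetical page: html -> script -> render is the critical path,
# and the image loads in parallel off that path.
DURATIONS = {"html": 100, "script": 300, "render": 400, "image": 150}
DEPS = {"script": ["html"], "render": ["script"], "image": ["html"]}

print(page_load_time(DURATIONS, DEPS))
print(page_load_time(with_cached(DURATIONS, "image"), DEPS))
print(page_load_time(with_cached(DURATIONS, "script"), DEPS))
```

Caching the image changes nothing because it was never on the critical path. Even caching the script leaves the render step, which stands in here for the CPU-bound work that no cache can remove.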

Domum

(domum means home in Latin)

High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads is a fantastic paper. Its primary focus is how to build distributed systems that are both highly available and strongly consistent. This is achieved by building multi-homed systems. As the paper describes them —

Such systems run hot in multiple datacenters all the time, and adaptively move load between datacenters, with the ability to handle outages of any scale completely transparently. [1]

While the paper mostly addresses building multi-homed systems in the context of distributed stream processing systems, the concepts and ideas are general enough that they can be applied to any large scale distributed software system with some modifications.

Before designing a distributed system that is resilient to failures it is paramount to understand what a failure even means in the context of software systems. Section 3 of the paper talks about common failure scenarios and highlights an important fact: partial failures are common, and “are harder to detect, diagnose, and recover from” [1] (compared to total failures). An important takeaway from this section is that when designing a new system (or trying to improve an old/current system) one should always think about what partial failures can occur, and how the system can/would react to them.

The next section motivates the need for multi-homed systems by first talking about singly-homed and failover-based systems. While singly-homed and failover-based systems are common, one typically does not run into multi-homed systems unless one operates at Google-scale (or close to). Building multi-homed systems is hard. But they offer significant benefits over singly-homed and failover-based systems in the face of (partial or total) failure. Google leverages its existing infrastructure, in particular Spanner, to build multi-homed systems with high availability.

Section 5 is the most interesting portion of the paper and talks about the challenges inherent in building multi-homed systems. My main takeaway from this section is that it is virtually impossible to build a multi-homed distributed system without a system like Spanner (which is itself a multi-homed system) serving as the foundation. Many of Spanner’s features, like global synchronous replication, reads at a particular version, etc. are used to solve the challenges mentioned in this section.

The paper ends with the description of three multi-homed systems at Google: F1/Spanner, Photon, and Mesa. I highly recommend reading the papers for each of these systems as well, as they have a lot more details about how these complex systems were built.

References
[1] http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44686.pdf

Papir

(This post is a summary of two papers I have recently read. Papir is the Norwegian word for paper)

Real-Time Twitter Recommendation: Online Motif Detection in Large Dynamic Graphs is a paper that was presented at VLDB 2016. It combines two of my favorite topics, distributed systems and graph theory, into a short (2 pages!) paper. It presents a simplified version of the algorithm that Twitter uses to detect motifs in real-time in a user’s social graph, which is then used to generate recommendations for the user. One thing I liked about this paper is that it presents naive solutions to the problem at hand before diving into the elegant solution that Twitter uses. The paper then presents their solution to the problem, and explains how it works at Twitter scale by graph partitioning, pruning, and offline data structure generation.

Design patterns for container-based distributed systems is a paper by Google that talks about software design patterns that are emerging from software systems that are built around containers. Software like Docker and CoreOS has made working with containers easier, and more and more companies are moving towards a container-based ecosystem. Google was one of the first companies to use containers, and this paper contains design and architecture patterns that they have observed in their container-based systems. The design patterns presented are grouped under three main categories, of which I enjoyed reading about “Multi-node application patterns” the most. This section deals with design patterns in distributed systems, where each node holds multiple related containers (called “pods” in the paper). It was interesting to read about how distributed system problems like leader election, scatter-gather, etc. can be handled with language-agnostic containers rather than with language-specific libraries. I loved this line from the end of the paper, which made me think of containers in an entirely new light:

In all cases, containers provide many of the same benefits as objects in object-oriented systems, such as making it easy to divide implementation among multiple teams and to reuse components in new contexts. In addition, they provide some benefits unique to distributed systems, such as enabling components to be upgraded independently, to be written in a mixture of languages, and for the system a whole to degrade gracefully.

 

Intelligence

Inspired by a tutorial on TensorFlow that was on HN recently I decided to go and read the TensorFlow paper. This paper has been sitting in my “To Read” folder for quite some time now but for various reasons I never got around to reading it. This is also the first AI/ML paper I’ve read in 2016 so I was excited to dive right in.

At 19 pages long this is one of the longest papers I’ve read. But it is extremely well written, with lots of diagrams, charts, and code samples interspersed throughout the text that make this paper fun to read.

The basic idea of TensorFlow, to have one system that can work across heterogeneous computing platforms to solve AI/ML problems, is incredibly powerful. I fell in love with the directed graph API used by TensorFlow to describe computations that will run on it (this may or may not be related to the fact that I also love graph theory). The multi-device (and distributed) execution algorithm explained in the paper is quite intuitive and easy to understand. A major component of multi-device / distributed execution of the TensorFlow graph is deciding which device to place a node on. While the paper does explain the algorithm used in section 3.2.1, I wish they had gone into more detail and talked about which graph placement algorithms didn’t work, details about the greedy heuristic used, etc.
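
To get a feel for what a greedy placement heuristic might look like, I wrote the toy below. It is my own reading of the idea (walk the graph in topological order and put each node on the device where it would finish earliest, charging a flat cost for cross-device inputs). The cost numbers, device names, and the `greedy_placement` function are all made up and are not TensorFlow's API.

```python
def greedy_placement(nodes, deps, compute_cost, transfer_cost):
    """Place each node on the device where it is estimated to finish soonest.

    nodes: node names in topological order
    deps: node -> list of input nodes
    compute_cost: node -> {device: estimated run time}; only feasible
        devices are listed for each node
    transfer_cost: flat penalty for an input that lives on another device
    """
    placement, finish, device_free = {}, {}, {}
    for node in nodes:
        best = None
        for device, cost in compute_cost[node].items():
            # The node can start once the device is idle and every input
            # has arrived (inputs from other devices pay the transfer cost).
            ready = device_free.get(device, 0)
            for dep in deps.get(node, ()):
                penalty = 0 if placement[dep] == device else transfer_cost
                ready = max(ready, finish[dep] + penalty)
            candidate = (ready + cost, device)
            if best is None or candidate < best:
                best = candidate
        finish[node], placement[node] = best
        device_free[placement[node]] = finish[node]
    return placement, finish


# Hypothetical three-node graph: c consumes the outputs of a and b,
# and b has no GPU kernel so it can only run on the CPU.
NODES = ["a", "b", "c"]
DEPS = {"c": ["a", "b"]}
COSTS = {
    "a": {"cpu": 4, "gpu": 1},
    "b": {"cpu": 2},
    "c": {"cpu": 3, "gpu": 1},
}

placement, finish = greedy_placement(NODES, DEPS, COSTS, transfer_cost=2)
print(placement, finish)
```

Even in this tiny example the trade-off shows up: node c goes to the GPU despite having to wait for b's output to be copied over, because the faster kernel still wins.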

Sections 5, 6, and 7 were my favorite portions of the paper. Section 5 dives into some of the performance optimizations used in TensorFlow. It would have been awesome if the authors had given more details about the scheduling algorithm used to minimize memory and network bandwidth consumption. I would have also liked to know what other scheduling optimizations were used in TensorFlow as I find scheduling algorithms very interesting.

Section 6 talks about the experience of porting the Inception model over to TensorFlow. While the strategies mentioned in this section are specific to machine learning systems, I feel that some of them can be tweaked a little bit to be generally applicable to all software systems. For instance

“Start small and scale up” (strategy #2)

is directly applicable to any software system. Similarly,

“Make a single machine implementation match before debugging a distributed implementation” (strategy #4)

can be rephrased as

“Make a single machine implementation work before debugging a distributed implementation”

and be generally applicable to building distributed systems.

Section 7 explains how TensorFlow can be used to speed up stochastic gradient descent (SGD). Again, while the idioms presented in this section are used to speed up SGD, I feel that they are general enough to be applied to other algorithms/systems as well. The diagrams in this section are amazing and do a great job of illustrating the differences between the various parallelism and concurrency idioms.

EEG, the internal performance tool mentioned in the paper, sounds very interesting. While it is probably not in the scope of a paper that focuses on TensorFlow I’d love to learn more about EEG. It seems like a very powerful tool and could probably be extended to work with other systems as well.

The paper ends with a survey of related systems. This section proved to be a valuable source for finding new AI/ML and systems papers to read.

I loved this paper.

 

 

Travel

After 24+ hours of traveling I’m back in San Francisco! The long journey gave me a lot of time to think. And read. And sleep (I think my superpower is falling asleep on airplanes).

Here are some of the things I read:

Research Paper: “f4: Facebook’s Warm BLOB Storage System”

(second paper in my quest to distract myself)

f4: Facebook’s Warm BLOB Storage System introduces the reader to f4, a storage system designed and used at Facebook to store “warm” binary large objects (aka BLOBs). The term “warm” is used to denote the fact that these pieces of data are not as frequently accessed as “hot” BLOBs (which are stored in Haystack). The main motivation behind the design of f4 was the desire to lower the replication factor for warm BLOBs, while still maintaining the same fault tolerance guarantees (node, rack, and datacenter failure) that hot BLOBs have.

The first half of the paper dives into warm BLOBs and their characteristics (section 3) and also gives an overview on how Haystack works (section 4).

Section 5 dives into the details of f4. It explains the overall architecture of the system, how it leverages Reed-Solomon coding to reduce storage overhead (compared to raw replication), how the replication factor for BLOBs is 2.1 (compared to 3.6 in Haystack), how fault tolerance works, etc. The architecture section is very well written and does a good job of explaining the different types of nodes that comprise an f4 cell. My favorite section in the paper is the one that talks about fault tolerance (section 5.5); the “Quadruple Failure Example” in this section is extremely interesting and does a good job of showing how the system deals with failures at various levels. Another part of the paper that I really liked was the section on “Software/Hardware Co-Design” in section 5.6.
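
The 3.6 and 2.1 figures are easy to reproduce with a bit of arithmetic. The parameters below are the ones I remember from the paper (three Haystack replicas on RAID-6 with 1.2x overhead, Reed-Solomon with 10 data and 4 parity blocks, and an XOR of two cells stored in a third datacenter), so double-check them against the paper before relying on them.

```python
from fractions import Fraction


def rs_overhead(data_blocks, parity_blocks):
    """Storage overhead of Reed-Solomon coding: total blocks / data blocks."""
    return Fraction(data_blocks + parity_blocks, data_blocks)


# Haystack: 3 full replicas, each stored on RAID-6 (assumed 1.2x overhead).
haystack = 3 * Fraction(12, 10)

# f4 within a single cell: RS(10, 4) tolerates disk, host, and rack failures.
single_cell = rs_overhead(10, 4)

# Naive datacenter fault tolerance: keep two full copies of the coded cell.
two_copies = 2 * single_cell

# XOR scheme: two blocks in different datacenters plus their XOR in a
# third, i.e. 3 stored blocks per 2 data blocks, applied on top of RS.
xor_geo = Fraction(3, 2) * single_cell

for name, factor in [("Haystack", haystack), ("RS(10,4)", single_cell),
                     ("two copies", two_copies), ("XOR", xor_geo)]:
    print(f"{name}: {float(factor):.1f}x")
```

Using exact fractions keeps the numbers tidy: 18/5, 7/5, 14/5, and 21/10, which print as 3.6x, 1.4x, 2.8x, and 2.1x.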

Overall this paper was fun to read and very interesting. It had been on my “To Read” list for quite some time now and I’m glad I finally got to it.

Research Paper: “Hekaton: SQL Server’s Memory-Optimized OLTP Engine”

(I’ve noticed that when I’m sad I tend to throw myself at whatever activity catches my fancy at the moment. I do this to distract myself, and in general this seems to work pretty well. To deal with my sadness this time around I will be reading research papers. And blogging. Here’s the first paper I read.)

(Hekaton was one of the systems mentioned in the “Red Book” that piqued my interest)

Hekaton: SQL Server’s Memory-Optimized OLTP Engine gives the reader an overview of Hekaton, a database engine that is a part of Microsoft SQL Server. It has been designed to work with data that fits entirely in main memory. The main motivation driving the design and implementation of Hekaton is the dropping cost of memory and the ever-growing popularity of multi-core CPUs. In order to achieve the best performance and to take full advantage of the multiple cores, Hekaton embraces a lock/latch-free design: all the index structures (a hash table and B-Tree) are lock/latch-free (details on the design are in [1] and [2]) and transactions use MVCC.

While the details of the implementation of the index data structures are in another paper, this paper does go into details of the MVCC design used and the garbage collection mechanism used to delete old records. Sections 6, 7, and 8 go into details of transactions, logging, and garbage collection. These sections are incredibly well written and do a great job of explaining these complex and core components of the system. The logging and checkpointing system is quite unique and I thought the non-usage (I’m sure there is a better term) of WAL is interesting. Section 8, which goes into details of the garbage collection mechanism used in Hekaton, is definitely my favorite section in the paper. I think the GC algorithm is, simply put, beautiful.
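
As a rough mental model of the MVCC and garbage collection ideas, here is a tiny sketch. It assumes each record version carries a begin and an end timestamp and is visible to a reader whose read time falls in that interval. It leaves out everything that makes the real design interesting (transaction IDs stored in the timestamp fields, validation, cooperative and parallel GC), and the names are my own.

```python
from dataclasses import dataclass

INFINITY = float("inf")


@dataclass
class Version:
    value: str
    begin: float            # commit time of the transaction that created it
    end: float = INFINITY   # commit time of the transaction that replaced it


def visible(version, read_time):
    """A version is visible if the read time falls in [begin, end)."""
    return version.begin <= read_time < version.end


def read(versions, read_time):
    """Return the value a transaction reading at read_time would see."""
    for version in versions:
        if visible(version, read_time):
            return version.value
    return None


def collect_garbage(versions, oldest_active_read_time):
    """Keep only versions that some active transaction could still see.

    A version whose end timestamp is at or before the oldest active
    read time can never be visible again, so it is safe to drop.
    """
    return [v for v in versions if v.end > oldest_active_read_time]


history = [Version("v1", begin=10, end=20), Version("v2", begin=20)]
print(read(history, 15), read(history, 20), read(history, 5))
print(len(collect_garbage(history, oldest_active_read_time=25)))
```

Readers never block writers here because an update only appends a new version and closes the old one's interval. Garbage collection then reduces to asking whether anyone can still see a version.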

Another unique aspect of the system: T-SQL queries and procedures are compiled down into native code to achieve high performance. Section 5 goes into the details of how this is done. What is interesting about this conversion process is that the generated code is one big function with labels and goto statements.

This was a great paper to begin 2016 with.

References

[1] Maged M. Michael. 2002. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures (SPAA ’02): 73-82.

[2] Levandoski, J.J.; Lomet, D.B.; Sengupta, S., “The Bw-Tree: A B-tree for new hardware platforms,” in Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pp. 302-313, 8-12 April 2013

Research Paper: “AsterixDB: A Scalable, Open Source BDMS”

(AsterixDB was one of the systems mentioned in the “Red Book” that piqued my interest)

AsterixDB: A Scalable, Open Source BDMS gives the reader an overview of the AsterixDB system. AsterixDB is an impressive “big data management system” (BDMS) with several interesting features including a flexible data model, a powerful query language, data ingestion capabilities and distributed query execution. Two features that stood out to me were the ability to describe custom index types (B+-tree, R-tree, etc.) on your data, and the ability to query data that “lives” outside the system.

A majority of the paper is on the data definition and manipulation layer. The authors use an example of a social networking website to illustrate the power of AsterixDB’s data model and query language. Most of this section consists of code snippets (to define, load, and query the data) followed by an explanation of what exactly that snippet of code does, and what happens under the hood when that snippet is run. These code snippets make this section of the paper very easy to read and understand.

The data storage, indexing, and query execution components are described in the System Architecture section of the paper. These subsystems have separate papers ([1] and [2]) devoted to them; in this paper we are just given a brief overview of how they function and what their key features are. One piece of information that stood out to me in this section was the software layer described that grants any index data structure LSM update semantics. I thought this was a very novel idea to help speed up data ingestion and index building, while at the same time having the benefit of diverse index data structures based on the type of data being stored and indexed. The secondary index design is also interesting.
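
My mental model of "giving any index LSM update semantics" is a wrapper like the one below: all writes go to a small in-memory component, full components are frozen and never modified again, and reads check components from newest to oldest. This is only a sketch of the general LSM idea with a hypothetical `LSMIndex` class; it uses plain dicts as the wrapped index and skips merging, and it is not AsterixDB's actual code.

```python
class LSMIndex:
    """Wrap any dict-like index so that updates never modify old components."""

    _TOMBSTONE = object()  # marker recording that a key was deleted

    def __init__(self, make_index=dict, memory_limit=2):
        self.make_index = make_index      # factory for the wrapped index type
        self.memory_limit = memory_limit  # entries before a flush
        self.memory = make_index()        # mutable in-memory component
        self.disk = []                    # frozen components, newest first

    def put(self, key, value):
        self.memory[key] = value
        if len(self.memory) >= self.memory_limit:
            # "Flush": freeze the component and start a fresh one.
            self.disk.insert(0, self.memory)
            self.memory = self.make_index()

    def delete(self, key):
        # Deletes are just writes of a tombstone; old data stays untouched.
        self.put(key, self._TOMBSTONE)

    def get(self, key):
        # Newest component wins, so later writes shadow earlier ones.
        for component in [self.memory, *self.disk]:
            if key in component:
                value = component[key]
                return None if value is self._TOMBSTONE else value
        return None


index = LSMIndex(memory_limit=2)
index.put("a", 1)
index.put("b", 2)   # triggers a flush
index.put("a", 3)   # newer value shadows the flushed one
index.delete("b")   # tombstone, triggers a second flush
print(index.get("a"), index.get("b"), len(index.disk))
```

The wrapper never needs to know how the underlying index is organized, which is why the same trick could apply to a B+-tree, an R-tree, or anything else with insert and lookup operations.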

I really enjoyed reading this paper. I’ve added [1] and [2] to my “research papers to read next” list, and hope to get to them very soon.

[1] S. Alsubaiee, A. Behm, V. Borkar, Z. Heilbron, Y.-S. Kim, M. Carey, M. Dressler, and C. Li. Storage Management in AsterixDB. Proc. VLDB Endow., 7(10), June 2014.

[2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A Flexible and Extensible Foundation for Data-intensive Computing. ICDE, 0:1151–1162, 2011.

Red

(The worst part about jet lag is jet lag. The best part about jet lag is that it makes me very productive for some reason. Last year I read a book and a few research papers. This year I finished reading the “Red Book” while not being able to sleep according to the time zone I’m in)

As I’ve mentioned before, databases hold a special place in my heart. I think they’re incredibly interesting pieces of software. State of the art database systems that exist today are the result of decades of research and systems engineering. The “Red Book” does a superb job in explaining how we got here, and where we might be going next.

The book is organized into chapters that deal with different components and related areas of database systems. The authors pick a few research papers that are significant in the chapter under discussion and then offer their commentary on them, as well as explain the content of the paper and talk about other related systems/papers/techniques/algorithms. The authors (Peter Bailis, Joseph M. Hellerstein, and Michael Stonebraker) have a lot of technical expertise in database systems, which makes this book an absolute delight to read. I particularly enjoyed the personal anecdotes and commentaries that are sprinkled throughout the book. My favorite chapters in the book were the ones on weak isolation and distribution and query optimization.

While reading this book I made note of all research papers that are referenced in this book that I would like to read next. I will be working on that list over the duration of my vacation.