Here are two research papers I recently read:
(my thoughts and summary on the second paper I read this month as part of my endeavor to read ten research papers in June)
This research paper introduces readers to MemC3, which stands for Memcached with CLOCK and Concurrent Cuckoo Hashing. MemC3 is Memcached with modifications to make it faster and use less space. The researchers achieve the speed boost by allowing multiple readers and a single writer to operate concurrently, using their own concurrent implementation of Cuckoo Hashing. The space savings are gained by using the CLOCK algorithm to approximate LRU cache eviction.
The authors begin by examining the existing use cases and design choices of Memcached. They notice that most Memcached use cases are read heavy, and that Memcached either (a) uses a single thread, or (b) (in newer versions) uses a single global lock. These two observations highlight areas where Memcached’s performance can be improved.
To improve the read and write performance and to enable concurrent operations, the researchers replace Memcached’s hash table + linked list based cache with a 4-way set-associative hash table that uses Cuckoo Hashing. I hadn’t heard of Cuckoo Hashing before reading this paper, and it is quite an interesting concept. To allow multiple readers and one writer, the authors develop their own concurrent Cuckoo Hashing algorithm that combines lock striping and optimistic locking in a simple and easy to understand algorithm. This section (section 3.2) was the part of the paper that I enjoyed the most.
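To make the basic idea concrete for myself, here is a minimal single-threaded sketch of a 4-way set-associative cuckoo hash table. It only illustrates the "two candidate buckets per key, displace on collision" idea; it does not attempt the paper's concurrent scheme (no lock striping, no optimistic versioning), and the class name, bucket count, and kick limit are my own assumptions.

```python
import hashlib
import random


class CuckooTable:
    """Single-threaded 4-way set-associative cuckoo hash table (illustration only)."""

    SLOTS = 4         # entries per bucket ("4-way")
    MAX_KICKS = 128   # displacement attempts before giving up (arbitrary choice)

    def __init__(self, num_buckets=16, seed=0):
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _candidates(self, key):
        # Derive two bucket indexes from disjoint parts of one SHA-256 digest.
        digest = hashlib.sha256(repr(key).encode()).digest()
        n = len(self.buckets)
        first = int.from_bytes(digest[:8], "big") % n
        second = int.from_bytes(digest[8:16], "big") % n
        return first, second

    def get(self, key):
        # A lookup inspects at most 2 buckets * 4 slots = 8 entries.
        for b in self._candidates(key):
            for k, v in self.buckets[b]:
                if k == key:
                    return v
        return None

    def put(self, key, value):
        # Overwrite in place if the key is already stored.
        for b in self._candidates(key):
            for i, (k, _) in enumerate(self.buckets[b]):
                if k == key:
                    self.buckets[b][i] = (key, value)
                    return True
        entry = (key, value)
        for _ in range(self.MAX_KICKS):
            candidates = self._candidates(entry[0])
            for b in candidates:
                if len(self.buckets[b]) < self.SLOTS:
                    self.buckets[b].append(entry)
                    return True
            # Both buckets are full: swap with a random victim and try to
            # re-home the victim in one of its own candidate buckets.
            b = self.rng.choice(candidates)
            i = self.rng.randrange(self.SLOTS)
            entry, self.buckets[b][i] = self.buckets[b][i], entry
        # Gave up: the entry still held here is dropped. For a cache that is
        # an eviction; a general-purpose table would resize instead.
        return False
```

The property that matters for reads is that a key can only ever live in one of its two candidate buckets, so a lookup touches a small, fixed number of slots no matter how full the table is.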
To save space while still approximating an LRU cache eviction policy, the authors use the CLOCK algorithm. The paper also describes how their cache eviction algorithm works safely with their concurrent cuckoo hashing algorithm.
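For readers who have not met CLOCK before, here is a small sketch of the general algorithm: one "recently used" bit per slot and a hand that sweeps over the slots, instead of a linked list that has to be reordered on every access. This is the textbook version, not MemC3's implementation; the class name `ClockCache` and its fields are hypothetical, and how the eviction interacts with cuckoo displacement is not shown.

```python
class ClockCache:
    """Fixed-capacity cache with CLOCK (second-chance) eviction. Illustration only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys = [None] * capacity    # slot -> key
        self.values = [None] * capacity  # slot -> value
        self.ref = [0] * capacity        # one "recently used" bit per slot
        self.slot_of = {}                # key -> slot
        self.hand = 0                    # position of the clock hand

    def get(self, key):
        slot = self.slot_of.get(key)
        if slot is None:
            return None
        self.ref[slot] = 1  # mark as recently used; nothing is reordered
        return self.values[slot]

    def put(self, key, value):
        slot = self.slot_of.get(key)
        if slot is None:
            slot = self._find_victim()
            old_key = self.keys[slot]
            if old_key is not None:
                del self.slot_of[old_key]  # evict the previous occupant
            self.keys[slot] = key
            self.slot_of[key] = slot
        self.values[slot] = value
        self.ref[slot] = 1

    def _find_victim(self):
        # Sweep the hand: a set bit is cleared (second chance); the first
        # empty slot or slot with a clear bit becomes the victim.
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % self.capacity
            if self.keys[slot] is None or self.ref[slot] == 0:
                return slot
            self.ref[slot] = 0
```

With a capacity of 3, inserting a, b, c and then d evicts a (every bit was set, so the hand goes full circle). If b is then read before e is inserted, the hand skips b and evicts c instead, which is the LRU-like behavior the single bit buys.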
The paper ends with an evaluation section that compares MemC3 and Memcached in a variety of configurations and workloads and shows that MemC3 almost always outperforms Memcached.
(my thoughts and summary on the first paper I read this month as part of my endeavor to read ten research papers in June)
This paper introduces readers to PCHECK, a tool the researchers designed to check configuration values used by programs. The goal of PCHECK is to check configuration values at system initialization time, and fail fast if any errors in these configuration values are detected. PCHECK accomplishes this by emulating application code that uses configuration values, and also ensures that this emulation does not cause any internal or external side effects.
The authors motivate the design of PCHECK by observing that most software systems do not check configuration values at system initialization time. Configuration values are typically checked right before they are used. This is a problem because an incorrect configuration value might cause the entire software system to fail, long after the system has started. This is especially problematic when a configuration error causes a fault tolerance or system resiliency component to fail; such components are typically instantiated only under exceptional circumstances, hence the configuration values used by these components might not be checked for correctness at overall system start up time.
PCHECK works by generating checkers that are run at system initialization time. These checkers emulate configuration value usage in the actual system code. PCHECK captures all the context (global variables, local dependent variables, etc.) required to emulate these usages. PCHECK checkers are side effect free and work by creating copies of all the system variables so as to not modify any state. PCHECK handles system calls, libc function calls, core Java library function calls, etc. by providing what the authors term a “check utility”, which is essentially a mock function that doesn’t mutate external state but validates the function call arguments. Other external dependencies that might arise while using a configuration value, such as reading/writing to an external file, sending/receiving data from the network, etc. are dealt with in a similar fashion: external files are checked for metadata (while the paper does not explicitly say what these metadata checks are, I’m assuming they are something along the lines of “Does this file exist?”, “Do I have the permissions to read/write to this file?”, etc.) and network addresses are checked for reachability. The PCHECK checkers detect configuration issues by checking for errno values and for signs from the program (for example, the program terminating with an exit(1)) that something went wrong while running the emulated configuration value usage code.
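To illustrate the "check utility" idea, here is a hand-written sketch of what a side-effect-free check for one file-path configuration value could look like. PCHECK generates its checkers automatically from the program's own code; this sketch only mimics the spirit of the result. The function names, the `backup_log_path` key, and the specific metadata checks are my assumptions (the same guesses as in the paragraph above), not something the paper specifies.

```python
import os


def check_writable_path(path):
    """Validate a config value that will later be opened for writing,
    without creating or modifying anything. Returns a list of problems."""
    problems = []
    if os.path.isdir(path):
        problems.append(f"{path!r} is a directory, not a file")
    elif os.path.exists(path):
        if not os.access(path, os.W_OK):
            problems.append(f"{path!r} exists but is not writable")
    else:
        # The file does not exist yet, so check where it would be created.
        parent = os.path.dirname(os.path.abspath(path))
        if not os.path.isdir(parent):
            problems.append(f"parent directory {parent!r} does not exist")
        elif not os.access(parent, os.W_OK):
            problems.append(f"parent directory {parent!r} is not writable")
    return problems


def check_config_at_startup(config):
    """Run every checker at initialization time and fail fast on any error."""
    checkers = {"backup_log_path": check_writable_path}  # hypothetical config key
    errors = []
    for key, checker in checkers.items():
        if key in config:
            errors += [f"{key}: {problem}" for problem in checker(config[key])]
    if errors:
        raise ValueError("invalid configuration: " + "; ".join(errors))
```

The point of the exercise is the timing: a bad `backup_log_path` surfaces at startup, rather than hours later when the rarely used component that writes to it is first instantiated.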
(summaries of and key takeaways from two papers I read in December)
Paper: Three States and a Plan: The A.I. of F.E.A.R (this was the first game design paper I’ve read and it was pretty awesome, combining two of my Computer Science interests: graph theory and A.I.)
- Enemy A.I. in F.E.A.R = FSM to express states + A* to plan sequence of actions to reach goal state.
- Separating goals from how the goals can be achieved (i.e. actions) leads to less complex code, code reusability, and facilitates code composition to build more complex systems.
- The planning system in F.E.A.R is called Goal-Oriented Action Planning and is based on STRIPS with several modifications.
- A* is used to find the sequence of actions with the least cost to reach a goal state. A* is used on a graph in which the nodes are states of the world and the edges are actions that cause the world to change from one state to another.
- Effects and preconditions for actions are represented as a fixed size array capturing the state of the world AND as procedural functions.
- Squad behavior is implemented by periodically clustering A.I. that are in close physical proximity and issuing squad orders. These orders are simply goals that the A.I. prioritizes (according to its current goals) and satisfies if appropriate.
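The planning idea in the F.E.A.R. notes above (A* over a graph whose nodes are world states and whose edges are actions) can be sketched in a few lines. This is a toy under my own assumptions: the action names, costs, and facts are invented, and I represent the world state as a set of true facts rather than the fixed-size array the paper describes.

```python
import heapq
from itertools import count

# Hypothetical actions: name -> (cost, preconditions, facts added, facts removed)
ACTIONS = {
    "load_weapon":   (1, {"has_weapon"}, {"weapon_loaded"}, set()),
    "attack_ranged": (2, {"weapon_loaded"}, {"target_dead"}, {"weapon_loaded"}),
    "attack_melee":  (5, set(), {"target_dead"}, set()),
}


def plan(start, goal, actions=ACTIONS):
    """Return the cheapest list of action names that reaches the goal, or None."""
    start, goal = frozenset(start), frozenset(goal)

    def heuristic(state):
        # Number of goal facts not yet true. Admissible in this toy because
        # the goal has a single fact and every action costs at least 1.
        return len(goal - state)

    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(heuristic(start), next(tie), 0, start, [])]
    best_cost = {start: 0}
    while frontier:
        _, _, cost_so_far, state, steps = heapq.heappop(frontier)
        if goal <= state:
            return steps
        if cost_so_far > best_cost.get(state, float("inf")):
            continue  # stale queue entry
        for name, (cost, pre, add, remove) in actions.items():
            if not pre <= state:
                continue  # preconditions not met in this world state
            nxt = frozenset((state - remove) | add)
            new_cost = cost_so_far + cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                priority = new_cost + heuristic(nxt)
                heapq.heappush(frontier, (priority, next(tie), new_cost, nxt, steps + [name]))
    return None


print(plan({"has_weapon"}, {"target_dead"}))  # ['load_weapon', 'attack_ranged']
print(plan(set(), {"target_dead"}))           # ['attack_melee']
```

The same goal yields different plans depending on the starting world state, and none of the actions knows about the others. That is the separation of goals from actions that the notes above describe.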
- Kraken is a system that load tests production systems (data centers or services) at Facebook by diverting live user traffic to the systems under test, and monitoring metrics like p99 latency and 5xx error rates to determine if traffic to the system under test should be increased or decreased, and by what amount.
- Real user traffic is the best representative of load to your system. By using real user traffic to test production systems you don’t have to worry about capturing complex system dependencies and interactions that arise out of a SOA.
- Kraken diverts traffic by modifying edge weights (from POPs to data centers), cluster weights (from web frontend cluster load balancers to the web frontend clusters), and server weights (from service load balancers to individual servers that make up the service).
- Kraken reads test input and updates configuration files that are read by Proxygen to implement the edge and cluster weighting. Kraken then reads system metrics from Gorilla to dynamically determine how to adjust the edge and cluster weights based on how the system under test is performing.
- Kraken tests allow Facebook to measure a server’s, cluster’s, and region’s capacity.
- Kraken helps increase system utilization by exposing bottlenecks. By analyzing system metrics and how they change under different levels of load, Facebook was able to fix problems in their system. One of the issues identified in a system was poor load balancing, for which pick-2 load balancing was used as a solution.
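The feedback loop at the heart of Kraken (watch health metrics, then decide whether to send more or less traffic) might look something like the sketch below. To be clear, this is not Kraken's actual policy, which my notes do not capture. The function name, the thresholds, the step sizes, and the "back off fast, ramp up slowly" rule are all assumptions chosen to show the shape of such a controller.

```python
def next_weight(weight, p99_ms, error_rate,
                p99_limit_ms=500.0, error_limit=0.01,
                step_up=0.05, back_off=0.5, max_weight=1.0):
    """Return the next traffic weight for the system under test.

    All limits and step sizes are hypothetical defaults.
    """
    if p99_ms > p99_limit_ms or error_rate > error_limit:
        # Unhealthy: shed load quickly to protect real users.
        return max(0.0, weight * back_off)
    # Healthy: add load, in smaller steps as either metric nears its limit.
    headroom = min(1 - p99_ms / p99_limit_ms, 1 - error_rate / error_limit)
    return min(max_weight, weight + step_up * headroom)


weight = 0.2
for p99_ms, error_rate in [(200, 0.001), (350, 0.002), (620, 0.004)]:
    weight = next_weight(weight, p99_ms, error_rate)
    print(round(weight, 3))
```

In a real deployment the returned weight would be written to the configuration that the load balancers read, and the metrics would come from the monitoring system on the next iteration.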
(summaries of and key takeaways from two papers I read last month)
- Building a low latency, consistent, and scalable secondary index for a NoSQL distributed store is hard.
- Partitioning your secondary index independently of your data (i.e. not co-locating your secondary index with the data) is key for high performance.
- SLIK returns consistent data without the need for transactions at write time by using what they term an “ordered write approach”. The SLIK client library shields applications from consistency checking by primary key hashes at read time.
- I’ve used rule-based programming languages like Prolog before, but I did not know that rule-based programming can be used for non-AI tasks, such as issuing concurrent, pipelined RPC requests, which is how SLIK uses it in its client API implementation.
- SLIK reuses its underlying system’s (i.e. RAMCloud’s) storage system to store a representation of the secondary indexes SLIK builds for fast recovery in the face of failure.
- Measure n times (where n >= 2) cut once: SLIK keeps its design simple by not implementing a garbage collection mechanism to handle invalid secondary index entries. The paper explains how the space savings gained by a garbage collector in their system are negligible.
- By performing expensive operations like index creation in the background without locking the entire table SLIK ensures that performance never suffers.
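Here is how I picture the ordered write approach from the SLIK notes above, as a toy in-memory sketch. It relies on my understanding of the ordering (add new index entries, then write the object, then remove stale index entries), with the object treated as the source of truth at read time. The class, the `crash_after_step` knob, and the use of plain primary keys instead of primary key hashes are simplifications of mine.

```python
class ToyIndexedStore:
    """Objects and their secondary index live in separate structures,
    standing in for separate servers. Illustration only."""

    def __init__(self):
        self.objects = {}  # primary key -> {"name": secondary key}
        self.index = {}    # secondary key -> set of primary keys

    def write(self, pk, name, crash_after_step=None):
        old = self.objects.get(pk)
        # Step 1: add the new index entry before the object exists.
        self.index.setdefault(name, set()).add(pk)
        if crash_after_step == 1:
            return
        # Step 2: write the object itself.
        self.objects[pk] = {"name": name}
        if crash_after_step == 2:
            return
        # Step 3: only now remove the stale index entry, if any.
        if old is not None and old["name"] != name:
            self.index[old["name"]].discard(pk)

    def lookup(self, name):
        # The index may hold extra entries (never missing ones), so each
        # candidate is verified against the object before it is returned.
        matches = []
        for pk in sorted(self.index.get(name, ())):
            obj = self.objects.get(pk)
            if obj is not None and obj["name"] == name:
                matches.append(pk)
        return matches
```

Whichever step a write is interrupted at, a lookup never returns an object whose secondary key does not match the query, and it never misses an object that was fully written. The cost is an occasional leftover index entry, which ties in with the garbage collection point above.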
- Measure n times (where n >= 2) cut once: a 10% increase in cache hit rate in Flywheel only led to a 1-2% reduction in mobile page load times. This is because of the inherent limitations in web page design and cell phone device hardware (as revealed and evaluated in this paper). A systematic evaluation of the problem (i.e. quantifying the gains of caching in mobile web performance) might have saved engineering effort in improving cache hit rate.
- I was surprised that page load time was used as the evaluation criterion for cache performance, when above-the-fold load time seems like a more appropriate metric. As revealed in section 3.3 of the paper, this is because above-the-fold load time is harder to measure.
- The load time for the critical path of a web page determines its overall page load time, and if the elements along the path are not cacheable, then more caching will have zero benefit to page load time. As the experiments detailed in the paper show, the amount of data on the critical path that can be cached is much smaller than the amount of overall data that can be cached for most mobile web pages.
- The bottleneck for mobile web performance is the slow CPUs on mobile devices. Since the computational complexity involved with rendering the page is so high, caching does not give us the page load time reductions we expect on mobile devices.
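The critical path argument above is easy to check on a toy dependency graph. In the sketch below the page, its resources, and all the millisecond costs are made up; the only thing it demonstrates is that caching a resource that is not on the critical path leaves the page load time unchanged.

```python
# Hypothetical page: name -> (cost in ms, dependencies, cacheable?)
PAGE = {
    "html":     (100, [], False),
    "app.js":   (300, ["html"], False),  # dominated by CPU-bound script work
    "logo.png": (150, ["html"], True),
    "render":   (200, ["app.js", "logo.png"], False),
}


def page_load_time(page, cached=frozenset()):
    """Length of the longest dependency chain, i.e. the critical path."""
    finish = {}

    def finish_time(name):
        if name not in finish:
            cost, deps, cacheable = page[name]
            if cacheable and name in cached:
                cost = 0  # a cache hit makes this resource free
            finish[name] = cost + max((finish_time(d) for d in deps), default=0)
        return finish[name]

    return max(finish_time(name) for name in page)


print(page_load_time(PAGE))                       # 600
print(page_load_time(PAGE, cached={"logo.png"}))  # 600, no improvement
```

Here the critical path is html, app.js, render. The image is cacheable, but it finishes long before the script does, so a perfect cache hit on it saves nothing.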
(domum means home in Latin)
High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads is a fantastic paper. Its primary focus is how to build distributed systems that are both highly available and strongly consistent. This is achieved by building multi-homed systems. As the paper describes them —
Such systems run hot in multiple datacenters all the time, and adaptively move load between datacenters, with the ability to handle outages of any scale completely transparently. 
While the paper mostly addresses building multi-homed systems in the context of distributed stream processing systems, the concepts and ideas are general enough that they can be applied to any large scale distributed software system with some modifications.
Before designing a distributed system that is resilient to failures it is paramount to understand what a failure even means in the context of software systems. Section 3 of the paper talks about common failure scenarios and highlights an important fact: partial failures are common, and “are harder to detect, diagnose, and recover from” (compared to total failures). An important takeaway from this section is that when designing a new system (or trying to improve an old/current system) one should always think about what partial failures can occur, and how the system can/would react to them.
The next section motivates the need for multi-homed systems by first talking about singly-homed and failover-based systems. While singly-homed and failover-based systems are common, one typically does not run into multi-homed systems unless one operates at Google-scale (or close to). Building multi-homed systems is hard. But they offer significant benefits over singly-homed and failover-based systems in the face of (partial or total) failure. Google leverages its existing infrastructure, in particular Spanner, to build multi-homed systems with high availability.
Section 5 is the most interesting portion of the paper and talks about the challenges inherent in building multi-homed systems. My main takeaway from this section is that it is virtually impossible to build a multi-homed distributed system without a system like Spanner (which is itself a multi-homed system) serving as the foundation: many of Spanner’s features, like global synchronous replication, reads at a particular version, etc. are used to solve the challenges mentioned in this section.
The paper ends with the description of three multi-homed systems at Google: F1/Spanner, Photon, and Mesa. I highly recommend reading the papers for each of these systems as well, as they have a lot more details about how these complex systems were built.
(This post is a summary of two papers I have recently read. Papir is the Norwegian word for paper)
Real-Time Twitter Recommendation: Online Motif Detection in Large Dynamic Graphs is a paper that was presented at VLDB 2016. It combines two of my favorite topics, distributed systems and graph theory, into a short (2 pages!) paper. It presents a simplified version of the algorithm that Twitter uses to detect motifs in real-time in a user’s social graph, which is then used to generate recommendations for the user. One thing I liked about this paper is that it presents naive solutions to the problem at hand before diving into the elegant solution that Twitter uses. It then explains how that solution works at Twitter scale through graph partitioning, pruning, and offline data structure generation.
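As a rough illustration of what online motif detection can mean, the sketch below looks for one simple pattern: if at least k accounts that user A follows all start following account C within a short time window, suggest C to A. This is my own simplified reading, not the algorithm from the paper. The class name, the values of k and the window, and the split into an offline "followers" map plus an in-memory list of recent edges are assumptions, and nothing here is partitioned or pruned the way a production system would need.

```python
from collections import Counter, defaultdict


class MotifDetector:
    def __init__(self, followers, k=2, window=600):
        self.followers = followers       # static, built offline: B -> set of A who follow B
        self.recent = defaultdict(list)  # dynamic: C -> [(timestamp, B)] for new B -> C edges
        self.k = k                       # hypothetical threshold
        self.window = window             # hypothetical window, in seconds

    def on_follow(self, b, c, now):
        """Handle a new 'b follows c' edge; return users to recommend c to."""
        # Keep only edges inside the time window, one per follower.
        edges = [(t, x) for t, x in self.recent[c]
                 if now - t <= self.window and x != b]
        edges.append((now, b))
        self.recent[c] = edges
        # Count, for every user A, how many of c's recent followers A follows.
        counts = Counter()
        for _, recent_follower in edges:
            for a in self.followers.get(recent_follower, ()):
                counts[a] += 1
        already_following = self.followers.get(c, set())
        return sorted(a for a, n in counts.items()
                      if n >= self.k and a != c and a not in already_following)


detector = MotifDetector({"b1": {"a", "z"}, "b2": {"a"}, "b3": {"z"}})
print(detector.on_follow("b1", "c", now=0))     # []
print(detector.on_follow("b2", "c", now=100))   # ['a']
print(detector.on_follow("b3", "c", now=1000))  # [] because the earlier edges expired
```

The naive part is obvious once written down: every new edge walks the full follower lists of every recent follower. Making that step cheap at scale is where the partitioning and pruning come in.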
Design patterns for container-based distributed systems is a paper by Google that talks about software design patterns that are emerging from software systems that are built around containers. Software like Docker and CoreOS has made working with containers easier, and more and more companies are moving towards a container based ecosystem. Google was one of the first companies to use containers, and this paper contains design and architecture patterns that they have observed in their container based systems. The design patterns presented are grouped under three main categories, of which I enjoyed reading about “Multi-node application patterns” the most. This section deals with design patterns in distributed systems, where each node holds multiple related containers (called “pods” in the paper). It was interesting to read about how distributed system problems like leader election, scatter-gather, etc. can be handled with language-agnostic containers rather than with language-specific libraries. I loved this line from the end of the paper, which made me think of containers in an entirely new light:
In all cases, containers provide many of the same benefits as objects in object-oriented systems, such as making it easy to divide implementation among multiple teams and to reuse components in new contexts. In addition, they provide some benefits unique to distributed systems, such as enabling components to be upgraded independently, to be written in a mixture of languages, and for the system as a whole to degrade gracefully.