(domum means home in Latin)
High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads is a fantastic paper. Its primary focus is how to build distributed systems that are both highly available and strongly consistent. This is achieved by building multi-homed systems. As the paper describes them —
Such systems run hot in multiple datacenters all the time, and adaptively move load between datacenters, with the ability to handle outages of any scale completely transparently. 
While the paper mostly addresses building multi-homed systems in the context of distributed stream processing systems, the concepts and ideas are general enough that they can be applied to any large scale distributed software system with some modifications.
Before designing a distributed system that is resilient to failures it is paramount to understand what a failure even means in the context of software systems. Section 3 of the paper talks about common failure scenarios and highlights an important fact — partial failures are common, and “are harder to detect, diagnose, and recover from”  (compared to total failures). An important takeaway from this section is that when designing a new system (or trying to improve an old/current system) one should always think about what partial failures can occur, and how the system can/would react to it.
The next section motivates the need for multi-homed systems by first talking about singly-homed and failover-based systems. While singly-homed and failover-based systems are common, one typically does not run into multi-homed systems unless one operates at Google-scale (or close to). Building multi-homed systems is hard. But they offer significant benefits over singly-homed and failover-based systems in the face of (partial or total) failure. Google leverages its existing infrastructure, in particular Spanner, to build multi-homed systems with high availability.
Section 5 is the most interesting portion of the paper and talks about the challenges inherent in building multi-homed system. My main takeaway from this section is that it is virtually impossible to build a multi-homed distributed system without a system like Spanner (which is itself a multi-homed system) serving as the foundation — many of Spanner’s features, like global synchronous replication, reads at a particular version, etc. are used to solve the challenges mentioned in this section.
The paper ends with the description of three multi-homed systems at Google: F1/Spanner, Photon, and Mesa. I highly recommend reading the papers for each of these systems as well, as they have a lot more details about how these complex systems were built.
One thought on “Domum”