Inspired by a tutorial on TensorFlow that was on HN recently, I decided to read the TensorFlow paper. The paper had been sitting in my “To Read” folder for quite some time, but for various reasons I never got around to it. It is also the first AI/ML paper I’ve read in 2016, so I was excited to dive right in.
At 19 pages, this is one of the longest papers I’ve read. But it is extremely well written, and the diagrams, charts, and code samples interspersed throughout the text make it fun to read.
The basic idea of TensorFlow, a single system that can work across heterogeneous computing platforms to solve AI/ML problems, is incredibly powerful. I fell in love with the directed graph API TensorFlow uses to describe the computations it will run (this may or may not be related to the fact that I also love graph theory). The multi-device (and distributed) execution algorithm explained in the paper is intuitive and easy to understand. A major component of multi-device and distributed execution of a TensorFlow graph is deciding which device to place each node on. While the paper does explain the placement algorithm in section 3.2.1, I wish the authors had gone into more detail: which placement algorithms didn’t work, specifics of the greedy heuristic used, and so on.
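To make the graph API concrete, here is a minimal sketch of what building and running a small graph looked like in the TensorFlow of that era (roughly the 0.x Python API); the model, names, and shapes are my own invented example, not code from the paper. Note how tf.device lets you pin an op to a device by hand, whereas the placement algorithm from section 3.2.1 normally makes that decision for you.

```python
import numpy as np
import tensorflow as tf

# Build a directed graph: nodes are operations, edges carry tensors.
graph = tf.Graph()
with graph.as_default():
    x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
    W = tf.Variable(tf.zeros([784, 10]), name="weights")
    b = tf.Variable(tf.zeros([10]), name="bias")

    # Pin this op explicitly; swap in "/gpu:0" to place it on a GPU.
    with tf.device("/cpu:0"):
        logits = tf.matmul(x, W) + b

    y = tf.nn.softmax(logits)
    init = tf.initialize_all_variables()

# Nothing has executed yet; a Session runs the requested subgraph.
with tf.Session(graph=graph) as sess:
    sess.run(init)
    batch = np.random.rand(32, 784).astype("float32")
    print(sess.run(y, feed_dict={x: batch}).shape)  # (32, 10)
```

The appeal is exactly the graph-theoretic one: you describe the whole computation as a DAG up front, and the runtime is then free to partition it across devices.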
Sections 5, 6, and 7 were my favorite portions of the paper. Section 5 dives into some of the performance optimizations used in TensorFlow. It would have been awesome if the authors had given more detail about the scheduling algorithm used to minimize memory usage and network bandwidth consumption, and about what other scheduling optimizations TensorFlow employs, as I find scheduling algorithms very interesting.
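The paper keeps the scheduler abstract, but to give a flavor of the problem, here is a toy greedy scheduler (my own construction, not the paper’s algorithm): among ops whose inputs are ready, it always runs the one with the smallest memory footprint first, which has the side effect of delaying an expensive op like a Receive until it is actually needed.

```python
import heapq

def greedy_schedule(ops, deps, mem_cost):
    """Order ops so that, among ready ops, the cheapest-in-memory runs first.

    ops      -- list of op names
    deps     -- dict: op -> set of ops it depends on
    mem_cost -- dict: op -> estimated bytes of temporary memory
    """
    remaining = {op: set(deps.get(op, ())) for op in ops}
    ready = [(mem_cost[op], op) for op in ops if not remaining[op]]
    heapq.heapify(ready)
    order = []
    while ready:
        _, op = heapq.heappop(ready)
        order.append(op)
        # Unblock any op that was waiting on the one we just ran.
        for other in ops:
            if op in remaining[other]:
                remaining[other].discard(op)
                if not remaining[other]:
                    heapq.heappush(ready, (mem_cost[other], other))
    return order

# Tiny invented graph: recv is expensive, so it is delayed until needed.
ops = ["a", "recv", "b", "c"]
deps = {"b": {"a"}, "c": {"b", "recv"}}
mem_cost = {"a": 1, "recv": 100, "b": 1, "c": 1}
print(greedy_schedule(ops, deps, mem_cost))  # ['a', 'b', 'recv', 'c']
```

A real scheduler has to balance far more than one cost at a time, which is exactly why I wanted more detail here.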
Section 6 talks about the experience of porting the Inception model over to TensorFlow. While the strategies mentioned in this section are specific to machine learning systems, I feel that some of them can be tweaked slightly to be generally applicable to all software systems. For instance,
“Start small and scale up” (strategy #2)
is directly applicable to any software system. Similarly,
“Make a single machine implementation match before debugging a distributed implementation” (strategy #4)
can be rephrased as
“Make a single machine implementation work before debugging a distributed implementation”
and becomes generally applicable to building distributed systems.
Section 7 explains how TensorFlow can be used to speed up stochastic gradient descent (SGD). Again, while the idioms presented in this section are used to speed up SGD, I feel that they are general-purpose enough to be applied to other algorithms and systems as well. The diagrams in this section are amazing and do a great job of illustrating the differences between the various parallelism and concurrency idioms.
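As a rough illustration of the data-parallelism idiom, here is a toy synchronous data-parallel SGD loop in plain NumPy (my own sketch, not code from the paper): each simulated worker computes a gradient on its shard of the batch, and the gradients are averaged before a single parameter update.

```python
import numpy as np

def worker_gradient(w, X, y):
    """Mean-squared-error gradient for a linear model on one data shard."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def sync_data_parallel_sgd(w, X, y, num_workers=4, lr=0.1, steps=200):
    for _ in range(steps):
        # Each worker sees a disjoint shard of the batch (simulated serially here).
        shards = zip(np.array_split(X, num_workers),
                     np.array_split(y, num_workers))
        grads = [worker_gradient(w, Xs, ys) for Xs, ys in shards]
        # Synchronous update: average the workers' gradients, apply once.
        w = w - lr * np.mean(grads, axis=0)
    return w

rng = np.random.RandomState(0)
X = rng.randn(256, 8)
true_w = rng.randn(8)
y = X @ true_w
w = sync_data_parallel_sgd(np.zeros(8), X, y)
print(np.allclose(w, true_w, atol=1e-3))  # True: the workers recover the weights
```

In the asynchronous variant described in the paper, each replica would instead apply its gradient to the shared parameters as soon as it is ready, without waiting for the others.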
EEG, the internal performance tool mentioned in the paper, sounds very interesting. While it is probably out of scope for a paper that focuses on TensorFlow, I’d love to learn more about EEG. It seems like a very powerful tool and could probably be extended to work with other systems as well.
The paper ends with a survey of related systems. This section proved to be a valuable source for finding new AI/ML and systems papers to read.
I loved this paper.