software engineering – Karan Parikh

Oxide

I just finished sections 24 to 36 of chapter 4 of the Rust book. Here’s what I felt:

The crates and modules system to organize projects is simple and intuitive.

Attributes remind me of Java Annotations.

Associated types seem like an improvement over generics. They seem like an important concept to write effective Rust code and I wish that this chapter had gone into more details and had a larger example.

Rust supports macros. As the chapter mentions, I probably wouldn’t write one unless I absolutely had to. If Rust supported variable number of arguments to functions one could probably implement vec! using that plus generics.

unsafe seems like a very powerful and tricky Rust feature. I wish the chapter had an actual example that demonstrated how to use unsafe correctly. And also an example of when not to use unsafe — for example when you’re writing bad Rust code and using unsafe to mask a bad design.

(You can find my thoughts on the previous chapters / sections here)

Ferric

I spent the past few days working through the first 3 chapters and the first 23 sections of chapter 4 of the Rust book. Here are my initial thoughts on Rust:

Cargo is pretty sweet. It’s an easy to understand build and dependency management system. Even though I’m only using it for simple things so far I’m really happy with it and have not run into any issues. I’ve also gotten very used to running cargo new <project_name> [--bin] to start new Rust projects.

Compared to Go, Rust is a much larger language, with many more concepts to learn. Go is a simple language to learn. I feel that Rust has a steeper learning curve.

Memory safety is one of Rust’s strongest selling points. It is one of the trickier concepts to understand and is unlike anything I’ve experienced in C, C++, Java, Go, Python, etc. I’d say the concepts that come close to resembling it are unique_ptr and shared_ptr in C++11. Consequently I spent the most time on the three sections dedicated to references and borrowing, lifetimes, and mutability. Most of the bugs that I ran into while writing Rust code were also related to these concepts.

At first glance strings in Rust seem pretty complicated.

Matching and patterns are awesome. It reminds me of pattern matching in Scala and OCaml.

Rust has generics. This was something I missed in Go.

I haven’t gotten used to writing functions without an explicit return yet.

The Rust book is very well written but there are a few areas of improvement. My major gripe is that it introduces new language constructs but doesn’t explain what they do. For instance the chapter on trait objects introduces the format! macro for the first time without explaining what it does, the chapter on closures uses a Box to return a closure from a function without going into what exactly a Box is etc.

Inspired by a tutorial on TensorFlow that was on HN recently I decided to go and read the TensorFlow paper. This paper has been sitting in my “To Read” folder for quite some time now but for various reasons I never got around to reading it. This is also the first AI/ML paper I’ve read in 2016 so I was excited to dive right in.

At 19 pages long this is one of the longest papers I’ve read. But it is extremely well written, with lots of diagrams, charts, and code samples interspersed throughout the text that make this paper fun to read.

The basic idea of TensorFlow, to have one system that can work across heterogenous computing platforms to solve AI/ML problems, is incredibly powerful. I fell in love with the directed graph API used by TensorFlow to describe computations that will run on it (this may or may not be related to the fact that I also love graph theory). The multi-device (and distributed) execution algorithm explained in the paper is quite intuitive and easy to understand. A major component of multi device / distributed execution of the TensorFlow graph is deciding which device to place a node on. While the paper does explain the algorithm used in section 3.2.1 I wish they had gone into more details and talked about what graph placement algorithms didn’t work, details about the greedy heuristic used, etc.

Sections 5, 6, and 7 were my favorite portions of the paper. Section 5 dives into some of the performance optimizations used in TensorFlow. It would have been awesome if the authors had given more details about the scheduling algorithm used to minimize memory and network bandwidth consumption. I would have also liked to know what other scheduling optimizations were used in TensorFlow as I find scheduling algorithms very interesting.

Section 6 talks about the experience of porting the Inception model over to TensorFlow. While the strategies mentioned in this section are specific to machine learning systems, I feel that some of them can be tweaked a little bit to be generally applicable to all software systems. For instance

“Start small and scale up” (strategy #2)

is directly applicable to any software system. Similarly,

“Make a single machine implementation match before debugging a distributed implementation” (strategy #4)

Can be rephrased as

“Make a single machine implementation work before debugging a distributed implementation”

and be generally applicable to building distributed systems.

Section 7 explains how TensorFlow can be used to speed up stochastic gradient descent (SGD). Again, while the idioms presented in this section are used to speed up SGD, I feel that they are general purpose enough where they can be applied to other algorithms/systems as well. The diagrams in this section are amazing and do a great job of illustrating the differences between the various parallelism and concurrency idioms.

EEG, the internal performance tool mentioned in the paper, sounds very interesting. While it is probably not in the scope of a paper that focuses on TensorFlow I’d love to learn more about EEG. It seems like a very powerful tool and could probably be extended to work with other systems as well.

The paper ends with a survey of related systems. This section proved to be a valuable source for finding new AI/ML and systems papers to read.

I loved this paper.

Frequent

(inspired by this post)

Here is a (definitely incomplete) list of programming / technology related websites (in no particular order) that I frequently read —

High Scalability: Excellent articles on real world software architecture and design.
Julia Evan’s Blog: Well written and fun to read posts on a wide range of topics from system programming to machine learning.
Preshing on Programming: A great resource for data structures and C++.
The Morning Paper: Research papers. Research papers. Research papers.
Hacker News: I feel this is the best resource to stay up-to-date on what’s happening in the fields of technology, software engineering, and computer science.
Dan Luu’s blog: I hope to one day be able to write technical posts as well as they do.
LWN.net: Incredible posts about system engineering concepts.

Cache

One long standing issue that I’ve had with the LinkedIn GitHub page that I helped design was that it because it relied on the GitHub public API to fetch all the data, if the user accessed the page from a rate-limited IP address the rendered page would be blank as no data would be returned by the API. I had some time on my hands today and decided to fix this bug.

The simplest fix for this bug is to cache the GiHub API response in a file, and when you get rate-limited by the GitHub API fall back to reading from a cached API response. Since the raw API response contained lots of information that was not required to generate the website, I decided to add an intermediate filtering step to only extract the relevant information from the raw GitHub API response. The JSON data generated by this filtering step is the final cache used by the webpage.

To test the code I’d written and to make sure everything works as expected I needed to rate limit myself. This was easily done using the (amazing) Python Requests library.

You can find my fix for this bug here.

Update — I realized that my original patch failed to use the GitHub API response when the user was not rate limited. My last commit should fix this.

Analyze

Inspired by Spotify’s year in music feature (I wrote a post on it as well), I decided to analyze music related data that I had at my disposal. The data that I chose was the list of all the artists that I’ve seen live (78 at the time of doing this analysis).

There were two things that I wanted to surface from this data:

Which genres of music have I seen the most live?
Which artists should I see next, based on the artists I’ve already seen?

To answer both these questions I decided to use the Echo Nest API. And Python. All the code I wrote to analyze the data can be found here. I wrote this code when I should have been sleeping so the quality is not the best. Oh well.

About halfway through writing the code I decided that generating a word cloud for #1 would be cooler than simply listing the top genres. After failing miserably to get word_cloud working on my machine I decided to use an online word cloud generator instead. Here’s the resulting word cloud:

Screen Shot 2016-02-23 at 6.58.51 PM

The technique I used to answer #2 was to get the list of similar artists for each artist I’ve seen live, remove artists that I’ve already seen, and keep track of how many times each unseen artist is listed as a similar artist. Here are the top recommendations generated by my algorithm (format: <artist, number of times listed as similar artist>):

Swedish House Mafia, 5
The Raconteurs, 4
Cut Copy, 3
Beach Fossils, 3
Kaiser Chiefs, 3
Iron Maiden, 3
Dio, 3
Ellie Goulding, 2
Black Sabbath, 2 (seeing them in September)
Animals as Leaders, 2

My recommendation algorithm is extremely simple but produced surprisingly good results.

The Echo Nest API is incredible.

P.S. I tried using pyechonest but there didn’t seem to be a way to retrieve artist genre information which is why I decided to use their API directly.

Quality

I loved Fred’s post on the Zen of Erlang. I decided to check out his blog on the bus ride back from work today and read a few of his other posts. Two posts stood out to me.

Lessons Learned while Working on Large-Scale Server Software is, in my mind, required reading for any software engineer working on backend and infrastructure systems. Knowledge of lot of the concepts mentioned in this post (like the CAP Theorem, or the Fallacies of distributed computing) is essential in developing robust software systems. Fred’s style of writing is lots of fun to read, and I really his views on computer networks in this post —

There’s nothing more dangerous than someone going for a stroll over the network without knowing what it entails.

The network owes you nothing, and it doesn’t care about your feelings. It doesn’t deserve your trust.

The network is a necessary evil, not a place to expand to for fun.

The second post that stood out to me on how Queues Don’t Fix Overload. He explains in simple terms why queues (when used incorrectly) seem to solve your scaling problems in the short run while introducing a whole new class of problems of their own. As mentioned in the post, identifying bottlenecks in your system and scaling and fixing those is the correct way to deal with system overload.

Crash

Fred’s post on the Zen of Erlang is delightful. Fred (the author of ‘Learn You Some Erlang for Great Good!‘) does a fantastic job of explaining how Erlang embraces failure and crashes, and how it provides abstractions to deal with these so that the programer can focus on core application logic. Even if you don’t use Erlang the post is full of good software architecture patterns and principles that can be applied to any programming language and software project.

This post is making me question my decision of focussing solely on learning Rust this year.

Bisect

Have you ever been in a situation in which something has “gone wrong” (intentionally vague) between two git commits, say c1 and c2, and you’re trying to figure out which commit caused the issue? In other words, your code works fine at c1, but not at c2. Thus, a commit in the range (c1, c2] resulted in your code being in a “bad” (for some definition of “bad”) state.

One approach is to look at all the commits between (c1, c2] and see if any commit stands out as something that might have caused the issue. But there are times when looking at the changes is not enough, or it’s not clear why any of the changes would have broken anything, and you need to do some other work (run integration tests, run performance suites, run UI tests, etc.) in order to pinpoint the breaking commit.

“Why, this seems like a perfect opportunity to use binary search to figure out which commit caused a problem! All I need to do is a binary search in the range (c1, c2]. For a particular commit in this range (starting in the middle) I simply need to git checkout the code at that point, do whatever work I need to (explained above), and then make a decision on whether I need to search in the ‘upper half’ or ‘lower half'”

Enter git bisect. It allows you to focus on what went wrong, without having to manage the git + binary search state. In our scenario we’d simply mark c1 as a good commit, and c2 as a bad one, and then let git bisect work its magic in enabling us to discover what went wrong between (c1, c2].

I love git.

Research Paper: “AsterixDB: A Scalable, Open Source BDMS”

(AsterixDB was one of the systems mentioned in the “Red Book” that piqued my interest)

AsterixDB: A Scalable, Open Source BDMS gives the reader an overview of the AsterixDB system. AsterixDB is an impressive “big data management system” (BDMS) with several interesting features including a flexible data model, a powerful query language, data ingestion capabilities and distributed query execution. Two features that stood out to me were the ability to describe custom index types (B+-tree, R-tree, etc.) on your data, and the ability to query data that “lives” outside the system.

A majority of the paper is on the data definition and manipulation layer. The authors use an example of a social networking website to illustrate the power of AsterixDB’s data model and query language. Most of this section consists of code snippets (to define, load, and query the data) followed by an explanation of what exactly that snippet of code does, and what happens under the hood when that snippet is run. These code snippets make this section of the paper very easy to read and understand.

The data storage, indexing, and query execution components are described in the System Architecture section of the paper. These subsystems have separate papers ([1] and [2]) devoted to them; in this paper we are just given a brief overview of how they function and what their key features are. One piece of information that stood out to me in this section was the software layer described that grants any index data structure LSM update semantics. I thought this was a very novel idea to help speed up data ingestion and index building, while at the same time having the benefit of diverse index data structures based on the type of data being stored and indexed. The secondary index design is also interesting.

I really enjoyed reading this paper. I’ve added [1] and [2] to my “research papers to read next” list, and hope to get to it very soon.

[1] S. Alsubaiee, A. Behm, V. Borkar, Z. Heilbron, Y.-S. Kim, M. Carey, M. Dressler, and C. Li. Storage Management in AsterixDB. Proc. VLDB Endow., 7(10), June 2014.

[2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A Flexible and Extensible Foundation for Data-intensive Computing. ICDE, 0:1151–1162, 2011.

Tag: software engineering