One long-standing issue that I’ve had with the LinkedIn GitHub page that I helped design was that, because it relied on the GitHub public API to fetch all the data, the rendered page would be blank if the user accessed it from a rate-limited IP address, as no data would be returned by the API. I had some time on my hands today and decided to fix this bug.
The simplest fix for this bug is to cache the GitHub API response in a file, and to fall back to reading from that cached response when you get rate-limited by the GitHub API. Since the raw API response contained lots of information that was not required to generate the website, I decided to add an intermediate filtering step to extract only the relevant information from the raw GitHub API response. The JSON data generated by this filtering step is the final cache used by the webpage.
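The idea can be sketched in a few lines of Python. This is only an illustration of the filter-then-cache-with-fallback approach: the function names, the `RateLimited` exception, and the choice of fields to keep are all assumptions, not the code from the actual fix.

```python
import json
from pathlib import Path


class RateLimited(Exception):
    """Hypothetical error raised by the fetcher when the API refuses the request."""


def filter_repos(raw_repos):
    # Keep only what the page needs; this field choice is an assumption.
    return [
        {
            "name": repo["name"],
            "description": repo.get("description"),
            "stars": repo.get("stargazers_count", 0),
        }
        for repo in raw_repos
    ]


def load_repos(fetch, cache_path):
    """Return filtered repo data, falling back to the cache when rate limited.

    `fetch` is any callable that returns the raw API response (a list of
    dicts) or raises RateLimited.
    """
    cache_path = Path(cache_path)
    try:
        raw = fetch()
    except RateLimited:
        # No fresh data available: serve the last filtered response we saved.
        return json.loads(cache_path.read_text())
    filtered = filter_repos(raw)
    # Refresh the cache so the fallback stays reasonably current.
    cache_path.write_text(json.dumps(filtered))
    return filtered
```

Passing the fetcher in as a callable keeps the fallback logic easy to exercise without touching the network.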
To test the code I’d written and to make sure everything works as expected I needed to rate limit myself. This was easily done using the (amazing) Python Requests library.
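One way to do this, as a rough sketch: keep requesting an API URL until the quota runs out. The helper name and the `max_requests` cap are made up for illustration, and I am assuming the API reports its remaining quota in the `X-RateLimit-Remaining` response header and answers with a 403 or 429 status once the limit is hit. The `get` parameter defaults to `requests.get` and exists only so the loop can be tried without network access.

```python
def exhaust_rate_limit(url, get=None, max_requests=100):
    """Request `url` until the API reports no remaining quota.

    Returns the number of requests made. `max_requests` is a safety cap.
    """
    if get is None:
        import requests  # third-party library; only needed for real HTTP calls
        get = requests.get
    for made in range(1, max_requests + 1):
        response = get(url)
        # Assumption: a missing header means we cannot tell, so keep going.
        remaining = int(response.headers.get("X-RateLimit-Remaining", "1"))
        if response.status_code in (403, 429) or remaining == 0:
            return made
    return max_requests
```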
You can find my fix for this bug here.
Update — I realized that my original patch failed to use the GitHub API response when the user was not rate limited. My last commit should fix this.
Inspired by Spotify’s year in music feature (I wrote a post on it as well), I decided to analyze music related data that I had at my disposal. The data that I chose was the list of all the artists that I’ve seen live (78 at the time of doing this analysis).
There were two things that I wanted to surface from this data:
- Which genres of music have I seen the most live?
- Which artists should I see next, based on the artists I’ve already seen?
To answer both these questions I decided to use the Echo Nest API. And Python. All the code I wrote to analyze the data can be found here. I wrote this code when I should have been sleeping so the quality is not the best. Oh well.
About halfway through writing the code I decided that generating a word cloud for #1 would be cooler than simply listing the top genres. After failing miserably to get word_cloud working on my machine I decided to use an online word cloud generator instead. Here’s the resulting word cloud:
The technique I used to answer #2 was to get the list of similar artists for each artist I’ve seen live, remove artists that I’ve already seen, and keep track of how many times each unseen artist is listed as a similar artist. Here are the top recommendations generated by my algorithm (format: <artist, number of times listed as similar artist>):
- Swedish House Mafia, 5
- The Raconteurs, 4
- Cut Copy, 3
- Beach Fossils, 3
- Kaiser Chiefs, 3
- Iron Maiden, 3
- Dio, 3
- Ellie Goulding, 2
- Black Sabbath, 2 (seeing them in September)
- Animals as Leaders, 2
My recommendation algorithm is extremely simple but produced surprisingly good results.
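The counting step described above can be sketched as follows. The `similar_to` mapping stands in for whatever the similar-artists lookup returns, and the function name and the placeholder artist names in the usage example are made up for illustration.

```python
from collections import Counter


def recommend(seen, similar_to, top_n=10):
    """Rank unseen artists by how often they appear as a similar artist.

    `seen` is a list of artists already seen live; `similar_to` maps each
    artist to a list of similar artists.
    """
    seen_set = set(seen)
    counts = Counter()
    for artist in seen:
        # dict.fromkeys de-duplicates while keeping order, so an artist
        # listed twice for the same source is only counted once.
        for candidate in dict.fromkeys(similar_to.get(artist, [])):
            if candidate not in seen_set:
                counts[candidate] += 1
    return counts.most_common(top_n)
```

For example, with placeholder data:

```python
seen = ["A", "B", "C"]
similar_to = {"A": ["X", "Y", "B"], "B": ["X", "A"], "C": ["X", "Y"]}
print(recommend(seen, similar_to))
```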
The Echo Nest API is incredible.
P.S. I tried using pyechonest but there didn’t seem to be a way to retrieve artist genre information which is why I decided to use their API directly.
(ouvert means open in French)
Last week I contributed a small feature to clize. As before, I discovered this project on the GitHub page for trending Python repositories. The author had a list of open issues for the repository which made it easy to see what needed to be worked on and I picked one that caught my fancy.
Once I knew what needed to be done I had to figure out how to implement it. The first thing I did was see how the existing code handled unknown command line arguments. “Oh look, it printed ‘Unknown option’! That seems like a good place to start.” I ran an ack for the phrase “Unknown option” and found the relevant source code files. The next step was to figure out where the parsed arguments lived inside the program. A well placed print statement quickly solved that mystery.
With this knowledge in hand I began writing some code. The basic algorithm was pretty simple: if the user enters a command line argument that is not one of the known arguments, compute the Levenshtein distance between what the user entered and each known argument, and suggest the one with the lowest distance. This was more or less the initial pull request that I submitted. The author provided excellent feedback on my code and after a couple of iterations my commit was merged into the master branch.
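As a standalone sketch of that idea (this is not the code that was merged into clize, and the function names are made up for illustration):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def suggest(unknown, known):
    """Return the known argument closest to `unknown`, or None if there are none."""
    return min(known, key=lambda name: levenshtein(unknown, name), default=None)
```

A real implementation would probably also want a cutoff, so that wildly different input does not produce a suggestion at all.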
Things I learnt along the way –
We’ve been using Fabric to set up and build Gelato on AWS. Each time I use it I’m left with this sense of awe at how amazing it is. Going from having to manually SSH into each machine to do anything to having Fabric build your code on 15 machines in parallel is indescribable.
One thing that we were having trouble with was having Fabric run a task on specific host roles in parallel. To run tasks in parallel you use the @parallel decorator, while to run tasks on hosts by roles you use the @roles decorator. If you want to run tasks in parallel on specific hosts you have to be careful of the order in which you apply these decorators. Here is what worked for us:
P.S. make sure you set the correct Bubble Size if you have a large number of hosts!
Facebook held its Camp Hackathon at UIUC yesterday, and it was another great experience. Sam and I built a system to remotely control your iTunes music library via text messaging, a web interface and voice. Technologies used were Python, PHP, Twilio and NodeJS (NowJS and Express). I’m a huge fan of NowJS. It’s an excellent product that makes realtime communication in NodeJS so much simpler, and it opens up a world of possibilities in terms of applications that can be built. It’s the fourth time I’ve used this library, and each time its elegance blows me away. The same holds true for Express: robust and easy to use (though we didn’t use it heavily for this project). All the text messaging stuff was handled via Twilio, another service that I am a huge fan of.
Heading out to New York tomorrow for the Yahoo Open Hack All Stars 2011. I’m super excited for this event and can’t wait for the competition to begin!
For the curious, the winners were: 1st prize went to the capture the flag game, 2nd prize was taken by Linked Out (an application that used data available on LinkedIn to predict who will change jobs) and 3rd prize was grabbed by the Django-streaming-file-sharing application (they called themselves Beamit).
#inday, I shall miss you.
A week or two ago I started learning Django, and wrote my first app, a simple contacts book thingy. Right from the get go, I was amazed by how everything felt so natural in Django, at least to me. The MTV pattern seemed really intuitive and I had no problem diving right in and creating an app. The Django documentation is extremely well written and answered all the questions I had while coding. Even though I was creating my first Django app, I had no problems in incorporating generic views, model forms, pagination etc. In order to have database migration support (yes, I kept changing the schema even for a simple app :p) I installed South and everything was smooth sailing from there. The last thing I want to add to the app is search capabilities, and for this I’ve decided to use the Haystack application. I’m pretty sure this is overkill for such a simple app, but I wanted to try out this application and hence decided to throw it in.
After working with Django, I’ve decided to go back and give Rails another go. I’ve almost completely forgotten all the concepts from Rails, and I would love to refresh my memory.
Everyone knows and loves Python’s import statement. It is used to import external modules into the current module/script we are writing. Here are a few simple examples illustrating how it can be used:
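A few common forms, using standard library modules picked purely for illustration:

```python
# Import a whole module and refer to its contents through the module name.
import math

# Import a submodule under a shorter alias.
import os.path as osp

# Import a single name directly into the current namespace.
from collections import Counter

print(math.sqrt(16))
print(osp.join("blog", "posts"))
print(Counter("hello")["l"])
```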
Internally, an import statement makes a call to the built-in __import__ function. However, there are 2 cases where using import would not work and the only way out is to call __import__ directly.
The most common case where we would have to call __import__ directly would be when we want to import specific modules at run-time based on user input. Here is how we might do that:
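A minimal sketch, with a hard-coded string standing in for the user input. The helper name `load_module` is made up for illustration.

```python
def load_module(name):
    """Import a module whose name is only known at run-time."""
    # For a dotted name such as "os.path", __import__ returns the
    # top-level package ("os"), not the submodule.
    return __import__(name)


module_name = "json"  # stand-in for something read from the user
module = load_module(module_name)
print(module.dumps({"imported": module_name}))
```

The Python documentation recommends importlib.import_module over calling __import__ directly. It takes the same kind of string and, for dotted names, returns the submodule itself.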
Yes, I know that we didn’t really have to import at run-time in the previous example. It was just meant to be a simple example 🙂
Another case where we might have to call __import__ is when, for some reason, a module we want to import, or one of its parent modules/folders, has a name with characters that Python does not allow in identifiers. For example, something like the snippet below would not work:
Notice how the hyphen is not allowed in the module name. Here is how we might work around that:
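A self-contained sketch of the workaround. The file name `my-module.py` and its contents are made up for illustration, and the module is written to a temporary directory only so that the example can run on its own.

```python
import sys
import tempfile
from pathlib import Path

# Create a throwaway module whose file name contains a hyphen.
workdir = tempfile.mkdtemp()
Path(workdir, "my-module.py").write_text("GREETING = 'hello'\n")
sys.path.insert(0, workdir)

# `import my-module` is a SyntaxError, because Python parses the hyphen
# as a minus sign. Passing the name as a string avoids the parser.
my_module = __import__("my-module")
print(my_module.GREETING)
```

importlib.import_module("my-module") works the same way here.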