Good news, everyone!
This entry is effectively a two-fer. It shows how I used some basic tools and a pinch of Python with numpy to get some of the data I needed for a class project. I took a look at my favorite television show, Futurama, and used the average Internet Movie Database (IMDb) user rating for each episode to see how many standard deviations away from the mean the top four episodes are. The ultimate goal of the project was different, but this was a good way to back up a claim with data.
Quick. Dirty. Scraping.
IMDb has a page with every Futurama episode and its average user rating. The URL is http://www.imdb.com/title/tt0149460/eprate?ref_=ttep_sa_2. Note that the direct-to-video movies are excluded (rightfully) from this list.
Let’s scrape that data!
$ curl -s 'http://www.imdb.com/title/tt0149460/eprate?ref_=ttep_sa_2' | grep -i 'users rated this' | cut -d' ' -f5 | cut -d'/' -f1 > /tmp/scores.txt
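For anyone who reads Python more easily than shell pipelines, here is a rough sketch of what the grep and cut stages of that command do to the downloaded page. The function name `extract_scores` and the sample lines are made up for illustration; in particular, the sample text is not IMDb's real markup, only a stand-in with the score in the fifth space-separated field.

```python
def extract_scores(page_text):
    """Roughly mimic: grep -i 'users rated this' | cut -d' ' -f5 | cut -d'/' -f1."""
    scores = []
    for line in page_text.splitlines():
        # grep -i: keep only lines containing the phrase, ignoring case
        if "users rated this" not in line.lower():
            continue
        # cut -d' ' -f5: split on every single space and take the fifth field
        fields = line.split(" ")
        if len(fields) < 5:
            continue  # simplification: skip short lines instead of emitting blanks
        # cut -d'/' -f1: keep what comes before the first slash ("9.5/10" -> "9.5")
        scores.append(fields[4].split("/")[0])
    return scores


# Invented sample text (an assumption, not the real page source).
sample_page = "\n".join([
    "Some unrelated header line",
    "1234 users rated this 9.5/10 (invented line)",
    "987 Users Rated This 7.2/10 (invented line)",
])

print(extract_scores(sample_page))
```

The scores come back as strings, which matches what the shell version writes to the file; converting them to floats happens later.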
Oddly enough, the OSCP labs had me scrape this way frequently. I didn’t have the time to push the labs hard or take the practical, but some information stuck. I’ll hopefully get back to the OSCP soon. 🙂
The above gives us a file (/tmp/scores.txt) with each Futurama episode’s user score on its own line. All I really want is the mean and standard deviation anyway, and that’s easy to get with the Python interpreter.
>>> import numpy as np
>>> scores = []
>>> for line in open('/tmp/scores.txt', 'r'):
...     scores.append(float(line.strip()))
...
>>> scores = np.array(scores)
>>> np.mean(scores)
7.8798387096774185
>>> np.std(scores)
0.58612723471593964
The mean is ~7.88 and the standard deviation is ~0.59. I used this information to compare the top four episodes. (The highest-rated episode is 2.745 standard deviations away from the mean!)
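The "standard deviations away from the mean" figure is just (score - mean) / standard deviation. Below is a minimal sketch of that calculation for the top few scores. The helper name `top_z_scores` and the example numbers are made up for illustration and are not the real Futurama ratings; it uses the population standard deviation, which is what np.std returns by default.

```python
import numpy as np


def top_z_scores(scores, n=4):
    """Return how many standard deviations the n highest scores sit above the mean."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    std = scores.std()  # population standard deviation (ddof=0), the np.std default
    top = np.sort(scores)[::-1][:n]  # n highest scores, largest first
    return (top - mean) / std


# Made-up ratings, only to show the call; swap in the array loaded from the file.
example_scores = [7.0, 7.5, 8.0, 8.5, 9.0]
z = top_z_scores(example_scores, n=2)
print(z)
```

With these invented numbers the mean is 8.0, so the two highest scores land about 1.41 and 0.71 standard deviations above it. Passing the real scores array with n=4 gives the values for the top four episodes.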
For those interested here’s a screenshot of the above in action: