Welcome to the World of Tomorrow! (Again)

I wanted to follow up on my previous post about scraping Futurama episode ratings from IMDb. I used tools I was familiar with to get the job done, but someone told me I really should check out BeautifulSoup and do it all in Python. It worked great, and I’ll continue to use BeautifulSoup for web scraping in the future. This is what I did in the IPython interpreter:

import re, requests
import numpy as np
import pandas as pd
import scipy.stats as stats
from bs4 import BeautifulSoup

# create soup object

r = requests.get("http://www.imdb.com/title/tt0149460/eprate?ref_=ttep_sa_2")
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning

# scrape scores

scores = []
for score in soup.find_all("td", {"align": "right", "bgcolor": "#eeeeee"}):
    scores.append(float(score.get_text().strip()))

# scrape episodes

titles = []
# episode links look like "/title/tt1234567/", which is exactly 17 characters
for title in soup.find_all("a", {"href": re.compile(r"/title/tt")}):
    if len(title["href"]) == 17:
        titles.append(title.get_text().strip())

# build dataframe

cols = ["IMDb Rating"]
frame = pd.DataFrame(scores, index=titles, columns=cols)

# maths with numpy

np.std(scores)
np.mean(scores)

# maths with pandas

s = pd.Series(scores)
s.std()
s.mean()
s.describe()

frame.describe()

# test for normal distribution

stats.normaltest(scores)
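
One thing worth knowing when comparing the numpy and pandas numbers: np.std() divides by n by default (population standard deviation), while Series.std() divides by n - 1 (sample standard deviation), so the two results differ slightly. Here’s a minimal sketch of that difference using only the standard library’s statistics module. The ratings list is made up for illustration and is not IMDb data.

```python
import statistics

# Made-up ratings for illustration only; these are NOT real IMDb scores.
ratings = [7.0, 7.5, 8.0, 8.5, 9.0]

# Population standard deviation: divides by n (what np.std does by default).
population_std = statistics.pstdev(ratings)

# Sample standard deviation: divides by n - 1 (what pandas Series.std does by default).
sample_std = statistics.stdev(ratings)

print("population:", population_std)
print("sample:    ", sample_std)
```

Passing ddof=1 to np.std(), or ddof=0 to Series.std(), makes the two libraries agree.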

The ISTA 350: Programming for Informatics Applications course at the University of Arizona helped me a lot after my initial post. Additionally, the book Web Scraping with Python by Ryan Mitchell is one I’d recommend keeping handy.

Scraping IMDb Futurama Episode User Ratings

Good news, everyone!

This entry is effectively a two-fer. It shows how I used some basic tools and a pinch of Python with NumPy to get some of the data I needed for a class project. I took a look at my favorite television show, Futurama, and used the average Internet Movie Database (IMDb) user rating for each episode to see how many standard deviations away from the mean the top four episodes are. The ultimate goal of the project was different, but this was a good way to back up claims with data.

Quick. Dirty. Scraping.

IMDb has a page with every Futurama episode and its average user rating. The URL is http://www.imdb.com/title/tt0149460/eprate?ref_=ttep_sa_2. Note that the direct-to-video movies are excluded (rightfully) from this list.

Let’s scrape that data!

$ curl -s 'http://www.imdb.com/title/tt0149460/eprate?ref_=ttep_sa_2' | grep -i 'users rated this' | cut -d' ' -f5 | cut -d'/' -f1 > /tmp/scores.txt
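
For anyone who doesn’t read shell pipelines every day, here’s a rough Python sketch of what the filtering stages do: keep only the lines containing "users rated this", then pull out the number in front of "/10". The sample lines are hypothetical stand-ins (not IMDb’s real markup), extract_scores is an illustrative name, and the regular expression is a substitute for the two cut calls rather than a literal translation of them.

```python
import re

# Hypothetical sample lines for illustration; NOT IMDb's actual markup.
SAMPLE_HTML = """\
<td><a href="/title/tt0000001/">Example Episode A</a></td>
<td title="120 users rated this 8.1/10">8.1</td>
<td><a href="/title/tt0000002/">Example Episode B</a></td>
<td title="95 users rated this 7.4/10">7.4</td>
"""

# Matches the rating that follows "users rated this", e.g. "8.1" in "8.1/10".
RATING = re.compile(r"users rated this\s+(\d+(?:\.\d+)?)/10", re.IGNORECASE)


def extract_scores(text):
    """Return the rating from every line that mentions 'users rated this'."""
    scores = []
    for line in text.splitlines():
        match = RATING.search(line)
        if match:
            scores.append(float(match.group(1)))
    return scores


scores = extract_scores(SAMPLE_HTML)
print(scores)
```

The regex is a little more forgiving than counting space-separated fields, since it doesn’t care how many spaces come before the rating.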

Oddly enough, the OSCP labs had me scrape this way frequently. I didn’t have the time to push the labs hard or take the practical, but some information stuck. I’ll hopefully get back to that OSCP soon. 🙂

The curl pipeline above gives us a file (/tmp/scores.txt) with each Futurama episode’s user score on its own line. All I really want is the mean and standard deviation anyway, and those are easy to get with the Python interpreter.

>>> import numpy as np
>>> scores = []
>>> for line in open('/tmp/scores.txt', 'r'):
...   scores.append(float(line.strip()))
... 
>>> scores = np.array(scores)
>>> np.mean(scores)
7.8798387096774185
>>> np.std(scores)
0.58612723471593964

The mean is ~7.88 and the standard deviation is ~0.59. I used this information to compare the top four episodes. (The highest-rated episode is 2.745 standard deviations above the mean!)
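
The "standard deviations from the mean" figure is just a z-score: (rating - mean) / standard deviation. Here’s a minimal sketch using only the standard library and a made-up list of ratings rather than the real scores. The z_score helper is a hypothetical name, and it uses the population standard deviation to match np.std()’s default.

```python
import statistics


def z_score(value, values):
    """Return how many population standard deviations `value` lies from the mean of `values`."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population std, same default as np.std
    return (value - mean) / std


# Made-up ratings for illustration only; these are NOT real IMDb scores.
ratings = [7.0, 7.5, 8.0, 8.5, 9.0]

top = max(ratings)
print(f"{top} is {z_score(top, ratings):.3f} standard deviations above the mean")
```

Running the same function over the real scores array and the top four ratings would give the comparison described above.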

For those interested, here’s a screenshot of the above in action:

[Image: Screen Shot 2017-03-28 at 7.30.45 PM]
