Welcome to the World of Tomorrow! (Again)

I wanted to follow-up on my previous post about scraping Futurama episode ratings from IMDb. I used tools I was familiar with to get the job done but I was told by someone that I really should check out BeautifulSoup to do it all in Python. It ended up working great and I’ll continue to use BeautifulSoup for web scraping in the future. This is what I did in the IPython interpreter:

import re, requests
import numpy as np
import pandas as pd
import scipy.stats as stats
from bs4 import BeautifulSoup

# create soup object

r = requests.get("http://www.imdb.com/title/tt0149460/eprate?ref_=ttep_sa_2")
soup = BeautifulSoup(r.content)

# scrape scores

scores = []
for score in soup.find_all("td", {"align": "right", "bgcolor": "#eeeeee"}):
    scores.append(float(score.get_text().strip()))

# scrape episodes

titles = []
for title in soup.find_all("a", {"href": re.compile("\/title\/tt")}):
    if len(title["href"]) == 17:
        titles.append(title.get_text().strip())

cols = ["IMDb Rating"]

# build dataframe

frame = pd.DataFrame(scores, titles, cols)

# maths with numpy

np.std(scores)
np.mean(scores)

# maths with pandas

s = pd.Series(scores)
s.std()
s.mean()
s.describe()

pd.Series.describe(frame)

# test for normal distribution

stats.normaltest(scores)

The ISTA 350: Programming for Informatics Applications course at the University of Arizona helped me a lot after my initial post. Additionally, the book Web Scraping with Python by Ryan Mitchell is one I’d recommend keeping handy.

Welcome to the World of Tomorrow! (Again)