I wanted to follow up on my previous post about scraping Futurama episode ratings from IMDb. I used tools I was familiar with to get the job done, but someone told me I really should check out BeautifulSoup and do it all in Python. It ended up working great, and I’ll continue to use BeautifulSoup for web scraping in the future. This is what I did in the IPython interpreter:

import re, requests
import numpy as np
import pandas as pd
import scipy.stats as stats
from bs4 import BeautifulSoup
# create soup object
r = requests.get("http://www.imdb.com/title/tt0149460/eprate?ref_=ttep_sa_2")
soup = BeautifulSoup(r.content, "html.parser")
# scrape scores
scores = []
for score in soup.find_all("td", {"align": "right", "bgcolor": "#eeeeee"}):
    scores.append(float(score.get_text().strip()))
# scrape episodes
titles = []
for title in soup.find_all("a", {"href": re.compile(r"/title/tt")}):
    # episode links look like "/title/tt0149460/", which is 17 characters long
    if len(title["href"]) == 17:
        titles.append(title.get_text().strip())
# build dataframe (titles as the index, one ratings column)
cols = ["IMDb Rating"]
frame = pd.DataFrame(scores, index=titles, columns=cols)
# maths with numpy (np.std defaults to the population standard deviation, ddof=0)
np.std(scores)
np.mean(scores)
# maths with pandas (Series.std defaults to the sample standard deviation, ddof=1)
s = pd.Series(scores)
s.std()
s.mean()
s.describe()
frame.describe()
# test for normal distribution
stats.normaltest(scores)
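
Since the IMDb page may not look the same today, here is a minimal sketch of the same approach run against a small inline HTML snippet instead of the live site. The markup in `SAMPLE_HTML`, the episode names, and the helper `scrape_ratings` are all made up for illustration; only the selectors come from the session above. It also shows why `np.std` and `Series.std` give different numbers for the same scores.

```python
import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup, invented to resemble what the selectors above expect.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="/title/tt0000001/">Episode One</a></td>
    <td align="right" bgcolor="#eeeeee">8.0</td>
  </tr>
  <tr>
    <td><a href="/title/tt0000002/">Episode Two</a></td>
    <td align="right" bgcolor="#eeeeee">9.0</td>
  </tr>
  <tr>
    <td><a href="/title/tt0000003/reviews">Not an episode link</a></td>
    <td align="right">ignored</td>
  </tr>
</table>
"""


def scrape_ratings(html):
    """Return a DataFrame of ratings indexed by episode title (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Rating cells: right-aligned <td> elements with the grey background.
    scores = [
        float(td.get_text().strip())
        for td in soup.find_all("td", {"align": "right", "bgcolor": "#eeeeee"})
    ]

    # Episode links: hrefs of the form "/title/tt0000001/" (17 characters).
    titles = [
        a.get_text().strip()
        for a in soup.find_all("a", {"href": re.compile(r"/title/tt")})
        if len(a["href"]) == 17
    ]

    # Fail loudly if the page layout changed and the two lists no longer line up.
    if len(scores) != len(titles):
        raise ValueError(f"found {len(scores)} scores but {len(titles)} titles")

    return pd.DataFrame({"IMDb Rating": scores}, index=titles)


frame = scrape_ratings(SAMPLE_HTML)
ratings = frame["IMDb Rating"]

population_std = np.std(ratings.to_numpy())  # ddof=0, divides by n
sample_std = ratings.std()                   # ddof=1, divides by n - 1
```

The length check is the main design choice here: if the selectors stop matching, it raises an error instead of silently pairing ratings with the wrong titles. To make numpy agree with pandas, pass `ddof=1` to `np.std`.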

The ISTA 350: Programming for Informatics Applications course at the University of Arizona helped me a lot after my initial post. Additionally, the book Web Scraping with Python by Ryan Mitchell is one I’d recommend keeping handy.
