I wanted to follow up on my previous post about scraping Futurama episode ratings from IMDb. I used tools I was familiar with to get the job done, but someone told me I really should check out BeautifulSoup and do it all in Python. It ended up working great, and I’ll continue to use BeautifulSoup for web scraping in the future. This is what I did in the IPython interpreter:

    import re

    import numpy as np
    import pandas as pd
    import requests
    import scipy.stats as stats
    from bs4 import BeautifulSoup

    # create soup object (name the parser explicitly to avoid a bs4 warning)
    r = requests.get("http://www.imdb.com/title/tt0149460/eprate?ref_=ttep_sa_2")
    soup = BeautifulSoup(r.content, "html.parser")

    # scrape scores
    scores = []
    for score in soup.find_all("td", {"align": "right", "bgcolor": "#eeeeee"}):
        scores.append(float(score.get_text().strip()))

    # scrape episodes (episode links look like /title/tt0000000/, 17 characters)
    titles = []
    for title in soup.find_all("a", {"href": re.compile(r"/title/tt")}):
        if len(title["href"]) == 17:
            titles.append(title.get_text().strip())

    # build dataframe
    cols = ["IMDb Rating"]
    frame = pd.DataFrame(scores, index=titles, columns=cols)

    # maths with numpy
    np.std(scores)   # population standard deviation (ddof=0)
    np.mean(scores)

    # maths with pandas
    s = pd.Series(scores)
    s.std()          # sample standard deviation (ddof=1)
    s.mean()
    s.describe()
    frame.describe()

    # test for normal distribution
    stats.normaltest(scores)
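One thing to watch for in the maths above: np.std(scores) and s.std() will not return the same number. NumPy divides by n by default (ddof=0) and pandas divides by n - 1 (ddof=1). Here is a minimal sketch of that difference. To keep it runnable without any third-party packages, it swaps in the standard library's statistics module for NumPy and pandas, and the ratings in it are made up for illustration, not scraped from IMDb.

```python
import math
import statistics

# Hypothetical ratings, made up for illustration (not real IMDb data).
scores = [7.5, 8.0, 8.5, 9.0]

# Population standard deviation: divides by n, as np.std(scores) does by default.
population_sd = statistics.pstdev(scores)

# Sample standard deviation: divides by n - 1, as pd.Series(scores).std() does by default.
sample_sd = statistics.stdev(scores)

# The two values differ only by the factor sqrt(n / (n - 1)).
n = len(scores)
ratio = math.sqrt(n / (n - 1))

print(population_sd, sample_sd, population_sd * ratio)
```

If you want NumPy to match pandas, np.std(scores, ddof=1) should do it. With a few dozen or more episodes the gap is small, but it is still worth knowing which one you are quoting.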

The ISTA 350: Programming for Informatics Applications course at the University of Arizona helped me a lot after my initial post. Additionally, the book Web Scraping with Python by Ryan Mitchell is one I’d recommend keeping handy.