Using ScanSnap Manager to OCR non-ScanSnap PDFs

I had some PDFs that I wanted to run optical character recognition (OCR) on. I have a Fujitsu ScanSnap and wanted to use the ScanSnap Manager software to do this. The software checks supplied PDFs and will only process those that originated from ScanSnap hardware. I wanted to circumvent this, and it ended up being easy.

PDFs created with a ScanSnap carry a “Creator” metadata tag whose value is the model string. You can use ExifTool by Phil Harvey to print and modify this metadata. For example:

$ exiftool -creator ~/example.pdf
Creator                         : ScanSnap Manager #iX500

The file example.pdf has the correct tag/value pair and will be processed. The next file, covfefe.pdf, does not. You can add or modify the tag on a PDF that did not originate from a ScanSnap:

$ exiftool -creator="ScanSnap Manager #iX500" ~/covfefe.pdf 
    1 image files updated
$ exiftool -creator ~/covfefe.pdf
Creator                         : ScanSnap Manager #iX500

Voila! The ScanSnap Manager software will now process the PDF. You can certainly use free OCR software, but I didn’t find any of it to be quite as slick. Plus this was more fun. 🙂
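
If several PDFs need the tag, the same exiftool call can be scripted. A minimal sketch in Python, assuming exiftool is on the PATH; the helper names and the folder in the usage comment are made up for illustration, and only the -creator=... argument comes from the commands above:

```python
import subprocess
from pathlib import Path

# Model string copied from the exiftool output above.
CREATOR = "ScanSnap Manager #iX500"

def build_command(pdf, creator=CREATOR):
    """Return the argv list for the call shown above: exiftool -creator=VALUE FILE."""
    return ["exiftool", f"-creator={creator}", str(pdf)]

def tag_pdfs(folder, creator=CREATOR):
    """Run exiftool on every *.pdf directly inside folder (hypothetical helper)."""
    tagged = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if exiftool exits non-zero
        subprocess.run(build_command(pdf, creator), check=True)
        tagged.append(pdf)
    return tagged

# Usage (hypothetical folder):
# tag_pdfs(Path.home() / "to-ocr")
```

Passing the arguments as a list sidesteps shell quoting of the space and the # in the model string.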


Welcome to the World of Tomorrow! (Again)

I wanted to follow up on my previous post about scraping Futurama episode ratings from IMDb. I used tools I was familiar with to get the job done, but someone told me I really should check out BeautifulSoup and do it all in Python. It ended up working great, and I’ll continue to use BeautifulSoup for web scraping in the future. This is what I did in the IPython interpreter:

import re, requests
import numpy as np
import pandas as pd
import scipy.stats as stats
from bs4 import BeautifulSoup

# create soup object

r = requests.get("")
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly so bs4 does not have to guess one

# scrape scores

scores = []
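for score in soup.find_all("td", {"align": "right", "bgcolor": "#eeeeee"}):
    # assumed loop body: keep each cell's text as a float
    scores.append(float(score.get_text(strip=True)))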

# scrape episodes

titles = []
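for title in soup.find_all("a", {"href": re.compile(r"/title/tt")}):
    if len(title["href"]) == 17:  # e.g. "/title/tt" + 7 digits + "/"
        # assumed body: keep the link text as the episode title
        titles.append(title.get_text(strip=True))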

cols = ["IMDb Rating"]

# build dataframe

frame = pd.DataFrame(scores, index=titles, columns=cols)

# maths with numpy


# maths with pandas

s = pd.Series(scores)


# test for normal distribution

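
The code for the last three comments isn't shown above. A minimal sketch of what such a step could look like, using hypothetical ratings rather than the scraped ones. Which normality test was actually run is not stated, so Shapiro-Wilk here is an assumption:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Hypothetical ratings standing in for the scraped list; not real IMDb data.
scores = [7.2, 7.5, 7.8, 7.9, 8.0, 8.1, 8.4, 9.1]

# maths with numpy: np.std defaults to the population form (ddof=0)
np_mean = np.mean(scores)
np_std = np.std(scores)

# maths with pandas: Series.std defaults to the sample form (ddof=1),
# so it will not match np.std unless ddof is set explicitly
s = pd.Series(scores)
pd_mean = s.mean()
pd_std_sample = s.std()
pd_std_population = s.std(ddof=0)

# test for normal distribution (assumed choice: Shapiro-Wilk)
w_stat, p_value = stats.shapiro(scores)

print(f"numpy  mean={np_mean:.3f} std={np_std:.3f}")
print(f"pandas mean={pd_mean:.3f} std={pd_std_sample:.3f} (ddof=1)")
print(f"shapiro W={w_stat:.3f} p={p_value:.3f}")
```

The differing ddof defaults are worth keeping in mind when comparing a numpy standard deviation with a pandas one.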

The ISTA 350: Programming for Informatics Applications course at the University of Arizona helped me a lot after my initial post. Additionally, the book Web Scraping with Python by Ryan Mitchell is one I’d recommend keeping handy.


Scraping IMDb Futurama Episode User Ratings

Good news, everyone!

This entry is effectively a two-fer. It will show how I used some basic tools and a pinch of Python with numpy to get some of the data I needed for a class project. I took a look at my favorite television show, Futurama. I used the average Internet Movie Database (IMDb) user rating for each episode to see how many standard deviations away from the mean the top four episodes are. The ultimate goal of the project was different but this was a good way to use data to support facts.

Quick. Dirty. Scraping.

IMDb has a page with every Futurama episode and its average user rating. Note that the direct-to-video movies are excluded (rightfully) from this list.

Let’s scrape that data!

$ curl -v 2>&1 | egrep -i 'users rated this' | cut -d' ' -f5 | cut -d'/' -f 1 > /tmp/scores.txt
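
Read left to right, the pipeline keeps the lines containing 'users rated this', takes the fifth space-separated field, and keeps what comes before the '/'. A rough Python equivalent, as a sketch; the sample lines are invented to show the shape of the text, not copied from IMDb, and the function name is hypothetical:

```python
def extract_scores(lines):
    """Roughly mirror: egrep -i 'users rated this' | cut -d' ' -f5 | cut -d'/' -f1."""
    scores = []
    for line in lines:
        if "users rated this" not in line.lower():  # egrep -i
            continue
        fields = line.rstrip("\n").split(" ")       # cut -d' '
        if len(fields) < 5:
            continue                                # no fifth field to take
        scores.append(fields[4].split("/")[0])      # -f5, then cut -d'/' -f1
    return scores

# Invented sample input (an assumption about the line layout).
sample = [
    "1234 users rated this 8.5/10 - click stars to rate",
    "<td>unrelated markup</td>",
    "987 Users Rated This 7.9/10 - click stars to rate",
]
print(extract_scores(sample))
```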

Oddly enough the OSCP labs had me scrape this way frequently. I didn’t have the time to push the labs hard or take the practical but some information stuck. I’ll hopefully get back to that OSCP soon. 🙂

The above gives us a file (/tmp/scores.txt) with each Futurama episode user score on a new line. All I really want is the mean and standard deviation anyway, and that’s easy to do with the Python interpreter.

>>> import numpy as np
>>> scores = []
>>> for line in open('/tmp/scores.txt', 'r'):
...   scores.append(float(line.strip()))
>>> scores = np.array(scores)
>>> np.mean(scores)
>>> np.std(scores)

The mean is ~7.88 and the standard deviation is ~0.59 — I used this information to compare the top four episodes. (The highest rated episode is 2.745 standard deviations away from the mean!)
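
The “standard deviations away from the mean” figure is a z-score, z = (x - mean) / std. A small sketch with numpy, using made-up ratings (not the IMDb data) and a hypothetical helper name:

```python
import numpy as np

def z_scores(values):
    """Return how many standard deviations each value is from the mean."""
    arr = np.asarray(values, dtype=float)
    # arr.std() defaults to the population standard deviation (ddof=0),
    # the same default as np.std(scores) in the session above.
    return (arr - arr.mean()) / arr.std()

# Made-up ratings for illustration only.
ratings = [7.0, 7.5, 8.0, 8.5, 9.5]
z = z_scores(ratings)
top = int(np.argmax(z))
print(f"top rating {ratings[top]} is {z[top]:.3f} standard deviations above the mean")
```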

For those interested here’s a screenshot of the above in action:

[Image: Screen Shot 2017-03-28 at 7.30.45 PM]


The Amazon Echo Dot has a script

The script’s existence is not proof that it is used but the expanded speculation around it is a fun exercise. The only fact I can give about the /bin/ script is that it exists. (For now.)

I can say something different for another script on the system. I audited against it with nmap. I enabled features on the Dot (such as using Spotify to open TCP 4070) to test the script’s execution/logic. The ability to audit the script and observe behavior is crucial. The data supports that the script is used. (More would be better!) The images below are part of that audit: TCP 4070 being open after enabling Spotify, and then a quick banner grab.
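
A banner grab like the one described amounts to connecting to the open port and reading whatever the service sends first. A minimal Python sketch, not the nmap-based audit itself; the function name and the address in the usage comment are hypothetical:

```python
import socket

def grab_banner(host, port, timeout=3.0, max_bytes=1024):
    """Connect to host:port and return the first bytes the service sends (may be empty)."""
    # create_connection raises OSError (e.g. ConnectionRefusedError) if the port is closed
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        try:
            return sock.recv(max_bytes)
        except socket.timeout:
            # Port is open, but the service waits for the client to speak first.
            return b""

# Usage (hypothetical address for a Dot on the local network):
# print(grab_banner("192.168.1.50", 4070))
```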

Unfortunately I’m unable to do the same level of observing with the script. I can’t knowingly trigger it, I don’t have a way to image an Amazon Echo Dot, and I don’t have a way to remotely connect and monitor its activity. The script appears to create new memdump logs in the /data/system/dropbox directory. I would love to know the fate of these logs and anything else in the /data/system/dropbox directory.

If you want a copy of the script you can download the system at Amazon Echo Update 567200820 (And where to download it!). Discovery of this script and other fun within the system happened late last year/early 2017. It’s been fun. 🙂

It’s worth noting that Ars Technica recently ran the story of Amazon refusing to hand over data on whether Alexa overheard a murder, which puts in perspective the information one could (possibly) get from Amazon about an Echo Dot user if motivated to do so. It’s a continuation of the involvement of an Echo in a murder case from 2016.

I wish I had more time to work on this system. Unfortunately taking 19 credits this semester has proven to be the challenge I was expecting. It’s something I still give attention to, but not at the level of intensity I would like. Hopefully this summer I can focus on it quite a bit more.


Amazon Echo Update 567200820 (And where to download it!)

I received an email from Ronald Brakeboer about an update to the Amazon Echo Dot system. He noticed that his unit updated to 567200820. I wasn’t tracking this and unfortunately didn’t have a copy of the update. It was recent so I decided to pick up another unit and hope that it still needed the update. I could then do what I did previously to download the update for analysis. (I could also unplug the unit and keep it in storage to capture future updates as well.)

Sure enough, the plan worked. I’ve posted a screenshot of the capture along with the download URL and some checksums. I hope it helps!


Download & Checksums

This URL once again works with wget. You don’t have to spoof user-agent strings or anything of that nature. 🙂

SHA1(update-kindle-full_biscuit- 824b94a9664cede9eb2f49ab312fcf66857405ca
SHA512(update-kindle-full_biscuit- a771c05054d33b3e53df4c2a63bdd9a9eda7fbadc11217cb8013bbfa712513f239228f093db72960f5577ed983949dcbf65188850052aafd9776c56bccca6d0a
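
To check a downloaded copy against the checksums above, the file can be hashed locally and compared. A minimal sketch using Python's hashlib; the function names and the local filename in the usage comment are hypothetical:

```python
import hashlib

def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so a large update image is not read into memory at once."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches(path, expected_hex, algorithm="sha1"):
    """Compare a file's digest with a published hex string, ignoring case and whitespace."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()

# Usage (hypothetical local filename):
# matches("update.bin", "824b94a9664cede9eb2f49ab312fcf66857405ca", "sha1")
```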

Additional Reading

If you haven’t yet read my Amazon Echo Dot System Image post then check it out. It goes into greater detail as to what I did. Always feel free to email me of course. Thanks!


Vintage Computer Ads

This past weekend I watched the final season of Mad Men. It inspired me to check out some vintage computer advertisements. One site in particular has a nice collection:

A few stand out but I’ll leave it to you to have a favorite. There’s beautiful imagery and a style that seems to have been lost. RiotPSA has to find its identity and I think these ads provide great inspiration. Hopefully I can incorporate them (or at least the spirit of them) into future awareness campaigns and announcements.
