Large Comma Separated Value (CSV) Files
I found myself collecting a large amount of data stored as a comma separated value (CSV) files. Filtering at the time of collection is ideal but not always possible. I began to think about how I work with large CSVs and if I was going about it the best way. (I’ll continue to use large as a way to describe files that are measured in hundreds of megabytes and above.) Speadsheet applications choke with large amounts of data even when it’s organized. Heck, most text editors do as well. It’s not because these programs suck. I’m asking them to do something that they’re not good at or designed for. It’s clear that CSVs of this size are a weakness in those programs. If we look for systems where this weakness is a strength then we can develop a better strategy for querying and pulling meaning out of our data.
It’s no mistake that I used the word query in the pervious sentence. SQLite databases are the solution I picked for my situation. You could do it in MySQL just fine and dandy as well as many other database solutions. I chose SQLite for two reasons: 1. I don’t need multi-user access, and 2. It comes built into OS X and the Linux distributions that I use frequently. Nice and easy. I’ve exported data from SQLite databases to CSVs before and figured that importing would be pretty simple.
In fact, the process of importing a CSV into an SQLite database is Section 8 of the Command Line Shell for SQLite documentation. I also provided an example below.
In this example I am importing an existing CSV (playerstats.csv) into a new table (players) of a new SQLite database. (overwatch.db) Neither the table or the database exist prior to this example.
$ ls -l playerstats.csv $ sqlite3 overwatch.db sqlite> .mode csv sqlite> .import playerstats.csv players sqlite> .exit $ sqlite3 overwatch.db ".schema players" $ sqlite3 overwatch.db "SELECT ROWID, * FROM players ORDER BY ROWID ASC LIMIT 10;"
This is a pretty easy thing to do. It allows me to quickly query the data in a meaningful way when compared to using a spreadsheet application with the original CSV.
I’m beginning to use and learn the R programming language for my statistics class at the University of Arizona. If I find a way to use data from SQLite databases I’ll be sure to post a quick note about it here. It would be interesting to see if there are any performance differences between using the CSV or SQLite database as the source.