Recently I’d been inspired by watching Moneyball to check out if there was basketball data out there that I could do some cool stuff with. I found play-by-play data for all basketball games from 2006-2010 from this site, and proceeded to load it into a mysql database.
Of course, things that involve loading and formatting data are always harder than they seem at first glance. Heck, 75% of my job seems to be cleaning people’s data and formatting it in new ways (really intellectually challenging, I tell you…).
Anyway, I opened the zip files and found that each game was its own csv file. With ~1050 games per season this came out to a total of 4,723 csv files!
Normally I’d write a load data script in mysql referencing each file, but this obviously would be impractical for so many files. It did, though, give me a reason to learn some ruby scripting. I’ll briefly summarize my steps for loading all these files in Windows.
1. Unpack all the files into a folder
First, I unzipped the 4,723 csv files into a single folder. Then, I used the dir > dir.txt command in Windows to create a listing of all the files in that directory. I’ll use this a little later.
2. Create an array in Ruby that contains the filenames
Next, I cleaned up the filenames in mysql and tacked on double quotes and a comma to each file name: "game1", "game2", etc. Once I exported this to a csv file, I had the contents for the filenames array and could simply copy and paste them in.
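Incidentally, the same array could be built in ruby itself instead of in mysql. Here’s a minimal sketch, assuming the dir listing puts the filename as the last token on each line; the helper name and the sample listing lines are made up for illustration:

```ruby
# Hypothetical helper: pull the csv basenames out of a "dir" listing.
def csv_basenames(listing_lines)
  listing_lines.map(&:strip)
               .select { |line| line.end_with?(".csv") }
               .map    { |line| line.split.last.sub(/\.csv\z/, "") }
end

# A couple of made-up listing lines, plus the summary line dir appends:
names = csv_basenames(["01/05/2010  10:00 AM  12,345 game1.csv",
                       "01/05/2010  10:01 AM  11,987 game2.csv",
                       "               2 File(s)  24,332 bytes"])

# The quoted, comma-separated form described above:
quoted = names.map { |n| %("#{n}") }.join(", ")
```

With `File.readlines("dir.txt")` in place of the sample array, this would produce the whole 4,723-element list without any copy-and-paste.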
3. Create a Ruby Method Using a For-Loop that loads each csv into my data table
Now comes the fun part. I wrote a for-loop that takes in the array of filenames, where each filename serves as the name of the game. For each file, I load its contents into a table containing all the plays for that game. I then insert the game name into its own field along with the season name. Then, this game gets loaded into the main data table. I repeat this for all the games and, voila, there it is after about 10 minutes of mysql data crunching.
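The loop above can be sketched by building the per-game SQL statements in one place. The table names (staging_plays, plays), the C:/bballdata path, and the season label are all hypothetical; the actual execution, shown commented out, would need a database client such as the mysql2 gem:

```ruby
# Sketch of the per-game SQL from step 3; all names here are illustrative.
def load_statements(game, season, dir: "C:/bballdata")
  [
    # load one game's csv into a staging table
    "LOAD DATA LOCAL INFILE '#{dir}/#{game}.csv' INTO TABLE staging_plays " \
    "FIELDS TERMINATED BY ','",
    # tag every row with the game name and season
    "UPDATE staging_plays SET game = '#{game}', season = '#{season}'",
    # append to the main data table, then clear staging for the next file
    "INSERT INTO plays SELECT * FROM staging_plays",
    "TRUNCATE staging_plays"
  ]
end

# The for-loop over the filenames array would then be roughly:
#   client = Mysql2::Client.new(host: "localhost", database: "bball")
#   filenames.each do |game|
#     load_statements(game, "2006-07").each { |sql| client.query(sql) }
#   end
```

Keeping the SQL-building separate from the execution also makes it easy to print the statements and eyeball them before pointing them at the real database.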
I still need to normalize the game data table, so maybe I’ll talk about that a little more later. But for now I’m happy that my ruby script made it pretty easy to load the entire 500 MB data set.
It leaves me thinking there must be an even easier way to extract all the filenames from that directory so I don’t have to specify the filenames in an array. Maybe I’ll look at that in the future when I have another large set of files to combine into a table…
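For the record, one candidate would be ruby’s Dir.glob, which can return every matching filename in a folder directly. A minimal sketch (the folder path is hypothetical):

```ruby
# Sketch: read the game names straight from the unzip folder,
# skipping dir.txt and the hand-pasted array entirely.
def game_names(folder)
  Dir.glob("*.csv", base: folder)
     .map { |f| File.basename(f, ".csv") }
     .sort
end

# e.g. game_names("C:/bballdata") would return ["game1", "game2", ...]
# for a folder holding game1.csv, game2.csv, and so on.
```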