4 September 2016

Research Data Thing 21/23 - Tools of the (dirty data) trade

Thing 21 is about dirty data and some strategies and tools for fixing data issues.

Having been involved in implementing data systems at work which involved data migration and establishing feeds from other systems with transformations eg. building an organisation code structure in a new system based on partial strings from a payroll system; sourcing person records from two separate systems and deduplicating (people who were both staff and students), the pitfalls of dirty data is quite familiar. The problems soon started appearing during testing phase, particularly as we looked at report generation and business processes that relied on choosing a specific record.

One of the difficulties was individuals that had name variations between the two systems but were in fact the same person. Sometimes the only way these were found was through someone knowing that staff member had changed their name, or used a diminutive in their student record. This led to changing some business processes to help identify persons between the two systems.

This thing talks about using Google Spreadsheets and a scraping extension to gather data tabular data from websites. In the past, when websites used
tags in the html it was relatively easy to import tables directly into Excel using the method in this video. I was hoping to try it again, but could not find a suitable table to play with. (They mostly seem to use these for ads!, and alternative methods for tabular data)

The feature to do this is available in Excel 2016 in the data ribbon.

This is my first time at trying Google spreadsheets for scraping data. So here is a table from the Wikipedia page on Australia at the Olympics.

Medals by Summer Games

In the wikipedia page the column "Totals" has bold text. In the data scraped the wiki encoding for bold has been captured as asterisks surrounding each value - a prime candidate for some cleansing.

I was going to have a go with openRefine, but it was downloaded on a different computer and I can't be bothered shifting gears to finish this on the other one.