Archive for the ‘OpenRefine’ Category

Creating Data from Text…

Sunday, December 22nd, 2013

Creating Data from Text – Regular Expressions in OpenRefine by Tony Hirst.

From the post:

Although data can take many forms, when generating visualisations, running statistical analyses, or simply querying the data so we can have a conversation with it, life is often made much easier by representing the data in a simple tabular form. A typical format would have one row per item and particular columns containing information or values about one specific attribute of the data item. Where column values are text based, rather than numerical items or dates, it can also help if text strings are ‘normalised’, coming from a fixed, controlled vocabulary (such as items selected from a drop down list) or fixed pattern (for example, a UK postcode in its ‘standard’ form with a space separating the two parts of the postcode).

Tables are also quick to spot as data, of course, even if they appear in a web page or PDF document, where we may have to do a little work to get the data as displayed into a table we can actually work with in a spreadsheet or analysis package.

More often than not, however, we come across situations where a data set is effectively encoded into a more rambling piece of text. One of the testbeds I used to use a lot for practising my data skills was Formula One motor sport, and though I’ve largely had a year away from that during 2013, it’s something I hope to return to in 2014. So here’s an example from F1 of recreational data activity that provided a bit of entertainment for me earlier this week. It comes from the VivaF1 blog in the form of a collation of sentences, by Grand Prix, about the penalties issued over the course of each race weekend. (The original data is published via PDF based press releases on the FIA website.)

This is a great step-by-step example of extracting data with regular expressions in OpenRefine.
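To give a flavour of the idea outside OpenRefine, here is a minimal Python sketch. The sentences and the pattern are invented for illustration; they are not taken from the VivaF1 collation or from the post. The sketch assumes every penalty sentence follows one fixed shape, which real text rarely does.

```python
import re

# Hypothetical sentences, made up for this sketch; real penalty notes vary far more.
SENTENCES = [
    "Driver A was given a 5 place grid penalty for a gearbox change.",
    "Driver B was given a 10 place grid penalty for impeding another car.",
    "Driver C set the fastest lap.",
]

# Named groups become the column names of the resulting rows.
PATTERN = re.compile(
    r"^(?P<driver>.+?) was given a (?P<places>\d+) place grid penalty "
    r"for (?P<reason>.+?)\.$"
)


def extract_rows(sentences):
    """Return one dict per sentence that matches PATTERN; skip the rest."""
    rows = []
    for sentence in sentences:
        match = PATTERN.match(sentence)
        if match:
            row = match.groupdict()
            row["places"] = int(row["places"])  # store the number as a number
            rows.append(row)
    return rows


rows = extract_rows(SENTENCES)
for row in rows:
    print(row)
```

Sentences that do not fit the pattern are silently skipped here. In practice you would want to list them, since they show where the pattern needs another pass.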

If you don’t know OpenRefine, you should.

Debating possible or potential semantics is one thing.

Extracting, processing, and discovering the semantics of data is another.

In part because the latter is what most clients are willing to pay for. 😉

PS: Using OpenRefine is on sale now as an eBook for $5.00. A tweet from Packt Publishing says the sale runs through January 3, 2014.

Cleaning Data with OpenRefine

Tuesday, August 20th, 2013

Cleaning Data with OpenRefine by Seth van Hooland, Ruben Verborgh, and Max De Wilde.

From the post:

Don’t take your data at face value. That is the key message of this tutorial which focuses on how scholars can diagnose and act upon the accuracy of data. In this lesson, you will learn the principles and practice of data cleaning, as well as how OpenRefine can be used to perform four essential tasks that will help you to clean your data:

  1. Remove duplicate records
  2. Separate multiple values contained in the same field
  3. Analyse the distribution of values throughout a data set
  4. Group together different representations of the same reality

These steps are illustrated with the help of a series of exercises based on a collection of metadata from the Powerhouse museum, demonstrating how (semi-)automated methods can help you correct the errors in your data.


And from the conclusion:

If you only remember one thing from this lesson, it should be this: all data is dirty, but you can do something about it. As we have shown here, there is already a lot you can do yourself to increase data quality significantly. First of all, you have learned how you can get a quick overview of how many empty values your dataset contains and how often a particular value (e.g. a keyword) is used throughout a collection. This lesson also demonstrated how to solve recurrent issues such as duplicates and spelling inconsistencies in an automated manner with the help of OpenRefine. Don’t hesitate to experiment with the cleaning features, as you’re performing these steps on a copy of your data set, and OpenRefine allows you to trace back all of your steps in case you have made an error.

It is so rare that posts have strong introductions and conclusions that I had to quote both of them.

Great introduction to OpenRefine.
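For readers who think in code, the four tasks from the lesson's introduction can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the records are invented (they are not the Powerhouse museum metadata), the `|` separator is a guess at how multiple values might be packed into one field, and the lower-case-and-strip key is a crude stand-in for OpenRefine's clustering, not a description of how it works.

```python
from collections import Counter

# Hypothetical records, invented for this sketch.
RECORDS = [
    {"id": "1", "categories": "Glass|Ceramics"},
    {"id": "2", "categories": "ceramics "},
    {"id": "1", "categories": "Glass|Ceramics"},  # exact duplicate of the first
    {"id": "3", "categories": ""},
]


def remove_duplicates(records):
    """Task 1: keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def split_values(field, separator="|"):
    """Task 2: split a multi-valued field, dropping empty pieces."""
    return [piece for piece in field.split(separator) if piece.strip()]


def normalise(value):
    """Task 4: a crude key so 'Ceramics' and 'ceramics ' group together."""
    return value.strip().lower()


unique = remove_duplicates(RECORDS)
values = [v for r in unique for v in split_values(r["categories"])]

# Task 3: distribution of values, counted on the normalised key.
distribution = Counter(normalise(v) for v in values)
empty = sum(1 for r in unique if not r["categories"].strip())

print(distribution)
print("records with no category:", empty)
```

Note that the choice of key in `normalise` is exactly where one person's "clean" becomes another's "dirty".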

I fully agree that all data is dirty, and that you can do something about it.

However, data is dirty or clean only from a certain point of view.

You may “clean” data in a way that makes it incompatible with my input methods. For me, the data remains “dirty.”

Or to put it another way, data cleaning is like housekeeping. It comes around day after day. You may as well plan for it.