Creating Data from Text – Regular Expressions in OpenRefine by Tony Hirst.
From the post:
Although data can take many forms, when generating visualisations, running statistical analyses, or simply querying the data so we can have a conversation with it, life is often made much easier by representing the data in a simple tabular form. A typical format would have one row per item and particular columns containing information or values about one specific attribute of the data item. Where column values are text based, rather than numerical items or dates, it can also help if text strings are ‘normalised’, coming from a fixed, controlled vocabulary (such as items selected from a drop down list) or fixed pattern (for example, a UK postcode in its ‘standard’ form with a space separating the two parts of the postcode).
Tables are also quick to spot as data, of course, even if they appear in a web page or PDF document, where we may have to do a little work to get the data as displayed into a table we can actually work with in a spreadsheet or analysis package.
More often than not, however, we come across situations where a data set is effectively encoded into a more rambling piece of text. One of the testbeds I used to use a lot for practising my data skills was Formula One motor sport, and though I’ve largely had a year away from that during 2013, it’s something I hope to return to in 2014. So here’s an example from F1 of recreational data activity that provided a bit of entertainment for me earlier this week. It comes from the VivaF1 blog in the form of a collation of sentences, by Grand Prix, about the penalties issued over the course of each race weekend. (The original data is published via PDF based press releases on the FIA website.)
This is a great step-by-step example of extracting data from text using regular expressions in OpenRefine.
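To give a flavour of the idea outside OpenRefine, here is a minimal sketch in Python. It is not taken from Tony's post and does not use OpenRefine's own expression language. The sentences, the two patterns, and the `extract` helper are all hypothetical, invented purely for illustration; the real penalty sentences will be phrased differently and need their own patterns.

```python
import re

# Hypothetical sentences, made up for this sketch (not from VivaF1 or the FIA).
sentences = [
    "Driver A fined 1000 euros for speeding in the pit lane",
    "Driver B given a 5 place grid penalty for causing a collision",
    "A sentence that matches neither pattern",
]

# Assumed sentence shapes; named groups become the columns of the table.
FINE_RE = re.compile(
    r"^(?P<driver>.+?) fined (?P<amount>\d+) euros for (?P<reason>.+)$"
)
GRID_RE = re.compile(
    r"^(?P<driver>.+?) given a (?P<places>\d+) place grid penalty for (?P<reason>.+)$"
)


def extract(sentence):
    """Return one row (a dict) for a sentence, or None if no pattern matches."""
    m = FINE_RE.match(sentence)
    if m:
        return {
            "driver": m["driver"],
            "penalty": "fine",
            "value": int(m["amount"]),
            "reason": m["reason"],
        }
    m = GRID_RE.match(sentence)
    if m:
        return {
            "driver": m["driver"],
            "penalty": "grid drop",
            "value": int(m["places"]),
            "reason": m["reason"],
        }
    return None


# One row per matching sentence; unmatched sentences are skipped.
rows = [row for row in (extract(s) for s in sentences) if row is not None]

for row in rows:
    print(row)
```

Each matching sentence becomes one row with the same four columns, which is the "one row per item" tabular form the quoted passage describes. Sentences that match no pattern are dropped here; in practice you would want to review them and refine the patterns.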
If you don’t know OpenRefine, you should.
Debating possible or potential semantics is one thing.
Extracting, processing, and discovering the semantics of data is another.
In part because the latter is what most clients are willing to pay for. 😉
PS: Using OpenRefine is on sale now as an eBook for $5.00: http://www.packtpub.com/openrefine-guide-for-data-analysis-and-linking-dataset-to-the-web/book A tweet from Packt Publishing says the sale runs through January 3, 2014.