Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 20, 2013

Cleaning Data with OpenRefine

Filed under: Data Quality,OpenRefine — Patrick Durusau @ 2:54 pm

Cleaning Data with OpenRefine by Seth van Hooland, Ruben Verborgh, and Max De Wilde.

From the post:

Don’t take your data at face value. That is the key message of this tutorial which focuses on how scholars can diagnose and act upon the accuracy of data. In this lesson, you will learn the principles and practice of data cleaning, as well as how OpenRefine can be used to perform four essential tasks that will help you to clean your data:

  1. Remove duplicate records
  2. Separate multiple values contained in the same field
  3. Analyse the distribution of values throughout a data set
  4. Group together different representations of the same reality

These steps are illustrated with the help of a series of exercises based on a collection of metadata from the Powerhouse museum, demonstrating how (semi-)automated methods can help you correct the errors in your data.

(…)

If you only remember one thing from this lesson, it should be this: all data is dirty, but you can do something about it. As we have shown here, there is already a lot you can do yourself to increase data quality significantly. First of all, you have learned how you can get a quick overview of how many empty values your dataset contains and how often a particular value (e.g. a keyword) is used throughout a collection. This lesson also demonstrated how to solve recurrent issues such as duplicates and spelling inconsistencies in an automated manner with the help of OpenRefine. Don’t hesitate to experiment with the cleaning features, as you’re performing these steps on a copy of your data set, and OpenRefine allows you to trace back all of your steps in the case you have made an error.
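For readers who want to see the four tasks outside the OpenRefine interface, here is a minimal Python sketch. It is not taken from the tutorial: the records, the field names (id, categories), the "|" separator and the helper functions are all invented for illustration, and the fingerprint function is only a simplified take on the key-collision idea behind OpenRefine's clustering, not its actual implementation.

```python
import string
from collections import Counter, defaultdict

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"id": "1", "title": "Tea cup", "categories": "Ceramics|Tableware"},
    {"id": "1", "title": "Tea cup", "categories": "Ceramics|Tableware"},
    {"id": "2", "title": "Saucer", "categories": "ceramics |Tableware"},
    {"id": "3", "title": "Steam engine model", "categories": ""},
    {"id": "4", "title": "Locomotive",
     "categories": "Models, Engineering|Engineering Models"},
]


def remove_duplicates(rows, key="id"):
    """Task 1: keep only the first record seen for each key value."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def split_values(field, sep="|"):
    """Task 2: split a multi-valued field, dropping empty pieces."""
    return [v.strip() for v in field.split(sep) if v.strip()]


def fingerprint(value):
    """Simplified key: lowercase, strip punctuation, sort unique tokens."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = value.strip().lower().translate(table)
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Task 4: group distinct spellings that share the same fingerprint."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return {k: sorted(g) for k, g in groups.items() if len(g) > 1}


unique = remove_duplicates(records)
empty_count = sum(1 for r in unique if not r["categories"].strip())
all_values = [v for r in unique for v in split_values(r["categories"])]
distribution = Counter(all_values)   # Task 3: how often each value occurs
clusters = cluster(all_values)

print(len(unique), "unique records,", empty_count, "with no categories")
print(distribution.most_common())
print(clusters)
```

The clusters are only candidates for merging. As in OpenRefine, a person still has to decide whether "Models, Engineering" and "Engineering Models" really are the same thing.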

It is so rare that posts have strong introductions and conclusions that I had to quote both of them.

Great introduction to OpenRefine.

I fully agree that all data is dirty, and that you can do something about it.

However, data is dirty or clean only from a certain point of view.

You may “clean” data in a way that makes it incompatible with my input methods. For me, the data remains “dirty.”

Or to put it another way, data cleaning is like housekeeping. It comes around day after day. You may as well plan for it.
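The point about incompatible input methods can be made concrete with a small Python sketch. Everything in it is an assumption for illustration: the raw date, the choice of ISO 8601 as the "clean" form, and the two hypothetical consumers, one expecting ISO dates and one expecting month/day/year.

```python
from datetime import datetime


def clean_to_iso(raw):
    """One person's cleaning step: day/month/year input rewritten as ISO 8601."""
    return datetime.strptime(raw, "%d/%m/%Y").strftime("%Y-%m-%d")


def accepts(value, fmt):
    """A hypothetical consumer that only accepts dates in one format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


cleaned = clean_to_iso("20/08/2013")

print(cleaned)
print("ISO consumer accepts it:", accepts(cleaned, "%Y-%m-%d"))
print("Month/day/year consumer accepts it:", accepts(cleaned, "%m/%d/%Y"))
```

The same value is clean for the first consumer and dirty for the second, which is why the cleaning has to be planned for rather than done once.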
