Tuesday, May 7th, 2013



Many open data sets are essentially tables, or sets of tables, which follow the same regular structure. This document describes a set of conventions for CSV files that enable them to be linked together and to be interpreted as RDF.



Linked CSV is built around the concept of using URIs to name things. Every record, column, and even slices of data, in a linked CSV file is addressable using URI Fragment Identifiers for the text/csv Media Type. For example, if the linked CSV file is accessed at http://example.org/countries, the first record in the CSV file above, which happens to be the first data line within the linked CSV file (which describes Andorra) is addressable with the URI:

http://example.org/countries#row:0

However, this addressing merely identifies the records within the linked CSV file, not the entities that the record describes. This distinction is important for two reasons:

• a single entity may be described by multiple records within the linked CSV file
• addressing entities and records separately enables us to make statements about the source of the information within a particular record
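To make the record addressing concrete, here is a minimal Python sketch that pairs each data line with a `#row:N` URI. The `#row:` syntax and the zero-based numbering are copied from the quoted example and have not been checked against the fragment identifier specification itself. The function name `row_uris` and the sample columns are hypothetical.

```python
import csv
import io


def row_uris(base_uri, csv_text):
    """Yield (record_uri, record) pairs for each data line.

    Assumption: data lines are numbered from 0, as in the quoted
    example, and the header line is not counted.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # the header line is not a data line
    for index, values in enumerate(reader):
        yield f"{base_uri}#row:{index}", dict(zip(header, values))


# Hypothetical sample data; the column names are not from the original post.
sample = "country,name\nAD,Andorra\nAE,United Arab Emirates\n"
pairs = list(row_uris("http://example.org/countries", sample))

for uri, record in pairs:
    print(uri, record["name"])
```

Note that these URIs name the records only. Nothing here identifies Andorra itself, which is exactly the distinction the excerpt goes on to draw.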

By default, each data line describes an entity, each entity is described by a single data line, and there is no way to address the entities. However, adding a $id column enables entities to be given identifiers. These identifiers are always URIs, and they are interpreted relative to the location of the linked CSV file. The $id column may be positioned anywhere but by convention it should be the first column (unless there is a # column, in which case it should be the second). For example:

Hopefully Jeni is setting a trend in Linked Data circles of distinguishing locations from entities.
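A rough Python sketch of that distinction, assuming only what the excerpt says: a `$id` value is a URI interpreted relative to the location of the file, and a record without one describes an entity that cannot be addressed. The function name `entity_uris` and the sample identifiers are hypothetical, and the `#row:` numbering follows the quoted example rather than any specification I have checked.

```python
import csv
import io
from urllib.parse import urljoin


def entity_uris(file_uri, csv_text):
    """Return (record_uri, entity_uri) pairs, one per data line.

    entity_uri is None when the record has no $id value, in which
    case the entity has no address of its own.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    result = []
    for index, record in enumerate(reader):
        record_uri = f"{file_uri}#row:{index}"  # where the statement lives
        raw_id = record.get("$id")
        # Resolve the identifier relative to the file's own location.
        entity_uri = urljoin(file_uri, raw_id) if raw_id else None
        result.append((record_uri, entity_uri))
    return result


# Hypothetical sample: a relative id, an absolute id, and a missing id.
sample = (
    "$id,country,name\n"
    "#AD,AD,Andorra\n"
    "http://example.net/ae,AE,United Arab Emirates\n"
    ",AF,Afghanistan\n"
)
results = entity_uris("http://example.org/countries", sample)

for record_uri, entity_uri in results:
    print(record_uri, "describes", entity_uri)
```

Keeping the two URIs separate is what lets you say where a claim came from (the record) without confusing that with the thing the claim is about (the entity).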

I first saw this in Christophe Lalanne’s A bag of tweets / April 2013.

### Splitting a Large CSV File into…

Monday, April 8th, 2013



One of the problems with working with data files containing tens of thousands (or more) rows is that they can become unwieldy, if not impossible, to use with “everyday” desktop tools. When I was Revisiting MPs’ Expenses, the expenses data I downloaded from IPSA (the Independent Parliamentary Standards Authority) came in one large CSV file per year containing expense items for all the sitting MPs.

In many cases, however, we might want to look at the expenses for a specific MP. So how can we easily split the large data file containing expense items for all the MPs into separate files containing expense items for each individual MP? Here’s one way using a handy little R script in RStudio
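The original post does this in R. For readers who would rather stay in Python, a comparable sketch using only the standard library might look like the following. It is not a translation of the R script. The function name `split_csv_by_column`, the file names, and the column name in the usage note are all assumptions, and the approach holds the whole file in memory, which should be fine for tens of thousands of rows but not for very large files.

```python
import csv
import re
from collections import defaultdict
from pathlib import Path


def split_csv_by_column(source, column, out_dir):
    """Write one CSV file per distinct value of `column`.

    Returns the sorted list of file names written. Values that differ
    only in punctuation map to the same file name and would overwrite
    each other; a real script should guard against that.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group rows by the value in the chosen column.
    groups = defaultdict(list)
    with open(source, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        for row in reader:
            groups[row[column]].append(row)

    written = []
    for key, rows in groups.items():
        # Build a file-system-friendly name from the column value.
        safe = re.sub(r"[^A-Za-z0-9]+", "_", key).strip("_") or "unknown"
        target = out_dir / f"{safe}.csv"
        with open(target, "w", newline="", encoding="utf-8") as out:
            writer = csv.DictWriter(out, fieldnames=fieldnames)
            writer.writeheader()  # each output file keeps the original header
            writer.writerows(rows)
        written.append(target.name)
    return sorted(written)
```

Usage would be something like `split_csv_by_column("expenses.csv", "MP", "by_mp")`, where both the file name and the `MP` column are placeholders for whatever the downloaded data actually uses.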

Just because data is “open” doesn’t mean it will be easy to use. (Leaving the question of whether it is useful to one side.)

We have been kicking around ideas for a “killer” topic map application.

What about a plug-in for a browser that recognizes file types and suggests tools for processing them?

I am unlikely to remember this post a year from now when I have a CSV file from some site.

But if a browser plugin recognized the extension, .csv, and suggested a list of tools for exploring it….

Particularly if the plug-in drew on a curated site of tools, so that the list stays current.

Or for that matter, that it points to other data explorers who have examined the same file (voluntary disclosure).

Not the full monty of topic maps but a start towards collectively enhancing our experience with data files.

### Importing CSV Data into Neo4j

Tuesday, January 29th, 2013

neo4j-table-data: a Python utility for importing CSV data into a Neo4j database.

Monday, November 19th, 2012



TabLinker, introduced in an earlier post, is a spreadsheet to RDF converter. It takes Excel/CSV files as input, and produces enriched RDF graphs with cell contents, properties and annotations using the DataCube and Open Annotation vocabularies.

TabLinker interprets spreadsheets based on hand-made markup using a small set of predefined styles (e.g. it needs to know what the header cells are). Work package 6 is currently investigating whether and how we can perform this step automatically.



• Raw, model-agnostic conversion from spreadsheets to RDF
• Interactive spreadsheet marking within Excel
• Automatic annotation recognition and export with OA
• Round-trip conversion: revive the original spreadsheet files from the produced RDF (UnTabLinker)

Even with conversion tools, the question has to be asked:

What was gained by the conversion? Yes, yes, the data is now an RDF graph, but what can I do now that I could not do before?

With the caveat that it has to be something I want to do.

### Unix: Counting the number of commas on a line

Friday, November 16th, 2012

Unix: Counting the number of commas on a line by Mark Needham.



A few weeks ago I was playing around with some data stored in a CSV file and wanted to do a simple check on the quality of the data by making sure that each line had the same number of fields.

Mark offers two solutions to the problem, but concedes that more may exist.

A good first round sanity check to run on data stored in a CSV file.
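Mark's solutions are Unix one-liners, so see his post for those. As a complement, here is a small Python sketch of the same sanity check, which tallies how many rows have each field count. It also shows why counting raw commas can mislead when a quoted field contains a comma. The function names and the sample data are my own, not from Mark's post.

```python
import csv
import io
from collections import Counter


def field_count_report(lines):
    """Tally rows by number of fields, honouring CSV quoting."""
    counts = Counter()
    for row in csv.reader(lines):
        counts[len(row)] += 1
    return counts


def naive_comma_counts(lines):
    """Tally lines by raw comma count, ignoring quoting."""
    return Counter(line.count(",") for line in lines)


# Hypothetical sample: the second line has a comma inside a quoted
# field, and the third line is genuinely one field short.
sample = 'a,b,c\n1,"x, y",3\n4,5\n'

parsed = field_count_report(io.StringIO(sample))
naive = naive_comma_counts(sample.splitlines())

print("fields per row:", dict(parsed))
print("commas per line:", dict(naive))
```

If the parsed report has more than one key, some rows are ragged. If only the naive count disagrees, the culprit is probably quoted commas rather than bad data.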

Other one-liners you find useful for data analysis?

### Java: Parsing CSV files

Sunday, September 23rd, 2012

Java: Parsing CSV files by Mark Needham

Mark is switching to OpenCSV.

See his post for how he is using OpenCSV and other info.

### Batch Importer – Neo4j

Wednesday, March 7th, 2012

By Max De Marzi.



Data is everywhere… all around us, but sometimes the medium it is stored in can be a problem when analyzing it. Chances are you have a ton of data sitting around in a relational database in your current application… or you have begged, borrowed or scraped to get the data from somewhere and now you want to use Neo4j to find how this data is related.

Batch Importer – Part 1: CSV files.

Batch Importer – Part 2: Use of SQL to prepare files for import.

What other importers would you need for Neo4j? Or would you use CSV as a target format for loading into Neo4j?

### csvkit 0.4.2 (beta)

Thursday, January 19th, 2012

csvkit 0.4.2 (beta)



csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats.

It is inspired by pdftk, gdal and the original csvcut utility by Joe Germuska and Aaron Bycoffe.