Archive for the ‘CSV’ Category

Cleaning CSV Data… [Interview Questions?]

Friday, December 11th, 2015

Cleaning CSV Data Using the Command Line and csvkit, Part 1 by Srini Kadamati.

From the post:

The Museum of Modern Art is one of the most influential museums in the world and they have released a dataset on the artworks in their collection. The dataset has some data quality issues, however, and requires cleanup.

In a previous post, we discussed how we used Python and Pandas to clean the dataset. In this post, we’ll learn about how to use the csvkit library to acquire and explore tabular data.

Why the command line?

Great question! When working in cloud data science environments, you sometimes only have access to a server’s shell. In these situations, proficiency with command line data science is a true superpower. As you become more proficient, using the command line for some data science tasks is much quicker than writing a Python script or a Hadoop job. Lastly, the command line has a rich ecosystem of tools and integration into the file system. This makes certain kinds of tasks, especially those involving multiple files, incredibly easy.

Some experience working in the command line is expected for this post. If you’re new to the command line, I recommend checking out our interactive command line course.

csvkit

csvkit is a library optimized for working with CSV files. It’s written in Python but the primary interface is the command line. You can install csvkit using pip:

pip install csvkit

You’ll need this library to follow along with this post.

If you want to be a successful data scientist, may I suggest you follow this series and similar posts on data cleaning techniques?

Reports vary but the general figure is 50% to 90% of the time of a data scientist is spent cleaning data. Report: Data scientists spend bulk of time cleaning up

Being able to clean data, the 50% to 90% of your future duties, may not get you past the data scientist interview.

There are several 100+ data scientist interview question sets that don’t have any questions about data cleaning.

Seriously, not a single question.

I won’t name names in order to protect the silly but can say that SAS does have one data cleaning question out of twenty. Err, that’s 5% for those of you comparing to the duties of a data scientist at 50% to 90%. Of course the others I reviewed, had 0% out of 50% to 90% so they were even worse.

Oh, the SAS question on data cleaning:

Give examples of data cleaning techniques you have used in the past.

You have to wonder about a data science employer who asks so many questions unrelated to the day to day duties of data scientists.

Maybe when asked some arcane question you can ask back:

An when in the last six (6) months has your average data scientist hire used that concept/technique?

It might not land you a job but do you really want to work at a firm that can’t apply data science to its own hiring process?

Data science employers, heal yourselves!

PS: I rather doubt most data science interviewers understand the epistemological assumptions behind most algorithms so you can memorize a bit of that for your interview.

Will convince them customers will believe your success is just short of divine intervention in their problem.

It’s an old but reliable technique.

New CSV on the Web Drafts (CVS From RDF?)

Thursday, April 16th, 2015

Four Drafts Published by the CSV on the Web Working Group

From the post:

The CSV on the Web Working Group has published a new set of Working Drafts, which the group considers feature complete and implementable.

The group is keen to receive comments on these specifications, either as issues on the Group’s GitHub repository or by posting to public-csv-wg-comment target=”_blank”s@w3.org.

The CSV on the Web Working Group would also like to invite people to start implementing these specifications and to donate their test cases into the group’s test suite. Building this test suite, as well as responding to comments, will be the group’s focus over the next couple of months.

Learn more about the CSV on the Web Working Group.

If nothing else, Model for Tabular Data and Metadata on the Web, represents a start on documenting a largely undocumented data format. Perhaps the most common undocumented data format of all. I say that, there may be a dissertation or even a published book that has collected all the CSV variants at a point in time. Sing out if you know of such a volume.

Will be interested to see if the group issues a work product entitled: Generating CSV from RDF on the Web. Our return to data opacity will be complete.

Not that I have any objections to CSV, compact, easy to parse (well, perhaps not correctly but I didn’t say that), widespread, but it isn’t the best format for blind interchange, which by definition includes legacy data. Odd that in a time of practically limitless storage and orders of magnitude faster processing, that data appears to be seeking less documented semantics.

Or is that the users and producers of data prefer data with less documented semantics? I knew an office manager once upon a time who produced reports from unshared cheatsheets using R Writer. Was loathe to share field information with anyone. Seemed like a very sad way to remain employed.

Documenting data semantics isn’t going to obviate the need for the staffs who previously concealed data semantics. Just as data is dynamic so are its semantics which will require the same staffs to continually update and document the changing semantics of data. Same staffs for the most part, just more creative duties.

q – Text as Data

Tuesday, April 7th, 2015

q – Text as Data by Harel Ben-Attia.

From the webpage:

q is a command line tool that allows direct execution of SQL-like queries on CSVs/TSVs (and any other tabular text files).

q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs etc. It supports automatic column name and column type detection, and provides full support for multiple encodings.

q’s web site is http://harelba.github.io/q/. It contains everything you need to download and use q in no time.

I’m not looking for an alternative to awk or sed for CSV/TSV files but you may be.

From the examples I suspect it would be “easier” in some sense of the word to teach than either awk or sed.

Give it a try and let me know what you think.

I first saw this in a tweet by Scott Chamberlain.

In Praise of CSV

Wednesday, March 11th, 2015

In Praise of CSV by Waldo Jaquith

From the post:

Comma Separated Values is the file format that open data advocates love to hate. Compared to JSON, CSV is clumsy; compared to XML, CSV is simplistic. Its reputation is as a tired, limited, it’s-better-than-nothing format. Not only is that reputation is undeserved, but CSV should often be your first choice when publishing data.

It’s true—CSV is tired and limited, though superior to not having data, but there’s another side to those coins. One man‘s tired is another man’s established. One man’s limited is another man’s focused. And “better than nothing” is in, fact, better than nothing, which is frequently the alternative to producing CSV.

A bit further on:


The lack of typing makes schemas generally impractical, and as a result validation of field contents is also generally impractical.

There is ongoing work to improve that situation at the CSV on the Web Working Group (W3C). As of today, see: Metadata Vocabulary for Tabular Data, W3C Editor’s Draft 11 March 2015.

The W3C work is definitely a step in the right direction but even if you “know” a field heading or its data type, do you really “know” the semantics of that field? Assume you have a floating point number, is that “pound-seconds” or “newton-seconds?” Mars orbiters really need to know.

Perhaps CSV files are nearly the darkest dark data with a structure. Even with field names and data types, the semantics of any field and/or its relationship to other fields, remains a mystery.

It may be the case that within a week, month or year, someone may remember the field semantics but what of ten (10) years or even one hundred (100) years from now?

CSV on the Web:… [ .csv 5,250,000, .rdf 72,700]

Thursday, January 8th, 2015

CSV on the Web: Metadata Vocabulary for Tabular Data, and Their Conversion to JSON and RDF

From the post:

The CSV on the Web Working Group has published First Public Working Drafts of the Generating JSON from Tabular Data on the Web and the Generating RDF from Tabular Data on the Web documents, and has also issued new releases of the Metadata Vocabulary for Tabular Data and the Model for Tabular Data and Metadata on the Web Working Drafts. A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. Validation, conversion, display, and search of that tabular data requires additional information on that data. The “Metadata vocabulary” document defines a vocabulary for metadata that annotates tabular data, providing such information as datatypes, linkage among different tables, license information, or human readable description of columns. The standard conversion of the tabular data to JSON and/or RDF makes use of that metadata to provide representations of the data for various applications. All these technologies rely on a basic data model for tabular data described in the “Model” document. The Working Group welcomes comments on these documents and on their motivating use cases. Learn more about the Data Activity.

These are working drafts and as such have a number of issues noted in the text of each one. Excellent opportunity to participate in the W3C process.

There aren’t any reliable numbers but searching for “.csv” returns 5,250,000 “hits” and searching on “.rdf” returns 72,700 “hits.”

That sound really low for CSV and doesn’t include all the CSV files on local systems.

Still, I would say that CSV files continue to be important and that this work merits your attention.

Using Load CSV in the Real World

Saturday, November 22nd, 2014

Using Load CSV in the Real World by Nicole White.

From the description:

In this live-coding session, Nicole will demonstrate the process of downloading a raw .csv file from the Internet and importing it into Neo4j. This will include cleaning the .csv file, visualizing a data model, and writing the Cypher query that will import the data. This presentation is meant to make Neo4j users aware of common obstacles when dealing with real-world data in .csv format, along with best practices when using LOAD CSV.

A webinar with substantive content and not marketing pitches! Unusual but it does happen.

A very good walk through importing a CSV file into Neo4j, with some modeling comments along the way and hints of best practices.

The “next” thing for users after a brief introduction to graphs and Neo4j.

The experience will build their confidence and they will learn from experience what works best for modeling their data sets.

Loading CSV files into Neo4j

Wednesday, October 22nd, 2014

Loading CSV files into Neo4j is so easy that it has taken only three (3) posts, so far, to explain the process. This post is a collection of loading CSV into Neo4j references. If you have others, feel free to contribute them and I will add them to this post.

LOAD CSV into Neo4j quickly and successfully by Michael Hunger on Jun 25, 2014.

Note: You can also read an interactive and live version of this blog post as a Neo4j GraphGist.

Since version 2.1 Neo4j provides out-of-the box support for CSV ingestion. The LOAD CSV command that was added to the Cypher Query language is a versatile and powerful ETL tool.

It allows you to ingest CSV data from any URL into a friendly parameter stream for your simple or complex graph update operation, that … conversion.

The June 25, 2014 post has content that is not repeated in the Oct. 18, 2014 post on loading CSV so you will need both posts, or a very fine memory.

Flexible Neo4j Batch Import with Groovy by Michael Hunger on Oct 9, 2014.

You might have data as CSV files to create nodes and relationships from in your Neo4j Graph Database.

It might be a lot of data, like many tens of million lines.

Too much for LOAD CSV to handle transactionally.

Usually you can just fire up my batch-importer and prepare node and relationship files that adhere to its input format requirements.

What follows is advice on when you may want to deviate from the batch-importer defaults and how to do so.

LOAD CVS with SUCCESS by Michael Hunger on Oct 18, 2014.

I have to admit that using our LOAD CSV facility is trickier than you and I would expect.

Several people ran into issues that they could not solve on their own.

My first blog post on LOAD CSV is still valid in it own right, and contains important aspects that I won’t repeat here.

Incomplete so reference LOAD CSV into Neo4j quickly and successfully while reading this post.

Others?

CSV Fingerprints

Thursday, August 21st, 2014

CSV Fingerprints by Victor Powell.

From the post:

CSV is a simple and common format for tabular data that uses commas to separate rows and columns. Nearly every spreadsheet and database program lets users import from and export to CSV. But until recently, these programs varied in how they treated special cases, like when the data itself has a comma in it.

It’s easy to make a mistake when you try to make a CSV file fit a particular format. To make it easier to spot mistakes, I’ve made a “CSV Fingerprint” viewer (named after the “Fashion Fingerprints” from The New York Times’s “Front Row to Fashion Week” interactive ). The idea is to provide a birdseye view of the file without too much distracting detail. The idea is similar to Tufte’s Image Quilts…a qualitative view, as opposed to a rendering of the data in the file themselves. In this sense, the CSV Fingerprint is a sort of meta visualization.

This is very clever. Not only can you test a CSV snippet on the webpage, but the source code is on Github. https://github.com/setosa/csv-fingerprint (source code)

Of course, it does rely on the most powerful image processing system known to date. Err, that would be you. 😉

Pass this along. I can imagine any number of data miners who will be glad you did.

CSV validator – a new digital preservation tool

Tuesday, July 15th, 2014

CSV validator – a new digital preservation tool by David Underdown.

From the post:

Today marks the official release of a new digital preservation tool developed by The National Archives, CSV Validator version 1.0. This follows on from well known tools such as DROID and PRONOM database used in file identification (discussed in several previous blog posts). The release comprises the validator itself, but perhaps more importantly, it also includes the formal specification of a CSV schema language for describing the allowable content of fields within CSV (Comma Separated Value) files, which gives something to validate against.

Odd to find two presentations about CSV on the same day!

Adam Retter presented on this project today. slides.

It will be interesting to see how much cross-pollination occurs with the CSV on the Web Working Group.

Suggest you follow both groups.

CSV on the Web

Tuesday, July 15th, 2014

CSV on the Web – What’s Happening in the W3C Working Group by Jeni Tennison.

After seeing Software Carpentry: Lessons Learned yesterday, I have a new appreciation for documenting the semantics of data as used by its users.

Not to say we don’t need specialized semantic syntaxes and technologies, but if we expect market share, then we need to follow the software and data users are using.

How important is CSV?

Jeni gives that stats as:

  • >90% open data is tabular
  • 2/3rds “CSV” files on data.gov.uk aren’t machine readable

Which means people use customized solutions (read vendor lockin).

A good overview of the CSV WG’s work so far with a request for your assistance:

I need to start following this workgroup. Curious to see if they reuse XQuery addressing to annotate CSV files, columns, rows, cells.

PS: If you don’t see arrows in the presentation, I didn’t, use your space bar to change slides and Esc to see all the slides.

Importing CSV data into Neo4j…

Thursday, June 12th, 2014

Importing CSV data into Neo4j to make a graph by Samantha Zeitlin.

From the post:

Thanks to a friend who wants to help more women get into tech careers, last year I attended Developer Week, where I was impressed by a talk about Neo4j.

Graph databases excited me right away, since this is a concept I’ve used for brainstorming since 3rd grade, when my teachers Mrs. Nysmith and Weaver taught us to draw webbings as a way to take notes and work through logic puzzles.

Samantha is successful at importing CSV data into Neo4j but only after encountering an out-dated blog post, a stack overflow example and then learning there is a new version of the importer available.

True, many of us learned *nix from the man pages but while effective, I can’t really say it was an efficient way to learn *nix.

Most people have a task for your software. They are not seeking to mind meld with it or to take it up as a new religion.

Emphasize the ease of practical use of your software and you will gain devotees despite it being easy to use.

CSV on the Web

Thursday, March 27th, 2014

CSV on the Web Use Cases and Requirements, and Model for Tabular Data and Metadata Published

I swear, that really is the title.

Two recent drafts of interest:

The CSV on the Web: Use Cases and Requirements collects use cases that are at the basis of the work of the Working Group. A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. The Working Group aim to specify technologies that provide greater interoperability for data dependent applications on the Web when working with tabular datasets comprising single or multiple files using CSV, or similar, format. This document lists a first set of use cases compiled by the Working Group that are considered representative of how tabular data is commonly used within data dependent applications. The use cases observe existing common practice undertaken when working with tabular data, often illustrating shortcomings or limitations of existing formats or technologies. This document also provides a first set of requirements derived from these use cases that have been used to guide the specification design.

The Model for Tabular Data and Metadata on the Web outlines a basic data model, or infoset, for tabular data and metadata about that tabular data. The document contains first drafts for various methods of locating metadata: one of the output the Working Group is chartered for is to produce a metadata vocabulary and standard method(s) to find such metadata. It also contains some non-normative information about a best practice syntax for tabular data, for mapping into that data model, to contribute to the standardisation of CSV syntax by IETF (as a possible update of RFC4180).

I guess they mean to use CSV as it exists? What a radical concept. 😉

What next?

Could use an updated specification for the COBOL data format in which many government data sets are published (even now).

That last statement isn’t entirely in jest. There is a lot of COBOL formatted files on government websites in particular.

Semantic Web Rides Into the Sunset

Wednesday, December 11th, 2013

W3C’s Semantic Web Activity Folds Into New Data Activity by Jennifer Zaino.

From the post:

The World Wide Web Consortium has headline news today: The Semantic Web, as well as eGovernment, Activities are being merged and superseded by the Data Activity, where Phil Archer serves as Lead. Two new workgroups also have been chartered: CSV on the Web and Data on the Web Best Practices.

The new CSV on the Web Working Group is an important step in that direction, following on the heels of efforts such as R2RML. It’s about providing metadata about CSV files, such as column headings, data types, and annotations, and, with it, making it easily possible to convert CSV into RDF (or other formats), easing data integration. “The working group will define a metadata vocabulary and then a protocol for how to link data to metadata (presumably using HTTP Link headers) or embed the metadata directly. Since the links between data and metadata can work in either direction, the data can come from an API that returns tabular data just as easily as it can a static file,” says Archer. “It doesn’t take much imagination to string together a tool chain that allows you to run SPARQL queries against ’5 Star Data’ that’s actually published as a CSV exported from a spreadsheet.”

The Data on the Web Best Practices working group, he explains, will not define any new technologies but will guide data publishers (government, research scientists, cultural heritage organizations) to better use the Web as a data platform. Additionally, the Data Activity, as well as the new Digital Publishing Activity that will be lead by former Semantic Web Activity Lead Ivan Herman, are now in a new domain called the Information and Knowledge Domain (INK), led by Ralph Swick.

I will spare you all the tedious justification by Phil Archer of the Semantic Web venture.

The W3C is also the home of XSLT, XPath, XQuery, and other standards that require no defense or justification.

Maybe we will all get lucky and the CSV on the Web and Data on the Web Best Practices activities will be successful activities at the W3C.

DBpedia as Tables [21st Century Interchange]

Monday, November 25th, 2013

DBpedia as Tables

From the webpage:

As some of the potential users of DBpedia might not be familiar with the RDF data model and the SPARQL query language, we provide some of the core DBpedia 3.9 data also in tabular form as Comma-Separated-Values (CSV) files, which can easily be processed using standard tools, such as spreadsheet applications, relational databases or data mining tools.

For each class in the DBpedia ontology (such as Person, Radio Station, Ice Hockey Player, or Band) we provide a single CSV file which contains all instances of this class. Each instance is described by its URI, an English label and a short abstract, the mapping-based infobox data describing the instance (extracted from the English edition of Wikipedia), and geo-coordinates.

Altogether we provide 530 CVS files in the form of a single ZIP file (size 3 GB compressed and 73.4 GB when uncompressed).

The ZIP file can be downloaded here (3 GB).

😉

I have to admit that I applaud the move to release DBpedia as CSV files.

Despite my long time and continuing allegiance to XML, there are times when CSV is the optimum format for interchange.

You need to mark this date on your calendar.

I am curious what projects will appear using DBpedia data, based on the CSV version, in the next calendar year?

I first saw this in a tweet by Nicolas Torzec.

Linked CSV (Encouraging Development)

Tuesday, May 7th, 2013

Linked CSV by Jeni Tennison.

Abstract:

Many open data sets are essentially tables, or sets of tables, which follow the same regular structure. This document describes a set of conventions for CSV files that enable them to be linked together and to be interpreted as RDF.

An encouraging observation in the draft:

Linked CSV is built around the concept of using URIs to name things. Every record, column, and even slices of data, in a linked CSV file is addressable using URI Identifiers for the text/csv Media Type. For example, if the linked CSV file is accessed at http://example.org/countries, the first record in the CSV file above, which happens to be the first data line within the linked CSV file (which describes Andorra) is addressable with the URI:

http://example.org/countries#row:0

However, this addressing merely identifies the records within the linked CSV file, not the entities that the record describes. This distinction is important for two reasons:

  • a single entity may be described by multiple records within the linked CSV file
  • addressing entities and records separately enables us to make statements about the source of the information within a particular record

By default, each data line describes an entity, each entity is described by a single data line, and there is no way to address the entities. However, adding a $id column enables entities to be given identifiers. These identifiers are always URIs, and they are interpreted relative to the location of the linked CSV file. The $id column may be positioned anywhere but by convention it should be the first column (unless there is a # column, in which case it should be the second). For example:

Hopefully Jeni is setting a trend in Linked Data circles of distinguishing locations from entities.

I first saw this in Christophe Lalanne’s A bag of tweets / April 2013.

Splitting a Large CSV File into…

Monday, April 8th, 2013

Splitting a Large CSV File into Separate Smaller Files Based on Values Within a Specific Column by Tony Hirst.

From the post:

One of the problems with working with data files containing tens of thousands (or more) rows is that they can become unwieldy, if not impossible, to use with “everyday” desktop tools. When I was Revisiting MPs’ Expenses, the expenses data I downloaded from IPSA (the Independent Parliamentary Standards Authority) came in one large CSV file per year containing expense items for all the sitting MPs.

In many cases, however, we might want to look at the expenses for a specific MP. So how can we easily split the large data file containing expense items for all the MPs into separate files containing expense items for each individual MP? Here’s one way using a handy little R script in RStudio

Just because data is “open,” doesn’t mean it will be easy to use. (Leaving the useful question to one side.)

We have been kicking around idea for a “killer” topic map application.

What about a plug-in for a browser that recognizes file types and suggests tools for processing them?

I am unlikely to remember this post a year from now when I have a CSV file from some site.

But if a browser plugin recognized the extension, .csv, and suggested a list of tools for exploring it….

Particularly if the plug-in called upon some maintained site of tools, so the list of tools is maintained.

Or for that matter, that it points to other data explorers who have examined the same file (voluntary disclosure).

Not the full monty of topic maps but a start towards collectively enhancing our experience with data files.

Importing CSV Data into Neo4j

Tuesday, January 29th, 2013

A Python utility for importing CSV data into a Neo4j database. neo4j-table-data.

Update: TabLinker & UnTabLinker

Monday, November 19th, 2012

Update: TabLinker & UnTabLinker

From the post:

TabLinker, introduced in an earlier post, is a spreadsheet to RDF converter. It takes Excel/CSV files as input, and produces enriched RDF graphs with cell contents, properties and annotations using the DataCube and Open Annotation vocabularies.

TabLinker interprets spreadsheets based on hand-made markup using a small set of predefined styles (e.g. it needs to know what the header cells are). Work package 6 is currently investigating whether and how we can perform this step automatically.

Features:

  • Raw, model-agnostic conversion from spreadsheets to RDF
  • Interactive spreadsheet marking within Excel
  • Automatic annotation recognition and export with OA
  • Round-trip conversion: revive the original spreadsheet files from the produced RDF (UnTabLinker)

Even with conversion tools, the question has to be asked:

What was gained by the conversion? Yes, yes the data is now an RDF graph but what can I do now that I could not do before?

With the caveat that it has to be something I want to do.

Unix: Counting the number of commas on a line

Friday, November 16th, 2012

Unix: Counting the number of commas on a line by Mark Needham.

From the post:

A few weeks ago I was playing around with some data stored in a CSV file and wanted to do a simple check on the quality of the data by making sure that each line had the same number of fields.

Marks offers two solutions to the problem, but concedes that more may exist.

A good first round sanity check to run on data stored in a CSV file.

Other one-liners you find useful for data analysis?

Java: Parsing CSV files

Sunday, September 23rd, 2012

Java: Parsing CSV files by Mark Needham

Mark is switching to OpenCSV.

See his post for how he is using OpenCSV and other info.

Batch Importer – Neo4j

Wednesday, March 7th, 2012

By Max De Marzi.

From part 1:

Data is everywhere… all around us, but sometimes the medium it is stored in can be a problem when analyzing it. Chances are you have a ton of data sitting around in a relational database in your current application… or you have begged, borrowed or scraped to get the data from somewhere and now you want to use Neo4j to find how this data is related.

Batch Importer – Part 1: CSV files.

Batch Importer – Part 2: Use of SQL to prepare files for import.

What other importers would you need for Neo4j? Or would you use CSV as a target format for loading into Neo4j?

csvkit 0.4.2 (beta)

Thursday, January 19th, 2012

csvkit 0.4.2 (beta)

From the webpage:

csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats.

It is inspired by pdftk, gdal and the original csvcut utility by Joe Germuska and Aaron Bycoffe.

Important links:

Something for the toolbox that may prove to be useful.