Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 21, 2013

7 command-line tools for data science

Filed under: Data Mining,Data Science,Extraction — Patrick Durusau @ 4:54 pm

7 command-line tools for data science by Jeroen Janssens.

From the post:

Data science is OSEMN (pronounced as awesome). That is, it involves Obtaining, Scrubbing, Exploring, Modeling, and iNterpreting data. As a data scientist, I spend quite a bit of time on the command-line, especially when there's data to be obtained, scrubbed, or explored. And I'm not alone in this. Recently, Greg Reda discussed how the classics (e.g., head, cut, grep, sed, and awk) can be used for data science. Prior to that, Seth Brown discussed how to perform basic exploratory data analysis in Unix.

I would like to continue this discussion by sharing seven command-line tools that I have found useful in my day-to-day work. The tools are: jq, json2csv, csvkit, scrape, xml2json, sample, and Rio. (The home-made tools scrape, sample, and Rio can be found in this data science toolbox.) Any suggestions, questions, comments, and even pull requests are more than welcome.

Jeroen covers:

  1. jq – sed for JSON
  2. json2csv – convert JSON to CSV
  3. csvkit – suite of utilities for converting to and working with CSV
  4. scrape – HTML extraction using XPath or CSS selectors
  5. xml2json – convert XML to JSON
  6. sample – when you’re in debug mode
  7. Rio – making R part of the pipeline
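
To give a feel for the kind of step these tools automate, here is a minimal Python sketch of a JSON-to-CSV conversion, the job the list assigns to json2csv. It is not Jeroen's code or the json2csv implementation: the function name `json_lines_to_csv`, the one-object-per-line input, and the sample records are all assumptions made for illustration.

```python
import csv
import io
import json


def json_lines_to_csv(json_lines, fields):
    """Convert an iterable of JSON object strings to CSV text with a header row.

    Hypothetical helper for illustration; not part of any tool in the list.
    """
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the input
        record = json.loads(line)
        # Keys missing from a record become empty cells;
        # keys not named in `fields` are dropped.
        writer.writerow({name: record.get(name, "") for name in fields})
    return out.getvalue()


if __name__ == "__main__":
    # Made-up sample records, one JSON object per line.
    rows = [
        '{"name": "jq", "kind": "json"}',
        '{"name": "csvkit", "kind": "csv", "extra": 1}',
    ]
    print(json_lines_to_csv(rows, ["name", "kind"]), end="")
```

On the command line the same step would be a single stage in a pipe, which is the appeal of the tools in the post: each does one conversion and hands the result on.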

Readers suggest fourteen (14) more tools at the bottom of the post.

Some definite additions to the tool belt here.

I first saw this in Pete Warden’s Five Short Links, October 19, 2013.
