7 command-line tools for data science by Jeroen Janssens.
From the post:
Data science is OSEMN (pronounced as awesome). That is, it involves Obtaining, Scrubbing, Exploring, Modeling, and iNterpreting data. As a data scientist, I spend quite a bit of time on the command-line, especially when there's data to be obtained, scrubbed, or explored. And I'm not alone in this. Recently, Greg Reda discussed how the classics (e.g., head, cut, grep, sed, and awk) can be used for data science. Prior to that, Seth Brown discussed how to perform basic exploratory data analysis in Unix.
I would like to continue this discussion by sharing seven command-line tools that I have found useful in my day-to-day work. The tools are: jq, json2csv, csvkit, scrape, xml2json, sample, and Rio. (The home-made tools scrape, sample, and Rio can be found in this data science toolbox.) Any suggestions, questions, comments, and even pull requests are more than welcome.
Jeroen covers:
- jq – sed for JSON
- json2csv – convert JSON to CSV
- csvkit – suite of utilities for converting to and working with CSV
- scrape – HTML extraction using XPath or CSS selectors
- xml2json – convert XML to JSON
- sample – when you’re in debug mode
- Rio – making R part of the pipeline
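To make two of those one-line descriptions concrete, here is a minimal Python sketch of the ideas behind "convert JSON to CSV" and "sample". It is not how json2csv or sample are implemented, and it does not reproduce their command-line options. The function names `json_lines_to_csv` and `sample_lines`, the newline-delimited JSON input, and the `rate` and `seed` parameters are all assumptions made for illustration.

```python
import csv
import io
import json
import random


def json_lines_to_csv(lines, fields):
    """Turn newline-delimited JSON objects into CSV text.

    Hypothetical helper: only the keys named in `fields` become columns;
    keys missing from a record are written as empty cells.
    """
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow(json.loads(line))
    return out.getvalue()


def sample_lines(lines, rate, seed=None):
    """Keep each line independently with probability `rate` (0.0 to 1.0).

    Hypothetical helper: handy for trying a pipeline on a small subset
    of the input before running it on everything.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]


if __name__ == "__main__":
    records = [
        '{"name": "a", "n": 1, "extra": true}',
        '{"name": "b"}',
    ]
    print(json_lines_to_csv(sample_lines(records, rate=1.0), ["name", "n"]))
```

Chaining the two functions mirrors the way the command-line tools are piped together: thin the input first, then reshape it.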
Readers suggest fourteen (14) more tools at the bottom of the post.
Some definite additions to the tool belt here.
I first saw this in Pete Warden’s Five Short Links, October 19, 2013.