Cleaning CSV Data Using the Command Line and csvkit, Part 1 by Srini Kadamati.
From the post:
The Museum of Modern Art is one of the most influential museums in the world and they have released a dataset on the artworks in their collection. The dataset has some data quality issues, however, and requires cleanup.
In a previous post, we discussed how we used Python and Pandas to clean the dataset. In this post, we’ll learn about how to use the csvkit library to acquire and explore tabular data.
Why the command line?
Great question! When working in cloud data science environments, you sometimes only have access to a server’s shell. In these situations, proficiency with command line data science is a true superpower. As you become more proficient, using the command line for some data science tasks is much quicker than writing a Python script or a Hadoop job. Lastly, the command line has a rich ecosystem of tools and integration into the file system. This makes certain kinds of tasks, especially those involving multiple files, incredibly easy.
Some experience working in the command line is expected for this post. If you’re new to the command line, I recommend checking out our interactive command line course.
csvkit is a library optimized for working with CSV files. It’s written in Python but the primary interface is the command line. You can install it using pip:
pip install csvkit
You’ll need this library to follow along with this post.
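To get a feel for the kind of exploration the post has in mind, here is a minimal sketch of a first session with csvkit. The file name sample_artworks.csv and its three rows are made up for illustration and are not the MoMA data; the commands shown (csvcut, csvlook, csvstat) ship with csvkit, and the guard simply skips them if csvkit is not installed.

```shell
# Create a tiny, hypothetical CSV to experiment with (NOT the MoMA dataset).
cat > sample_artworks.csv <<'EOF'
Title,Artist,Date
Untitled,Jane Doe,1950
Study,John Roe,c. 1962
EOF

# Only run the csvkit tools if they are actually on the PATH.
if command -v csvcut >/dev/null 2>&1; then
  # List the column names along with their positions.
  csvcut -n sample_artworks.csv

  # Select two columns by name and render them as a readable table.
  csvcut -c Title,Date sample_artworks.csv | csvlook

  # Print summary statistics for every column.
  csvstat sample_artworks.csv
else
  echo "csvkit not found; install it with: pip install csvkit"
fi
```

Note the "c. 1962" value in the Date column: it is the sort of inconsistency that makes a column hard to treat as numeric, and a quick look with csvstat or csvlook is often enough to spot it.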
If you want to be a successful data scientist, may I suggest you follow this series and similar posts on data cleaning techniques?
Reports vary, but the general figure is that 50% to 90% of a data scientist’s time is spent cleaning data. See Report: Data scientists spend bulk of time cleaning up.
Being able to clean data, the 50% to 90% of your future duties, may not get you past the data scientist interview.
There are several sets of 100+ data scientist interview questions that don’t have any questions about data cleaning.
Seriously, not a single question.
I won’t name names in order to protect the silly but can say that SAS does have one data cleaning question out of twenty. Err, that’s 5% for those of you comparing to the duties of a data scientist at 50% to 90%. Of course, the others I reviewed had 0% against that 50% to 90%, so they were even worse.
Oh, the SAS question on data cleaning:
Give examples of data cleaning techniques you have used in the past.
You have to wonder about a data science employer who asks so many questions unrelated to the day-to-day duties of data scientists.
Maybe when asked some arcane question you can ask back:
And when in the last six (6) months has your average data scientist hire used that concept/technique?
It might not land you a job but do you really want to work at a firm that can’t apply data science to its own hiring process?
Data science employers, heal yourselves!
PS: I rather doubt most data science interviewers understand the epistemological assumptions behind most algorithms so you can memorize a bit of that for your interview.
It will convince them that customers will believe your success is just short of divine intervention in their problem.
It’s an old but reliable technique.