Archive for the ‘Awk’ Category

Cops Driving Cabs – Not Just Moonlighting (Awk)

Wednesday, May 25th, 2016

NYPD has at least five undercover ‘Cop Cabs’ by Matthew Guariglia.

Matthew walks you through the process of inferring that the New York Police Department has at least five (5) vehicles that look like taxi cabs.

Or at least they have taxi cab emblems.

A patrol car with a taxi cab emblem would look out of place.

A good lesson in persistence, asking more than one source and collating information.

Just for grins, I downloaded the Medallion Vehicles – Authorized file as a CSV file, said to contain 14,265,362 lines; as of today it runs a little over 2 GB.

I was curious under what name the TLC issued the cop medallions. It is unlikely they were added to a third-party account, because of property tax issues. Would they have made up a different owner for each of the five medallions? Or would they use a common owner for all five?

It is possible that they created the five medallions “off the books,” but that seems unlikely as well: they would want to tie them to license plates.

First observation on the data: The “name” field appears variously with enclosing quotes and no quotes at all.

For example:

License Number,Name,Expiration Date,Current Status,DMV License Plate Number,
Vehicle VIN Number,Vehicle Type,Model Year,Medallion Type,Agent Number,
Agent Name,Agent Telephone Number,Agent Website Address,Agent Address,
Last Date Updated,Last Time Updated


MUST DRIVE,000,,,,,03/12/2014,13:20
NAMED DRIVER,000,,,,,03/03/2014,13:20
MUST DRIVE,000,,,,,05/24/2014,13:20
WOODSIDE NY 11377,01/21/2014,13:20
MUST DRIVE,0,,,,,07/19/2013,13:20

This data snippet has no significance other than the variation in the name field and the fields of the CSV file.

I used awk to extract the name field to a separate file:

awk 'BEGIN { FS = "," }; { print $2 }' < Medallion__Vehicles_-_Authorized.csv > taxi-names
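Since the name field shows up both with and without enclosing quotes (as noted above), the same owner could be split into two spellings and counted twice. A minimal sketch of a variant that strips the quotes before printing — the sample file here is made up, and a quoted name that itself contains a comma would still need a real CSV parser:

```shell
# Tiny stand-in for the TLC file; the real data is over 2 GB.
cat > medallions.csv <<'EOF'
5A12,"ACME TAXI LLC",05/31/2017
5B34,ACME TAXI LLC,06/30/2017
EOF

# Strip any quotes from the name field before printing, so
# "ACME TAXI LLC" and ACME TAXI LLC count as the same owner.
awk 'BEGIN { FS = "," } { gsub(/"/, "", $2); print $2 }' medallions.csv
```

Both rows now yield the identical name, so the later uniq -c pass counts them together.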

Then I sorted that file and used uniq with -c (for count) to create a list of the names with the number of times each occurs.

sort < taxi-names | uniq -c > taxi-unique-names

You will pick up a lot of data entry errors in this view: an extra space in a name, etc.

Then, because I am interested in names that occur only five (5) times, I re-sort the file to list names by the number of times they occur (this loses the view that reveals data entry errors):

sort -bn < taxi-unique-names > taxi-by-number

The -bn switches tell sort to ignore leading spaces and to sort in numeric order.
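From there, the names with a count of exactly five fall out with one more awk pass. A sketch using a made-up stand-in for taxi-by-number (the count-then-name layout matches what uniq -c produces):

```shell
# Stand-in for taxi-by-number: "count name" per line, as uniq -c emits it.
cat > taxi-by-number <<'EOF'
      1 JONES CAB CORP
      5 GOTHAM HACK LLC
     12 ACME TAXI LLC
EOF

# Print only the names whose count is exactly 5; blanking $1 and
# trimming the resulting leading space leaves just the name.
awk '$1 == 5 { $1 = ""; sub(/^ /, ""); print }' taxi-by-number
```

Any name printed here is a candidate owner holding exactly five medallions.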

I appreciate New York making this available as “open data” but the interface has a number of limitations.

Another way to approach Matthew’s question is to sort on the addresses, assuming TLC is billing a cop address and not 1060 West Addison. 😉

I haven’t tried this, but checking the property tax rolls against the TLC records might be a way to ferret out the cop-driven taxis. Unless the city has someone paying the taxes for them. Along with the usual graft, who would know?

Other ideas or suggestions to help Matthew flush out these cop-driven taxis?

You Can Master the Z Shell (Pointer to How-To)

Friday, March 4th, 2016

Cutting through the toxic atmosphere created by governments around the world requires the sharpest tools and well-developed skill at using them.

Unix shells are like a switchblade knife. Not the tool for every job, but if you need immediate results, it’s hard to beat. While you are opening an application, loading files, finding appropriate settings, etc., a quick shell command can have you on your way.

Nacho Caballero writes in Master Your Z Shell with These Outrageously Useful Tips:

If you had previously installed Zsh but never got around to exploring all of its magic features, this post is for you.

If you never thought of using a different shell than the one that came by default when you got your computer, I recommend you go out and check the Z shell. Here are some Linux guides that explain how to install it and set it as your default shell. You probably have Zsh installed if you are on a Mac, but there’s nothing like the warm fuzzy feeling of running the latest version (here’s a way to upgrade using Homebrew).

The Zsh manual is a daunting beast. Just the chapter on expansions has 32 subsections. Forget about memorizing this madness in one sitting. Instead, we’ll focus on understanding a few useful concepts, and referencing the manual for additional help.

The three main sections of this post are file picking, variable transformations, and magic tabbing. If you’re pressed for time, read the beginning of each one, and come back later to soak up the details (make sure you stick around for the bonus tips at the end). (emphasis in original)

Would-be authors/editors, want to try your hand at the chapter on expansions? Looking at the documentation for Zsh version 5.2, released December 2, 2015, there are 25 numbered subsections under chapter 14, Expansion.

You will be impressed by the number of modifiers/operators available. If you do write a manual for expansions in Zsh, do distribute it widely.

I hope it doesn’t get overlooked by being tacked on here, but Nacho also wrote: AWK GTF! How to Analyze a Transcriptome Like a Pro – Part 1 (2 and 3). Awk is another switchblade-like tool for your toolkit.

I first saw this in a tweet by Christophe Lalanne.

Command-line tools can be 235x faster than your Hadoop cluster

Wednesday, January 21st, 2015

Command-line tools can be 235x faster than your Hadoop cluster by Adam Drake.

From the post:

As I was browsing the web and catching up on some sites I visit periodically, I found a cool article from Tom Hayden about using Amazon Elastic Map Reduce (EMR) and mrjob in order to compute some statistics on win/loss ratios for chess games he downloaded from the millionbase archive, and generally have fun with EMR. Since the data volume was only about 1.75GB containing around 2 million chess games, I was skeptical of using Hadoop for the task, but I can understand his goal of learning and having fun with mrjob and EMR. Since the problem is basically just to look at the result lines of each file and aggregate the different results, it seems ideally suited to stream processing with shell commands. I tried this out, and for the same amount of data I was able to use my laptop to get the results in about 12 seconds (processing speed of about 270MB/sec), while the Hadoop processing took about 26 minutes (processing speed of about 1.14MB/sec). (emphasis added)

BTW, Adam was using twice as much data as Tom in his analysis.
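Adam’s approach boils down to a classic Unix stream-processing pipeline: read the game files, keep only the result lines, and aggregate. A minimal sketch of the idea — the sample data is made up, and Adam’s actual pipeline adds xargs and mawk to parallelize the work:

```shell
# Tiny stand-in for a PGN file; the real data came from the millionbase archive.
cat > games.pgn <<'EOF'
[White "A"]
[Result "1-0"]
[White "B"]
[Result "0-1"]
[White "C"]
[Result "1-0"]
EOF

# Keep only the result lines and count each distinct outcome.
grep '\[Result' games.pgn | sort | uniq -c
```

No cluster required: every stage streams, so memory use stays flat no matter how large the input grows.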

The lesson here is to not be a one-trick pony as a data scientist. Most solutions (Hadoop, Spark, Titan) can solve most problems. However, anyone who merits the moniker “data scientist” should be able to choose the “best” solution for a given set of circumstances. In some cases that may be simple shell scripts.

I first saw this in a tweet by Atabey Kaygun.

Advanced Bash-Scripting Guide

Monday, November 18th, 2013

Advanced Bash-Scripting Guide by Mendel Cooper.

I searched for an awk switch recently and ran across what I needed in an appendix to this book.

It is well written and has copious examples.

You can always fire up heavy duty tools but for many text processing tasks, shell scripts along with awk and sed are quite sufficient.

Unix: Counting the number of commas on a line

Friday, November 16th, 2012

Unix: Counting the number of commas on a line by Mark Needham.

From the post:

A few weeks ago I was playing around with some data stored in a CSV file and wanted to do a simple check on the quality of the data by making sure that each line had the same number of fields.

Mark offers two solutions to the problem, but concedes that more may exist.

A good first round sanity check to run on data stored in a CSV file.
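One way to run such a check (not necessarily either of Mark’s solutions): print awk’s field count per line and collapse the counts with uniq, so a clean file yields exactly one distinct count. The sample file here is made up:

```shell
# Stand-in CSV with one short (bad) row.
cat > data.csv <<'EOF'
a,b,c
d,e,f
g,h
EOF

# NF is the number of comma-separated fields on each line;
# a clean file produces exactly one distinct field count.
awk -F',' '{ print NF }' data.csv | sort | uniq -c
```

More than one line of output means at least one row has a different number of fields and deserves a closer look.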

Other one-liners you find useful for data analysis?