Use multiple CPU Cores with your Linux commands — awk, sed, bzip2, grep, wc, etc.
From the post:
Here’s a common problem: you want to sum up a very large list (hundreds of megabytes), grep through it, or run some other embarrassingly parallel operation over it. Data scientists, I am talking to you. You probably have four cores or more, but our tried and true tools like grep, bzip2, wc, awk, sed and so forth are single-threaded and will use just one CPU core. To paraphrase Cartman, “How do I reach these cores?” Let’s use all of the CPU cores on our Linux box with GNU Parallel, doing a little in-machine map-reduce magic via the little-known parameter --pipe (otherwise known as --spreadstdin). Your pleasure is proportional to the number of CPUs, I promise.

BZIP2: So, bzip2 offers better compression than gzip, but it’s so slow! Put down the razor, we have the technology to solve this.
A very interesting post, particularly if you explore data with traditional Unix tools.
A comment on the post notes that the article presumes an SSD.
Perhaps, but the technique will still be worth knowing when SSDs are standard.
I first saw this in Pete Warden’s Five short links for Thursday, October 31, 2013.