Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 9, 2012

Hadoop Streaming Made Simple using Joins and Keys with Python

Filed under: Hadoop,Python,Stream Analytics — Patrick Durusau @ 10:48 am

From the post:

There are a lot of different ways to write MapReduce jobs!!!

Sample code for this post https://github.com/joestein/amaunet

I find streaming scripts a good way to interrogate data sets (especially when I have not worked with them yet or am creating new ones) and enjoy the lifecycle in which the initial elaboration of the data sets leads to the construction of the finalized scripts for an entire job (or series of jobs, as is often the case).

When doing streaming with Hadoop you do have a few library options. If you are a Ruby programmer then wukong is awesome! Python programmers can use dumbo or the more recently released mrjob.

I like working under the hood myself and getting down and dirty with the data and here is how you can too.
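To give a feel for what "under the hood" means here, the following is a minimal sketch of a reduce-side join written for Hadoop Streaming in plain Python. It is not the code from the linked repository. The function names, the source tags "A" and "B", and the comma-separated "key,value" input format are all assumptions made for illustration. The only Hadoop behavior it relies on is that streaming passes tab-separated lines over stdin/stdout and sorts mapper output by key before the reducer sees it.

```python
from itertools import groupby


def map_lines(lines, source_tag):
    """Mapper sketch: turn 'key,value' lines into 'key<TAB>tag<TAB>value'.

    source_tag (hypothetical: "A" or "B") records which data set a line
    came from so the reducer can tell the two sides of the join apart.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(",")
        yield f"{key}\t{source_tag}\t{value}"


def reduce_lines(sorted_lines):
    """Reducer sketch: inner-join "A" and "B" records that share a key.

    Assumes the input is already sorted by key, as it would be after
    the shuffle phase of a streaming job.
    """
    parsed = (line.rstrip("\n").split("\t", 2) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda rec: rec[0]):
        left, right = [], []
        for _, tag, value in group:
            (left if tag == "A" else right).append(value)
        for a in left:
            for b in right:
                yield f"{key}\t{a}\t{b}"


# Local simulation with made-up data; sorted() stands in for the shuffle.
users = list(map_lines(["1,alice", "2,bob"], "A"))
orders = list(map_lines(["1,book", "1,pen", "3,cup"], "B"))
for record in reduce_lines(sorted(users + orders)):
    print(record)
```

In a real streaming job each function would read from sys.stdin and print its results, with one script passed as the mapper and the other as the reducer. Keeping the logic in generator functions makes it easy to try on a small sample locally before running it on a cluster, which suits the exploratory use described above.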

Interesting post and good tips on data exploration. You can’t really query or process the unknown.

Suggestions of other data exploration examples? (Not so much processing the known but looking to “learn” about data sources.)
