Introducing DataFu: an open source collection of useful Apache Pig UDFs
From the post:
At LinkedIn, we make extensive use of Apache Pig for performing data analysis on Hadoop. Pig is a simple, high-level programming language that consists of just a few dozen operators and makes it easy to write MapReduce jobs. For more advanced tasks, Pig also supports User Defined Functions (UDFs), which let you integrate custom code in Java, Python, and JavaScript into your Pig scripts.
Over time, as we worked on data intensive products such as People You May Know and Skills, we developed a large number of UDFs at LinkedIn. Today, I’m happy to announce that we have consolidated these UDFs into a single, general-purpose library called DataFu and we are open sourcing it under the Apache 2.0 license:
DataFu includes UDFs for common statistics tasks, PageRank, set operations, bag operations, and a comprehensive suite of tests. Read on to learn more.
This is way cool!
Read the rest of Matthew’s post (link above) or get thee to GitHub!