Archive for the ‘Pig’ Category
Thursday, March 7th, 2013
Million Song Dataset in Minutes! (Video)
Actually 5:35 as per the video.
The summary of the video reads:
Created Web Project [zero install]
Loaded data from S3
Developed in Pig and Python [watch for the drop down menus of pig fragments]
ILLUSTRATE’d our work [perhaps the most impressive feature, tests code against sample of data]
Ran on Hadoop [drop downs to create a cluster]
Downloaded results [50 "densest songs", see the video]
It’s not all “hands free” or without intellectual effort on your part.
But, a major step towards a generally accessible interface for Hadoop/MapReduce data processing.
Posted in Hadoop, MapReduce, Mortar, Pig, Python | No Comments »
Thursday, March 7th, 2013
MortarData2013
Mortar has its own YouTube channel!
Unlike the History Channel, the MortarData2013 channel is educational and entertaining.
I leave it to you to guess whether those two adjectives apply to the History Channel. (Hint: Thirty (30) minutes of any Vikings episode should help you answer.)
Not a lot of content at the moment, but I am going to cover one of the videos that is there in a separate post.
Posted in Hadoop, MapReduce, Mortar, Pig | No Comments »
Friday, March 1st, 2013
Pig Eye for the SQL Guy by Cat Miller.
From the post:
For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.
As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, often the amounts of data are so large that working with them on a traditional SQL database is so non-performant as to be impractical.
Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)
This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.
Do you speak SQL?
Want to learn to speak Pig?
This is the right post for you!
Posted in Hadoop, MapReduce, Pig, SQL | No Comments »
Wednesday, February 27th, 2013
Apache Pig: It goes to 0.11
From the post:
After months of work, we are happy to announce the 0.11 release of Apache Pig. In this blog post, we highlight some of the major new features and performance improvements that were contributed to this release. A large chunk of the new features was created by Google Summer of Code (GSoC) students with supervision from the Apache Pig PMC, while the core Pig team focused on performance improvements, usability issues, and bug fixes. We encourage CS students to consider applying for GSOC in 2013 — it’s a great way to contribute to open source software.
This blog post hits some of the highlights of the release. Pig users may also find a presentation by Daniel Dai, which includes code and output samples for the new operators, helpful.
And from Hortonworks’ post on the release:
- A DateTime datatype, documentation here.
- A RANK function, documentation here.
- A CUBE operator, documentation here.
- Groovy UDFs, documentation here.
If you remember Robert Barta’s Cartesian expansion of tuples, you will find it in the CUBE operator.
Posted in Hadoop, MapReduce, Pig | No Comments »
Saturday, February 16th, 2013
Pig, ToJson, and Redis to publish data with Flask by Russell Jurney.
From the post:
Pig can easily stuff Redis full of data. To do so, we’ll need to convert our data to JSON. We’ve previously talked about pig-to-json in JSONize anything in Pig with ToJson. Once we convert our data to json, we can use the pig-redis project to load redis.
What do you think?
Something “lite” to test a URI dictionary locally?
Posted in JSON, Pig, Redis | No Comments »
Saturday, February 16th, 2013
Working with Pig by Dan Morrill. (video)
From the description:
Pig is a SQL like command language for use with Hadoop, we review a simple PIG script line by line to help you understand how pig works, and regular expressions to help parse data. If you want a copy of the slide presentation – they are over on slide share http://www.slideshare.net/rmorrill.
Very good intro to PIG!
Mentions a couple of resources you need to bookmark:
Input Validation Cheat Sheet (The Open Web Application Security Project – OWASP) – regexes to re-use in Pig scripts. Lots of other regex cheat sheet pointers. (Being mindful that “\” must be escaped in Pig.)
Regular-Expressions.info A more general resource on regexes.
I first saw this at: This Quick Pig Overview Brings You Up to Speed Line by Line.
Posted in Pig, Regex, Regexes | No Comments »
Wednesday, February 13th, 2013
Imperative and Declarative Hadoop: TPC-H in Pig and Hive by Russell Jurney.
From the post:
According to the Transaction Processing Council, TPC-H is:
The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.
TPC-H was implemented for Hive in HIVE-600 and for Pig in PIG-2397 by Hortonworks intern Jie Li. In going over this work, I was struck by how it outlined differences between Pig and SQL.
There seems to be tendency for simple SQL to provide greater clarity than Pig. At some point as the TPC-H queries become more demanding, complex SQL seems to have less clarity than the comparable Pig. Lets take a look.
(emphasis in original)
A refresher on the lesson that the solution you need, in this case Hive or Pig, depends upon your requirements.
Use either one blindly at the risk of poor performance or failing to meet other requirements.
Posted in Hadoop, Hive, MapReduce, Pig, TPC-H | No Comments »
Monday, February 11th, 2013
Flatten entire HBase column families with Pig and Python UDFs by Chase Seibert.
From the post:
Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode two dimensions, such as date and counter type.
How do you solve this mismatch? If you’re in the early stages of designing a schema, you could reconsider a more row based approach. If you have to work with an existing schema, however, you can with the help of Pig UDFs.
Now there’s an ugly problem.
You can split the label from the data as shown, but that doesn’t help when the label/data is still in situ.
Saying: “Don’t do that!” doesn’t help because it is already being done.
If anything, topic maps need to take subjects as they are found, not as we might wish for them to be.
Curious, would you write an identifier as a regex that parses such a mix of label and data, assigning each to further processing?
Suggestions?
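As a starting point for the regex question, here is a minimal sketch in plain Python (not Chase's UDF). The column-name layout `<YYYYMMDD>_<counter type>` and the names `COLUMN_NAME` and `flatten_row` are my own assumptions for illustration; a real schema will need its own pattern.

```python
import re

# Hypothetical layout: "<YYYYMMDD>_<counter type>", e.g. "20130211_clicks".
COLUMN_NAME = re.compile(r"^(?P<date>\d{8})_(?P<counter>[a-z_]+)$")


def flatten_row(row_key, columns):
    """Turn one wide row into (row_key, date, counter, value) records."""
    records = []
    for name, value in sorted(columns.items()):
        match = COLUMN_NAME.match(name)
        if match is None:
            # Names that do not fit the pattern are skipped here;
            # they would need separate handling.
            continue
        records.append(
            (row_key, match.group("date"), match.group("counter"), value)
        )
    return records
```

The named groups keep the label part and the data part separate, so each can be routed to further processing on its own.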
I first saw this at Flatten Entire HBase Column Families With Pig and Python UDFs by Alex Popescu.
Posted in HBase, Pig, Python | No Comments »
Thursday, February 7th, 2013
A Quick Guide to Hadoop Map-Reduce Frameworks by Alex Popescu.
Alex has assembled links to guides to MapReduce frameworks:
Thanks Alex!
Posted in Hadoop, Hive, MapReduce, Pig, Python, Scalding, Scoobi, Scrunch, Spark | No Comments »
Friday, February 1st, 2013
Topic Discovery With Apache Pig and Mallet
Only one of two posts from this blog in 2012 but it is a useful one.
From the post:
A common desire when working with natural language is topic discovery. That is, given a set of documents (eg. tweets, blog posts, emails) you would like to discover the topics inherent in those documents. Often this method is used to summarize a large corpus of text so it can be quickly understood what that text is ‘about’. You can go further and use topic discovery as a way to classify new documents or to group and organize the documents you’ve done topic discovery on.
Walks through the use of Pig and Mallet on a newsgroup data set.
I have been thinking about getting one of those unlimited download newsgroup accounts.
Maybe I need to go ahead and start building some newsgroup data sets.
Posted in Latent Dirichlet Allocation (LDA), MALLET, Pig | No Comments »
Saturday, January 26th, 2013
DataFu: The WD-40 of Big Data by Sam Shah.
From the post:
If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something.
No, seriously, DataFu is a collection of Pig UDFs for data analysis on Hadoop. DataFu includes routines for common statistics tasks (e.g., median, variance), PageRank, set operations, and bag operations.
It’s helpful to understand the history of the library. Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.” The unfortunate part, and this is true of many such efforts, is that the UDFs were ill-documented, ill-organized, and easily got broken when someone made a change. Along came PigUnit, which allowed UDF testing, so we spent the time to clean up these routines by adding documentation and rigorous unit tests. From this “datafoo” package, we thought this would help the community at large, and there you have DataFu.
So what can this library do for you? Let’s look at one of the classical examples that showcase the power and flexibility of Pig: sessionizing a click stream.
DataFu
The UDF bag and set operations are likely to be of particular interest.
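If sessionizing is new to you, the idea can be sketched in a few lines of plain Python. This is not the DataFu UDF or its API, only an illustration: the function name `sessionize` and the ten-minute default gap are my assumptions.

```python
def sessionize(timestamps, gap_seconds=600):
    """Group event timestamps (in seconds) into sessions.

    A new session starts whenever the gap since the previous
    event is larger than gap_seconds.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_seconds:
            # Close enough to the previous event: same session.
            sessions[-1].append(ts)
        else:
            # First event, or the gap was too long: start a new session.
            sessions.append([ts])
    return sessions
```

In Pig the same logic would run once per user after grouping the clicks by user id.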
Posted in DataFu, Hadoop, MapReduce, Pig | No Comments »
Wednesday, January 16th, 2013
Packetpig Finding Zero Day Attacks by Michael Baker.
From the post:
When Russell Jurney and I first teamed up to write these posts we wanted to do something that no one had done before to demonstrate the power of Big Data, the simplicity of Pig and the kind of Big Data Security Analytics we perform at Packetloop. Packetpig was modified to support Amazon’s Elastic Map Reduce (EMR) so that we could process a 600GB set of full packet captures. All that we needed was a canonical Zero Day attack to analyse. We were in luck!
In August 2012 a vulnerability in Oracle JRE 1.7 created huge publicity when it was disclosed that a number of Zero Day attacks had been reported to Oracle in April but had still not been addressed in late August 2012. To make matters worse Oracle’s scheduled patch for JRE was months away (October 16). This position subsequently changed and a number of out-of-band patches for JRE were released for what became known as CVE-2012-4681 on the 30th of August.
The vulnerability exposed around 1 Billion systems to exploitation and the exploit was 100% effective on Windows, Mac OSX and Linux. A number of security researchers were already seeing the exploit in the wild as it was incorporated into exploit packs for the delivery of malware.
Interesting tool for packet analysis as well as insight on using Amazon’s EMR to process 600 GB of packets.
Packetpig could be an interesting source of data for creating maps or adding content to maps, based on packet traffic content.
Posted in Pig, Security | No Comments »
Sunday, January 13th, 2013
Apache Pig 0.10.1 Released by Daniel Dai.
From the post:
We are pleased to announce that Apache Pig 0.10.1 was recently released. This is primarily a maintenance release focused on stability and bug fixes. In fact, Pig 0.10.1 includes 42 new JIRA fixes since the Pig 0.10.0 release.
Time to update your Pig installation!
Posted in Hadoop, Pig | 1 Comment »
Saturday, January 5th, 2013
Apache Crunch: A Java Library for Easier MapReduce Programming by Josh Wills.
From the post:
Apache Crunch (incubating) is a Java library for creating MapReduce pipelines that is based on Google’s FlumeJava library. Like other high-level tools for creating MapReduce jobs, such as Apache Hive, Apache Pig, and Cascading, Crunch provides a library of patterns to implement common tasks like joining data, performing aggregations, and sorting records. Unlike those other tools, Crunch does not impose a single data type that all of its inputs must conform to. Instead, Crunch uses a customizable type system that is flexible enough to work directly with complex data such as time series, HDF5 files, Apache HBase tables, and serialized objects like protocol buffers or Avro records.
Crunch does not try to discourage developers from thinking in MapReduce, but it does try to make thinking in MapReduce easier to do. MapReduce, for all of its virtues, is the wrong level of abstraction for many problems: most interesting computations are made up of multiple MapReduce jobs, and it is often the case that we need to compose logically independent operations (e.g., data filtering, data projection, data transformation) into a single physical MapReduce job for performance reasons.
Essentially, Crunch is designed to be a thin veneer on top of MapReduce — with the intention being not to diminish MapReduce’s power (or the developer’s access to the MapReduce APIs) but rather to make it easy to work at the right level of abstraction for the problem at hand.
Although Crunch is reminiscent of the venerable Cascading API, their respective data models are very different: one simple common-sense summary would be that folks who think about problems as data flows prefer Crunch and Pig, and people who think in terms of SQL-style joins prefer Cascading and Hive.
Brief overview of Crunch and an example (word count) application.
Definitely a candidate for your “big data” tool belt.
Posted in Cascading, Hive, MapReduce, Pig | No Comments »
Friday, October 19th, 2012
What’s New in CDH4.1 Pig by Cheolsoo Park.
From the post:
Apache Pig is a platform for analyzing large data sets that provides a high-level language called Pig Latin. Pig users can write complex data analysis programs in an intuitive and compact manner using Pig Latin.
Among many other enhancements, CDH4.1, the newest release of Cloudera’s open-source Hadoop distro, upgrades Pig from version 0.9 to version 0.10. This post provides a summary of the top seven new features introduced in CDH4.1 Pig.
Cheolsoo covers these new features:
- Boolean Data Type
- Nested FOREACH and CROSS
- Ruby UDFs
- LIMIT / SPLIT by Expression
- Default SPLIT Destination
- Syntactical Sugar for TOTUPLE, TOBAG, and TOMAP
- AvroStorage Improvements
Enjoy!
Posted in Cloudera, Hadoop, Pig | No Comments »
Saturday, October 6th, 2012
Applying Parallel Prediction to Big Data by Dan McClary (Principal Product Manager for Big Data and Hadoop at Oracle).
From the post:
One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them. Moreover, how can we leverage the power of Apache Hadoop to effectively operationalize the model-building process? In this post we’re going to take a look at a simple approach for applying well-known machine learning approaches to our big datasets. We’ll use Pig and Hadoop to quickly parallelize a standalone machine-learning program written in Jython.
Playing Weatherman
I’d like to predict the weather. Heck, we all would – there’s personal and business value in knowing the likelihood of sun, rain, or snow. Do I need an umbrella? Can I sell more umbrellas? Better yet, groups like the National Climatic Data Center offer public access to weather data stretching back to the 1930s. I’ve got a question I want to answer and some big data with which to do it. On first reaction, because I want to do machine learning on data stored in HDFS, I might be tempted to reach for a massively scalable machine learning library like Mahout.
For the problem at hand, that may be overkill and we can get it solved in an easier way, without understanding Mahout. Something becomes apparent on thinking about the problem: I don’t want my climate model for San Francisco to include the weather data from Providence, RI. Weather is a local problem and we want to model it locally. Therefore what we need is many models across different subsets of data. For the purpose of example, I’d like to model the weather on a state-by-state basis. But if I have to build 50 models sequentially, tomorrow’s weather will have happened before I’ve got a national forecast. Fortunately, this is an area where Pig shines.
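The "many models across different subsets of data" pattern can be sketched in plain Python. This is not Dan's Jython program: the function name `fit_models_by_key` is hypothetical, and the "model" is only a per-state mean standing in for a real learner.

```python
from collections import defaultdict
from statistics import mean


def fit_models_by_key(records):
    """Build one stand-in model per state.

    records: iterable of (state, temperature) pairs.
    Returns a dict of state -> fitted "model" (here, just the mean).
    """
    grouped = defaultdict(list)
    for state, temperature in records:
        grouped[state].append(temperature)
    # Each group is independent of the others, which is what lets a
    # system like Pig hand different groups to different workers.
    return {state: mean(values) for state, values in grouped.items()}
```

Because no group depends on another, the fifty state models do not have to be built one after the other.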
Two quick observations:
First, Dan makes my point about your needing the “right” data, which may or may not be the same thing as “big data.” Decide what you want to do before you reach for big iron and data.
Second, I never hear references to the “weatherman” without remembering: “you don’t need to be a weatherman to know which way the wind blows.” (link to the manifesto) If you prefer a softer version, Subterranean Homesick Blues by Bob Dylan.
Posted in Hadoop, Mahout, Oracle, Pig, Weather Data, Weka | No Comments »
Wednesday, October 3rd, 2012
CDH4.1 Now Released! by Charles Zedlewski.
From the post:
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:
- Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
- Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
- Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
- Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
- FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink as well as metrics for monitoring as well as a number of performance improvements.
- Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
- Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.
It’s releases like this that make me wish I spent more time writing documentation for software, so I could try out all the cool features with no real goal other than trying them out.
Enjoy!
Posted in Cloudera, Flume, HBase, HDFS, Hadoop, Hive, Pig | No Comments »
Monday, October 1st, 2012
Pig Macro for TF-IDF Makes Topic Summarization 2 Lines of Pig by Russell Jurney.
From the post:
In a recent post we used Pig to summarize documents via the Term-Frequency, Inverse Document Frequency (TF-IDF) algorithm.
In this post, we’re going to turn that code into a Pig macro that can be called in one line of code:
Any Pig macros in your trick bag?
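For readers who have not met TF-IDF, here is a minimal sketch of the scoring in plain Python. It is not Russell's Pig macro; the function name `tf_idf` and the choice of an unsmoothed natural-log IDF are my assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score terms per document.

    documents: dict of doc id -> list of tokens.
    Returns a dict of doc id -> {term: tf-idf score}.
    """
    doc_count = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            # Term frequency times inverse document frequency.
            term: (count / total) * math.log(doc_count / doc_freq[term])
            for term, count in counts.items()
        }
    return scores
```

A term that appears in every document scores zero, which is why the highest-scoring terms make a reasonable topic summary.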
Posted in Pig, TF-IDF | No Comments »
Thursday, September 27th, 2012
JSONize Anything in Pig with ToJson by Russell Jurney.
The critical bit reads:
That is precisely what the ToJson method of pig-to-json does. It takes a bag or tuple or nested combination thereof and returns a JSON string.
See Russell’s post for the details.
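To make "nested combination thereof" concrete, here is a rough Python analogy using the standard `json` module. It is not pig-to-json and makes no claim about that project's output format; the function name `to_json` and the sample record are my own.

```python
import json


def to_json(record):
    """Serialize a nested structure of tuples and lists to a JSON string.

    json.dumps writes tuples and lists alike as JSON arrays.
    """
    return json.dumps(record)


# A tuple holding a name and a "bag" of (address, count) tuples.
example = to_json(("alice", [("a@x.com", 2), ("b@x.com", 1)]))
```

Once the data is a string, it can be stored in any key/value system without that system needing to know about tuples or bags.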
Posted in JSON, Pig | No Comments »
Sunday, September 23rd, 2012
Pig Out to Hadoop with Alan Gates (Link to the webinar page at Hortonworks. Scroll down for this webinar. You have to register/login to view.)
From the description:
Pig has added some exciting new features in 0.10, including a boolean type, UDFs in JRuby, load and store functions for JSON, bloom filters, and performance improvements. Join Alan Gates, Hortonworks co-founder and long-time contributor to the Apache Pig and HCatalog projects, to discuss these new features, as well as talk about work the project is planning to do in the near future. In particular, we will cover how Pig can take advantage of changes in Hadoop 0.23.
I should have been watching more closely for this webinar recording to get posted.
Not only is it a great webinar on Pig, but it will restore your faith in webinars as a means of content delivery.
I have suffered through several lately where introductions took more time than actual technical content of the webinar.
Hard to know until you have already registered and spent time expecting substantive content.
Is there a public tally board for webinars on search, semantics, big data, etc.?
Posted in Hadoop, Hortonworks, Pig | 1 Comment »
Thursday, September 20th, 2012
HCatalog Meetup at Twitter by Russell Jurney.
From the post:
Representatives from Twitter, Yahoo, LinkedIn, Hortonworks and IBM met at Twitter HQ on Thursday to talk HCatalog. Committers from HCatalog, Pig and Hive were on hand to discuss the state of HCatalog and its future.
Apache HCatalog is a table and storage management service for data created using Apache Hadoop.
See Russell’s post for more details.
Then brush up on HCatalog (if you aren’t already following it).
Posted in HCatalog, Hadoop, Pig | No Comments »
Thursday, September 20th, 2012
Pig as Duct Tape, Part Three: TF-IDF Topics with Cassandra, Python Streaming and Flask by Russell Jurney.
From the post:
Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.
But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems to enable you to process data from wherever and to wherever you like.
Working code for this post as well as setup instructions for the tools we use and their environment variables are available at https://github.com/rjurney/enron-python-flask-cassandra-pig and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.
Part one and two can get you started using Pig if you’re not familiar.
With this post in the series, “duct tape” made it into the title.
In case you don’t know (I didn’t), Flask is a “lightweight web application framework in Python.”
Just once I would like to see a “heavyweight, cumbersome, limited and annoying web application framework in (insert language of your choice).”
Just for variety.
Rather than characterizing software, say what it does.
Sorry, I have been converting one of the most poorly edited documents I have ever seen into a csv file. Proofing will follow the conversion process but hope to finish that by the end of next week.
Posted in Cassandra, Pig | No Comments »
Tuesday, September 11th, 2012
Analyzing Big Data with Twitter
Not really with Twitter but with tools sponsored/developed/used by Twitter. Lecture series at the UC Berkeley School of Information.
Videos of lectures are posted online.
Check out the syllabus for assignments and current content.
Four (4) lectures so far!
- Big Data Analytics with Twitter – Marti Hearst & Gilad Mishne. Introduction to Twitter in general.
- Twitter Philosophy and Software Architecture – Othman Laraki & Raffi Krikorian.
- Introduction to Hadoop – Bill Graham.
- Apache Pig – Jon Coveney
… more to follow.
Posted in BigData, CS Lectures, MapReduce, Pig | No Comments »
Thursday, September 6th, 2012
Meet the Committer, Part One: Alan Gates by Kim Truong.
From the post:
Series Introduction
Hortonworks is on a mission to accelerate the development and adoption of Apache Hadoop. Through engineering open source Hadoop, our efforts with our distribution, Hortonworks Data Platform (HDP), a 100% open source data management platform, and partnerships with the likes of Microsoft, Teradata, Talend and others, we will accomplish this, one installation at a time.
What makes this mission possible is our all-star team of Hadoop committers. In this series, we’re going to profile those committers, to show you the face of Hadoop.
Alan Gates, Apache Pig and HCatalog Committer
Education is a key component of this mission. Helping companies gain a better understanding of the value of Hadoop through transparent communications of the work we’re doing is paramount. In addition to explaining core Hadoop projects (MapReduce and HDFS) we also highlight significant contributions to other ecosystem projects including Apache Ambari, Apache HCatalog, Apache Pig and Apache Zookeeper.
Alan Gates is a leader in our Hadoop education programs. That is why I’m incredibly excited to kick off the next phase of our “Future of Apache Hadoop” webinar series. We’re starting off this segment with 4-webinar series on September 12 with “Pig out to Hadoop” with Alan Gates (twitter:@alanfgates). Alan is an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan is also a member of the Apache Software Foundation and a co-founder of Hortonworks.
My only complaint is that the interview is too short!
Looking forward to the Pig webinar!
Posted in Hadoop, Hortonworks, MapReduce, Pig | No Comments »
Tuesday, September 4th, 2012
Pig Performance and Optimization Analysis by Li Jie.
From the post:
In this post, Hortonworks Intern Li Jie talks about his work this summer on performance analysis and optimization of Apache Pig. Li is a PhD candidate in the Department of Computer Science at Duke University. His research interests are in the area of database systems and big data computing. He is currently working with Associate Professor Shivnath Babu.
If you need to optimize Pig operations, this is a very good starting place.
Be sure to grab a copy of Running TPC-H on Pig by Li Jie, Koichi Ishida, Xuan Wang and Muzhi Zhao, with its “Six Rules of Writing Efficient Pig Scripts.”
Expect to see all three of these authors in DBLP sooner rather than later.
DBLP: Shivnath Babu
Posted in Hortonworks, Pig | No Comments »
Thursday, August 30th, 2012
Recap of the August Pig Hackathon at Hortonworks by Russell Jurney.
From the post:
The August Pig Hackathon brought Pig users from Hortonworks, Yahoo, Cloudera, Visa, Kaiser Permanente, and LinkedIn to Hortonworks HQ in Sunnyvale, CA to talk and work on Apache Pig.
If you weren’t at this hackathon, Russell’s summary and pointers will make you want to attend the next one!
BTW, someone needs to tell Michael Sperberg-McQueen that Pig is being used to build generic DAG structures. Don’t worry, he’ll understand.
Posted in Hortonworks, Pig | No Comments »
Monday, August 27th, 2012
Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra by Russell Jurney.
From the post:
Hadoop is about freedom as much as scale: providing you disk spindles and processor cores together to process your data with whatever tool you choose. Unleash your creativity. Pig as duct tape facilitates this freedom, enabling you to connect distributed systems at scale in minutes, not hours. In this post we’ll demonstrate how you can turn raw data into a web service using Hadoop, Pig, HBase, JRuby and Sinatra. In doing so we will demonstrate yet another way to use Pig as connector to publish data you’ve processed on Hadoop.
When (not if) the next big cache of emails or other “sensitive” documents drops, everyone who has followed this and similar tutorials should be ready.
Posted in HBase, Hadoop, JRuby, Pig | No Comments »
Friday, August 24th, 2012
Process a Million Songs with Apache Pig by Justin Kestelyn.
From the post:
The following is a guest post kindly offered by Adam Kawa, a 26-year old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!
Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, localization (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews at 7digital.com can be easily constructed from the data.
The dataset consists of 339 tab-separated text files. Each file contains about 3,000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218GB, processing it using one machine may take a very long time.
Definitely, a much more interesting and efficient approach is to use multiple machines and process the songs in parallel by taking advantage of open-source tools from the Apache Hadoop ecosystem (e.g. Apache Pig). If you have your own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop), which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily by typing a couple of simple commands) or automatically using Cloudera Manager Free Edition (which is Cloudera’s recommended approach). Both CDH and Cloudera Manager are freely downloadable here. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description written by Paul Lemere how to use it and pay as low as $1, and here is my presentation about Elastic MapReduce given at the second meeting of Warsaw Hadoop User Group).
An example of offering the reader their choice of implementation detail, on or off a cloud.
Suspect that is going to become increasingly common.
Posted in Amazon Web Services AWS, Cloudera, Data Mining, Hadoop, Pig | No Comments »
Thursday, August 16th, 2012
Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js by Russell Jurney.
From the post:
Series Introduction
Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.
But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.
Working code for this post as well as setup instructions for the tools we use are available at https://github.com/rjurney/enron-node-mongo and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.
Introduction
In this post we’ll be using Hadoop, Pig, mongo-hadoop, MongoDB and Node.js to turn Avro records into a web service. We do so to illustrate Pig’s ability to act as glue between distributed systems, and to show how easy it is to publish data from Hadoop to the web.
I was tempted to add ‘duct tape’ as a category. But there could only be one entry.
Take an early weekend and have some fun with this tomorrow. August will be over sooner than you think.
Posted in Hadoop, MongoDB, Pig, node-js | No Comments »