February 1, 2013
Topic Discovery With Apache Pig and Mallet
One of only two posts from this blog in 2012, but it is a useful one.
From the post:
A common desire when working with natural language is topic discovery. That is, given a set of documents (e.g., tweets, blog posts, emails) you would like to discover the topics inherent in those documents. Often this method is used to summarize a large corpus of text so it can be quickly understood what that text is ‘about’. You can go further and use topic discovery as a way to classify new documents or to group and organize the documents you’ve done topic discovery on.
Walks through the use of Pig and Mallet on a newsgroup data set.
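The walkthrough's own code does the heavy lifting, but the general shape is easy to sketch in Pig. Everything below is illustrative only: the jar and the LDATopics UDF are stand-ins for whatever Mallet wrapper the post actually uses.

    REGISTER 'mallet-udfs.jar';                      -- hypothetical jar wrapping Mallet
    DEFINE LDATopics com.example.LDATopics('20');    -- hypothetical UDF, 20 topics

    docs   = LOAD 'newsgroups' AS (doc_id:chararray, text:chararray);
    tokens = FOREACH docs GENERATE doc_id, TOKENIZE(text) AS words;
    corpus = GROUP tokens ALL;                       -- hand the whole corpus to the topic model
    topics = FOREACH corpus GENERATE FLATTEN(LDATopics(tokens));
    STORE topics INTO 'topics_out';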
I have been thinking about getting one of those unlimited download newsgroup accounts.
Maybe I need to go ahead and start building some newsgroup data sets.
Comments Off on Topic Discovery With Apache Pig and Mallet
January 26, 2013
DataFu: The WD-40 of Big Data by Sam Shah.
From the post:
If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something.
No, seriously, DataFu is a collection of Pig UDFs for data analysis on Hadoop. DataFu includes routines for common statistics tasks (e.g., median, variance), PageRank, set operations, and bag operations.
It’s helpful to understand the history of the library. Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.” The unfortunate part, and this is true of many such efforts, is that the UDFs were ill-documented, ill-organized, and easily got broken when someone made a change. Along came PigUnit, which allowed UDF testing, so we spent the time to clean up these routines by adding documentation and rigorous unit tests. From this “datafoo” package, we thought this would help the community at large, and there you have DataFu.
So what can this library do for you? Let’s look at one of the classical examples that showcase the power and flexibility of Pig: sessionizing a click stream.
DataFu
The UDF bag and set operations are likely to be of particular interest.
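For the curious, DataFu's sessionization boils down to a DEFINE and a nested FOREACH. This is a minimal sketch from memory, assuming the 30-minute-window constructor and a timestamp as the first field of each tuple; check the DataFu docs for the exact field ordering.

    REGISTER 'datafu.jar';
    DEFINE Sessionize datafu.pig.sessions.Sessionize('30m');

    clicks   = LOAD 'clicks' AS (ts:chararray, user_id:chararray, url:chararray);
    by_user  = GROUP clicks BY user_id;
    sessions = FOREACH by_user {
        ordered = ORDER clicks BY ts;
        -- Sessionize appends a session id to each click within a 30-minute window
        GENERATE FLATTEN(Sessionize(ordered)) AS (ts, user_id, url, session_id);
    };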
Comments Off on DataFu: The WD-40 of Big Data
January 16, 2013
Packetpig Finding Zero Day Attacks by Michael Baker.
From the post:
When Russell Jurney and I first teamed up to write these posts we wanted to do something that no one had done before to demonstrate the power of Big Data, the simplicity of Pig and the kind of Big Data Security Analytics we perform at Packetloop. Packetpig was modified to support Amazon’s Elastic Map Reduce (EMR) so that we could process a 600GB set of full packet captures. All that we needed was a canonical Zero Day attack to analyse. We were in luck!
In August 2012 a vulnerability in Oracle JRE 1.7 created huge publicity when it was disclosed that a number of Zero Day attacks had been reported to Oracle in April but had still not been addressed in late August 2012. To make matters worse, Oracle’s scheduled patch for JRE was months away (October 16). This position subsequently changed and a number of out-of-band patches for JRE were released for what became known as CVE-2012-4681 on the 30th of August.
The vulnerability exposed around 1 billion systems to exploitation and the exploit was 100% effective on Windows, Mac OS X and Linux. A number of security researchers were already seeing the exploit in the wild as it was incorporated into exploit packs for the delivery of malware.
Interesting tool for packet analysis as well as insight on using Amazon’s EMR to process 600 GB of packets.
Packetpig could be an interesting source of data for creating maps or adding content to maps, based on packet traffic content.
Comments Off on Packetpig Finding Zero Day Attacks
January 13, 2013
Apache Pig 0.10.1 Released by Daniel Dai.
From the post:
We are pleased to announce that Apache Pig 0.10.1 was recently released. This is primarily a maintenance release focused on stability and bug fixes. In fact, Pig 0.10.1 includes 42 new JIRA fixes since the Pig 0.10.0 release.
Time to update your Pig installation!
January 5, 2013
Apache Crunch: A Java Library for Easier MapReduce Programming by Josh Wills.
From the post:
Apache Crunch (incubating) is a Java library for creating MapReduce pipelines that is based on Google’s FlumeJava library. Like other high-level tools for creating MapReduce jobs, such as Apache Hive, Apache Pig, and Cascading, Crunch provides a library of patterns to implement common tasks like joining data, performing aggregations, and sorting records. Unlike those other tools, Crunch does not impose a single data type that all of its inputs must conform to. Instead, Crunch uses a customizable type system that is flexible enough to work directly with complex data such as time series, HDF5 files, Apache HBase tables, and serialized objects like protocol buffers or Avro records.
Crunch does not try to discourage developers from thinking in MapReduce, but it does try to make thinking in MapReduce easier to do. MapReduce, for all of its virtues, is the wrong level of abstraction for many problems: most interesting computations are made up of multiple MapReduce jobs, and it is often the case that we need to compose logically independent operations (e.g., data filtering, data projection, data transformation) into a single physical MapReduce job for performance reasons.
Essentially, Crunch is designed to be a thin veneer on top of MapReduce — with the intention being not to diminish MapReduce’s power (or the developer’s access to the MapReduce APIs) but rather to make it easy to work at the right level of abstraction for the problem at hand.
Although Crunch is reminiscent of the venerable Cascading API, their respective data models are very different: one simple common-sense summary would be that folks who think about problems as data flows prefer Crunch and Pig, and people who think in terms of SQL-style joins prefer Cascading and Hive.
Brief overview of Crunch and an example (word count) application.
Definitely a candidate for your “big data” tool belt.
Comments Off on Apache Crunch
October 26, 2012
Information Diffusion on Twitter by @snikolov by Marti Hearst.
From the post:
Today Stan Nikolov, who just finished his masters at MIT in studying information diffusion networks, walked us through one particular theoretical model of information diffusion which tries to predict under what conditions an idea stops spreading based on a network’s structure (from the popular Easley and Kleinberg Network book). Stan also gathered a huge amount of Twitter data, processed it using Pig scripts, and graphed the results using Gephi. The video lecture below shows you some great visualizations of the spreading behavior of the data!
(video omitted)
The slides in his Lecture Notes let you see the Pig scripts in more detail.
Another deeply awesome lecture from Marti’s class on Twitter and big data.
Also an example of the level of analysis that a Twitter stream will need to withstand to avoid “imperial entanglements.”
Comments Off on Information Diffusion on Twitter by @snikolov
October 19, 2012
What’s New in CDH4.1 Pig by Cheolsoo Park.
From the post:
Apache Pig is a platform for analyzing large data sets that provides a high-level language called Pig Latin. Pig users can write complex data analysis programs in an intuitive and compact manner using Pig Latin.
Among many other enhancements, CDH4.1, the newest release of Cloudera’s open-source Hadoop distro, upgrades Pig from version 0.9 to version 0.10. This post provides a summary of the top seven new features introduced in CDH4.1 Pig.
Cheolsoo covers these new features (a short sketch of two of them follows the list):
- Boolean Data Type
- Nested FOREACH and CROSS
- Ruby UDFs
- LIMIT / SPLIT by Expression
- Default SPLIT Destination
- Syntactical Sugar for TOTUPLE, TOBAG, and TOMAP
- AvroStorage Improvements
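The promised sketch, with made-up relations, of two of the features above: the boolean data type and LIMIT by expression.

    users  = LOAD 'users' AS (name:chararray, age:int, active:boolean);
    adults = FILTER users BY active AND age >= 18;      -- boolean field used directly

    grouped = GROUP users ALL;
    counts  = FOREACH grouped GENERATE COUNT(users) AS total;
    sample  = LIMIT users counts.total / 100;           -- LIMIT by expression: keep roughly 1%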
Enjoy!
Comments Off on What’s New in CDH4.1 Pig
October 6, 2012
Applying Parallel Prediction to Big Data by Dan McClary (Principal Product Manager for Big Data and Hadoop at Oracle).
From the post:
One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them. Moreover, how can we leverage the power of Apache Hadoop to effectively operationalize the model-building process? In this post we’re going to take a look at a simple approach for applying well-known machine learning approaches to our big datasets. We’ll use Pig and Hadoop to quickly parallelize a standalone machine-learning program written in Jython.
Playing Weatherman
I’d like to predict the weather. Heck, we all would – there’s personal and business value in knowing the likelihood of sun, rain, or snow. Do I need an umbrella? Can I sell more umbrellas? Better yet, groups like the National Climatic Data Center offer public access to weather data stretching back to the 1930s. I’ve got a question I want to answer and some big data with which to do it. On first reaction, because I want to do machine learning on data stored in HDFS, I might be tempted to reach for a massively scalable machine learning library like Mahout.
For the problem at hand, that may be overkill and we can get it solved in an easier way, without understanding Mahout. Something becomes apparent on thinking about the problem: I don’t want my climate model for San Francisco to include the weather data from Providence, RI. Weather is a local problem and we want to model it locally. Therefore what we need is many models across different subsets of data. For the purpose of example, I’d like to model the weather on a state-by-state basis. But if I have to build 50 models sequentially, tomorrow’s weather will have happened before I’ve got a national forecast. Fortunately, this is an area where Pig shines.
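The pattern Dan describes fits in a few lines of Pig: group by state and hand each state's bag to a model-building UDF so the fifty models train in parallel. The Jython script and field names below are placeholders, not Dan's code.

    REGISTER 'build_model.py' USING jython AS models;   -- hypothetical Jython UDF script

    obs      = LOAD 'weather' AS (state:chararray, day:chararray, temp:double, precip:double);
    by_state = GROUP obs BY state;                      -- one group (and one model) per state
    fits     = FOREACH by_state GENERATE group AS state, models.build_model(obs);
    STORE fits INTO 'state_models';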
Two quick observations:
First, Dan makes my point about your needing the “right” data, which may or may not be the same thing as “big data.” Decide what you want to do before you reach for big iron and data.
Second, I never hear references to the “weatherman” without remembering: “you don’t need to be a weatherman to know which way the wind blows.” (link to the manifesto) If you prefer a softer version, Subterranean Homesick Blues by Bob Dylan.
Comments Off on Applying Parallel Prediction to Big Data
October 3, 2012
CDH4.1 Now Released! by Charles Zedlewski.
From the post:
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:
- Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
- Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
- Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
- Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
- FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink, metrics for monitoring, and a number of performance improvements.
- Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
- Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.
It’s releases like this that make me wish I spent more time writing documentation for software, so I could try out all the cool features with no real goal other than trying them out.
Enjoy!
Comments Off on CDH4.1 Now Released!
October 1, 2012
Pig Macro for TF-IDF Makes Topic Summarization 2 Lines of Pig by Russell Jurney.
From the post:
In a recent post we used Pig to summarize documents via the Term-Frequency, Inverse Document Frequency (TF-IDF) algorithm.
In this post, we’re going to turn that code into a Pig macro that can be called in one line of code:
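The excerpt stops short of the code, so purely as an illustration of what a one-line macro call looks like (the macro file, name and arguments below are invented, not Russell's):

    IMPORT 'tfidf.macro';                                -- hypothetical macro file
    docs   = LOAD 'enron.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
    topics = tf_idf(docs, 'message_id', 'body');         -- hypothetical macro name and arguments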
Any Pig macros in your trick bag?
Comments Off on Pig Macro for TF-IDF Makes Topic Summarization 2 Lines of Pig
September 27, 2012
JSONize Anything in Pig with ToJson by Russell Jurney.
The critical bit reads:
That is precisely what the ToJson method of pig-to-json does. It takes a bag or tuple or nested combination thereof and returns a JSON string.
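Roughly, in use (the jar and class path are my guesses — check the pig-to-json README for the exact names):

    REGISTER 'pig-to-json.jar';
    emails  = LOAD 'enron.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
    by_from = GROUP emails BY sender;
    as_json = FOREACH by_from GENERATE group, com.hortonworks.pig.udf.ToJson(emails) AS json:chararray;
    STORE as_json INTO 'emails_json';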
See Russell’s post for the details.
Comments Off on JSONize Anything in Pig with ToJson
September 23, 2012
Pig Out to Hadoop with Alan Gates (Link to the webinar page at Hortonworks. Scroll down for this webinar. You have to register/login to view.)
From the description:
Pig has added some exciting new features in 0.10, including a boolean type, UDFs in JRuby, load and store functions for JSON, bloom filters, and performance improvements. Join Alan Gates, Hortonworks co-founder and long-time contributor to the Apache Pig and HCatalog projects, to discuss these new features, as well as talk about work the project is planning to do in the near future. In particular, we will cover how Pig can take advantage of changes in Hadoop 0.23.
I should have been watching more closely for this webinar recording to get posted.
Not only is it a great webinar on Pig, but it will restore your faith in webinars as a means of content delivery.
I have suffered through several lately where the introductions took more time than the actual technical content of the webinar.
It is hard to know which kind you are getting until you have already registered and spent time expecting substantive content.
Is there a public tally board for webinars on search, semantics, big data, etc.?
September 20, 2012
HCatalog Meetup at Twitter by Russell Jurney.
From the post:
Representatives from Twitter, Yahoo, LinkedIn, Hortonworks and IBM met at Twitter HQ on Thursday to talk HCatalog. Committers from HCatalog, Pig and Hive were on hand to discuss the state of HCatalog and its future.
Apache HCatalog is a table and storage management service for data created using Apache Hadoop.
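For Pig users the attraction is loading and storing tables by name instead of by path and schema. A minimal sketch (table names invented; the loader's package has moved between releases, so check your version):

    emails = LOAD 'enron.emails' USING org.apache.hcatalog.pig.HCatLoader();
    counts = FOREACH (GROUP emails BY sender) GENERATE group AS sender, COUNT(emails) AS n;
    STORE counts INTO 'enron.email_counts' USING org.apache.hcatalog.pig.HCatStorer();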
See Russell’s post for more details.
Then brush up on HCatalog (if you aren’t already following it).
Comments Off on HCatalog Meetup at Twitter
Pig as Duct Tape, Part Three: TF-IDF Topics with Cassandra, Python Streaming and Flask by Russell Jurney.
From the post:
Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.
But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems to enable you to process data from wherever and to wherever you like.
Working code for this post as well as setup instructions for the tools we use and their environment variables are available at https://github.com/rjurney/enron-python-flask-cassandra-pig and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.
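If you want a quick feel for local mode with the sample data, something like this should do it (I am assuming piggybank's AvroStorage and guessing at the schema, so adjust to taste):

    -- run with: pig -x local explore.pig
    REGISTER 'piggybank.jar';                            -- plus the Avro jars it depends on
    emails = LOAD 'enron.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
    sample = LIMIT emails 10;
    DUMP sample;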
Parts one and two can get you started using Pig if you’re not familiar.
With this post in the series, “duct tape” made it into the title.
In case you don’t know (I didn’t), Flask is a “lightweight web application framework in Python.”
Just once I would like to see a “heavyweight, cumbersome, limited and annoying web application framework in (insert language of your choice).”
Just for variety.
Rather than characterizing software, say what it does.
Sorry, I have been converting one of the most poorly edited documents I have ever seen into a CSV file. Proofing will follow the conversion process, but I hope to finish that by the end of next week.
Comments Off on Pig as Duct Tape, Part Three: TF-IDF Topics with Cassandra, Python Streaming and Flask
September 11, 2012
Analyzing Big Data with Twitter
Not really with Twitter, but with tools sponsored, developed, or used by Twitter. A lecture series at the UC Berkeley School of Information.
Videos of lectures are posted online.
Check out the syllabus for assignments and current content.
Four (4) lectures so far!
- Big Data Analytics with Twitter – Marti Hearst & Gilad Mishne. Introduction to Twitter in general.
- Twitter Philosophy and Software Architecture – Othman Laraki & Raffi Krikorian.
- Introduction to Hadoop – Bill Graham.
- Apache Pig – Jon Coveney
… more to follow.
Comments Off on Analyzing Big Data with Twitter
September 6, 2012
Meet the Committer, Part One: Alan Gates by Kim Truong.
From the post:
Series Introduction
Hortonworks is on a mission to accelerate the development and adoption of Apache Hadoop. Through engineering open source Hadoop, our efforts with our distribution, Hortonworks Data Platform (HDP), a 100% open source data management platform, and partnerships with the likes of Microsoft, Teradata, Talend and others, we will accomplish this, one installation at a time.
What makes this mission possible is our all-star team of Hadoop committers. In this series, we’re going to profile those committers, to show you the face of Hadoop.
Alan Gates, Apache Pig and HCatalog Committer
Education is a key component of this mission. Helping companies gain a better understanding of the value of Hadoop through transparent communications of the work we’re doing is paramount. In addition to explaining core Hadoop projects (MapReduce and HDFS) we also highlight significant contributions to other ecosystem projects including Apache Ambari, Apache HCatalog, Apache Pig and Apache Zookeeper.
Alan Gates is a leader in our Hadoop education programs. That is why I’m incredibly excited to kick off the next phase of our “Future of Apache Hadoop” webinar series. We’re starting off this segment with a 4-webinar series on September 12 with “Pig out to Hadoop” with Alan Gates (Twitter: @alanfgates). Alan is an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan is also a member of the Apache Software Foundation and a co-founder of Hortonworks.
My only complaint is that the interview is too short!
Looking forward to the Pig webinar!
Comments Off on Meet the Committer, Part One: Alan Gates
September 4, 2012
Pig Performance and Optimization Analysis by Li Jie.
From the post:
In this post, Hortonworks Intern Li Jie talks about his work this summer on performance analysis and optimization of Apache Pig. Li is a PhD candidate in the Department of Computer Science at Duke University. His research interests are in the area of database systems and big data computing. He is currently working with Associate Professor Shivnath Babu.
If you need to optimize Pig operations, this is a very good starting place.
Be sure to grab a copy of Running TPC-H on Pig by Li Jie, Koichi Ishida, Xuan Wang and Muzhi Zhao, with its “Six Rules of Writing Efficient Pig Scripts.”
Expect to see all of these authors in DBLP sooner rather than later.
DBLP: Shivnath Babu
Comments Off on Pig Performance and Optimization Analysis
August 30, 2012
Recap of the August Pig Hackathon at Hortonworks by Russell Jurney.
From the post:
The August Pig Hackathon brought Pig users from Hortonworks, Yahoo, Cloudera, Visa, Kaiser Permanente, and LinkedIn to Hortonworks HQ in Sunnyvale, CA to talk and work on Apache Pig.
If you weren’t at this hackathon, Russell’s summary and pointers will make you want to attend the next one!
BTW, someone needs to tell Michael Sperberg-McQueen that Pig is being used to build generic DAG structures. Don’t worry, he’ll understand.
Comments Off on Recap of the August Pig Hackathon at Hortonworks
August 27, 2012
Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra by Russell Jurney.
From the post:
Hadoop is about freedom as much as scale: providing you disk spindles and processor cores together to process your data with whatever tool you choose. Unleash your creativity. Pig as duct tape facilitates this freedom, enabling you to connect distributed systems at scale in minutes, not hours. In this post we’ll demonstrate how you can turn raw data into a web service using Hadoop, Pig, HBase, JRuby and Sinatra. In doing so we will demonstrate yet another way to use Pig as connector to publish data you’ve processed on Hadoop.
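The Pig-to-HBase leg of that pipeline is essentially one STORE through HBaseStorage; the first field becomes the row key and the rest map onto the listed columns. Table, column family and field names here are placeholders, not Russell's.

    emails = LOAD 'enron.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
    keyed  = FOREACH emails GENERATE message_id, sender, subject, body;
    STORE keyed INTO 'hbase://enron_emails'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:sender f:subject f:body');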
When (not if) the next big cache of emails or other “sensitive” documents drops, everyone who has followed this and similar tutorials should be ready.
Comments Off on Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra
August 24, 2012
Process a Million Songs with Apache Pig by Justin Kestelyn.
From the post:
The following is a guest post kindly offered by Adam Kawa, a 26-year-old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!
Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, localization (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews at 7digital.com can be easily constructed from the data.
The dataset consists of 339 tab-separated text files. Each file contains about 3,000 songs and each song is represented as one separate line of text. The dataset is publicly available and you can find it at Infochimps or Amazon S3. Since the total size of this data sums up to around 218GB, processing it using one machine may take a very long time.
Definitely, a much more interesting and efficient approach is to use multiple machines and process the songs in parallel by taking advantage of open-source tools from the Apache Hadoop ecosystem (e.g. Apache Pig). If you have your own machines, you can simply use CDH (Cloudera’s Distribution including Apache Hadoop), which includes the complete Apache Hadoop stack. CDH can be installed manually (quickly and easily by typing a couple of simple commands) or automatically using Cloudera Manager Free Edition (which is Cloudera’s recommended approach). Both CDH and Cloudera Manager are freely downloadable here. Alternatively, you may rent some machines from Amazon with Hadoop already installed and process the data using Amazon’s Elastic MapReduce (here is a cool description written by Paul Lemere on how to use it and pay as little as $1, and here is my presentation about Elastic MapReduce given at the second meeting of Warsaw Hadoop User Group).
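To give a flavor of the kind of Pig Adam runs over the dataset, here is a toy query of my own (the real TSV files have many more columns and in a different order):

    songs     = LOAD 'msd/*.tsv' USING PigStorage('\t')
                AS (artist:chararray, title:chararray, duration:double, tempo:double, hotness:double);
    by_artist = GROUP songs BY artist;
    stats     = FOREACH by_artist GENERATE group AS artist,
                AVG(songs.tempo) AS avg_tempo, MAX(songs.hotness) AS max_hotness;
    top       = ORDER stats BY max_hotness DESC;
    top10     = LIMIT top 10;
    DUMP top10;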
An example of offering the reader their choice of implementation detail, on or off a cloud. 😉
Suspect that is going to become increasingly common.
Comments Off on Process a Million Songs with Apache Pig
August 16, 2012
Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js by Russell Jurney.
From the post:
Series Introduction
Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.
But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.
Working code for this post as well as setup instructions for the tools we use are available at https://github.com/rjurney/enron-node-mongo and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local. This enables new Hadoop users to try out Pig without a Hadoop cluster.
Introduction
In this post we’ll be using Hadoop, Pig, mongo-hadoop, MongoDB and Node.js to turn Avro records into a web service. We do so to illustrate Pig’s ability to act as glue between distributed systems, and to show how easy it is to publish data from Hadoop to the web.
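The Pig-to-MongoDB step comes down to a single STORE through mongo-hadoop's Pig support. A sketch only; verify the jar and StoreFunc class against the mongo-hadoop version you are running.

    REGISTER 'mongo-hadoop-pig.jar';                     -- plus the MongoDB Java driver jar
    emails = LOAD 'enron.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
    STORE emails INTO 'mongodb://localhost/enron.emails'
        USING com.mongodb.hadoop.pig.MongoStorage();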
I was tempted to add ‘duct tape’ as a category. But there could only be one entry. 😉
Take an early weekend and have some fun with this tomorrow. August will be over sooner than you think.
Comments Off on Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js
August 5, 2012
More Fun with Hadoop In Action Exercises (Pig and Hive) by Sujit Pal.
From the post:
In my last post, I described a few Java based Hadoop Map-Reduce solutions from the Hadoop in Action (HIA) book. According to the Hadoop Fundamentals I course from Big Data University, part of being a Hadoop practitioner also includes knowing about the many tools that are part of the Hadoop ecosystem. The course briefly touches on the following four tools – Pig, Hive, Jaql and Flume.
Of these, I decided to focus (at least for the time being) on Pig and Hive (for the somewhat stupid reason that the HIA book covers these too). Both of these are high-level DSLs that produce sequences of Map-Reduce jobs. Pig provides a data flow language called PigLatin, and Hive provides a SQL-like language called HiveQL. Both tools provide a REPL shell, and both can be extended with UDFs (User Defined Functions). The reason they coexist in spite of so much overlap is because they are aimed at different users – Pig appears to be aimed at the programmer types and Hive at the analyst types.
The appeal of both Pig and Hive lies in the productivity gains – writing Map-Reduce jobs by hand gives you control, but it takes time to write. Once you master Pig and/or Hive, it is much faster to generate sequences of Map-Reduce jobs. In this post, I will describe three use cases (the first of which comes from the HIA book, and the other two I dreamed up).
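To make the productivity point concrete, a routine "group and count" that would be a full Java MapReduce job is a handful of Pig lines (relation and field names are illustrative):

    logs   = LOAD 'access_log' USING PigStorage(' ') AS (ip:chararray, ts:chararray, url:chararray);
    by_url = GROUP logs BY url;
    hits   = FOREACH by_url GENERATE group AS url, COUNT(logs) AS n;
    top    = ORDER hits BY n DESC;
    STORE top INTO 'top_urls';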
More useful Hadoop exercise examples.
Comments Off on More Fun with Hadoop In Action Exercises (Pig and Hive)
July 11, 2012
Search Data at Scale in Five Minutes with Pig, Wonderdog and ElasticSearch
Russell Jurney continues his posts on searching at scale:
Working code examples for this post (for both Pig 0.10 and ElasticSearch 0.18.6) are available here.
ElasticSearch makes search simple. ElasticSearch is built over Lucene and provides a simple but rich JSON over HTTP query interface to search clusters of one or one hundred machines. You can get started with ElasticSearch in five minutes, and it can scale to support heavy loads in the enterprise. ElasticSearch has a Whirr Recipe, and there is even a Platform-as-a-Service provider, Bonsai.io.
Apache Pig makes Hadoop simple. In a previous post, we prepared the Berkeley Enron Emails in Avro format. The entire dataset is available in Avro format here: https://s3.amazonaws.com/rjurney.public/enron.avro. Let’s check them out:
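The code isn't reproduced in this excerpt, but the heart of it is that indexing from Pig is one STORE through Wonderdog's StoreFunc. Roughly, and from memory (class path and URI options may differ — check the Wonderdog README):

    REGISTER 'wonderdog.jar';                            -- plus the ElasticSearch jars
    emails = LOAD 'enron.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
    STORE emails INTO 'es://enron/email?json=false&size=1000'
        USING com.infochimps.elasticsearch.pig.ElasticSearchStorage();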
Scale is important for some queries but what other factors are important for searches?
Google, after all, is searching at scale. Is that a counter-example to scale being the only measure of search success? Or the best measure?
Or is scale of searching just a starting point?
Where do you go after scale? Scale is easy to evaluate/measure, so whatever your next step, how is it evaluated or measured?
Or is that the reason for emphasis on scale/size? It’s an easy mark (in several senses)?
Comments Off on Search Data at Scale in Five Minutes with Pig, Wonderdog and ElasticSearch
June 28, 2012
Russell Jurney summarizes machine learning using Pig at the Hadoop Summit:
Jimmy Lin’s sold out talk about Large Scale Machine Learning at Twitter (paper available) (slides available) described the use of Pig to train machine learning algorithms at scale using Hadoop. Interestingly, learning was achieved using a Pig UDF StoreFunc (documentation available). Some interesting, related work can be found by Ted Dunning on github (source available).
The emphasis isn’t on innovation per se but in using Pig to create workflows that include machine learning on large data sets.
Read in detail for the Pig techniques (which you can reuse elsewhere) and the machine learning examples.
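The shape of the trick, stripped to its bones: feature vectors flow through an ordinary Pig pipeline and the learning happens inside a custom StoreFunc at the end. The class below is purely hypothetical — it stands in for Twitter's actual trainer.

    DEFINE TrainLR com.example.ml.TrainLogisticRegression();   -- hypothetical learning StoreFunc

    examples = LOAD 'training_data' AS (label:int, features:chararray);
    STORE examples INTO 'models/run1' USING TrainLR();          -- model is trained as tuples are written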
Comments Off on Pig as Teacher
Ambrose
From the project page:
Twitter Ambrose is a platform for visualization and real-time monitoring of MapReduce data workflows. It presents a global view of all the map-reduce jobs derived from your workflow after planning and optimization. As jobs are submitted for execution on your Hadoop cluster, Ambrose updates its visualization to reflect the latest job status, polled from your process.
Ambrose provides the following in a web UI:
- A chord diagram to visualize job dependencies and current state
- A table view of all the associated jobs, along with their current state
- A highlight view of the currently running jobs
- An overall script progress bar
One of the items that Russell Jurney reports on in his summary of the Hadoop Summit 2012.
Limited to Pig at the moment but looks quite useful.
Comments Off on Ambrose
Data Integration Services & Hortonworks Data Platform by Jim Walker
From the post:
What’s possible with all this data?
Data Integration is a key component of the Hadoop solution architecture. It is the first obstacle encountered once your cluster is up and running. Ok, I have a cluster… now what? Do I write a script to move the data? What is the language? Isn’t this just ETL with HDFS as another target? Well, yes…
Sure you can write custom scripts to perform a load, but that is hardly repeatable and not viable in the long term. You could also use Apache Sqoop (available in HDP today), which is a tool to push bulk data from relational stores into HDFS. While effective and great for basic loads, there is work to be done on the connections and transforms necessary in these types of flows. While custom scripts and Sqoop are both viable alternatives, they won’t cover everything and you still need to be a bit technical to be successful.
For wide scale adoption of Apache Hadoop, tools that abstract integration complexity are necessary for the rest of us. Enter Talend Open Studio for Big Data. We have worked with Talend in order to deeply integrate their graphical data integration tools with HDP as well as extend their offering beyond HDFS, Hive, Pig and HBase into HCatalog (metadata service) and Oozie (workflow and job scheduler).
Jim covers four advantages of using Talend:
- Bridge the skills gap
- HCatalog Integration
- Connect to the entire enterprise
- Graphic Pig Script Creation
Definitely something to keep in mind.
Comments Off on Data Integration Services & Hortonworks Data Platform
June 27, 2012
The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce by Russell Jurney.
From the post:
Series Introduction
This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re exploring the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.
- Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.
- Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here: https://github.com/rjurney/enron-hcatalog.
- Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.
Russell continues walking the Enron Emails through a full data lifecycle in the Hadoop ecosystem.
Given the current use and foreseeable use of email, these are important lessons for more than one reason.
What about periodic discovery audits on enterprise email archives?
To see what others may find, or to identify poor wording/disclosure practices?
Comments Off on Booting HCatalog on Elastic MapReduce [periodic discovery audits?]
June 21, 2012
Hortonworks Data Platform v1.0 Download Now Available
From the post:
If you haven’t yet noticed, we have made Hortonworks Data Platform v1.0 available for download from our website. Previously, Hortonworks Data Platform was only available for evaluation for members of the Technology Preview Program or via our Virtual Sandbox (hosted on Amazon Web Services). Moving forward and effective immediately, Hortonworks Data Platform is available to the general public.
Hortonworks Data Platform is a 100% open source data management platform, built on Apache Hadoop. As we have stated on many occasions, we are absolutely committed to the Apache Hadoop community and the Apache development process. As such, all code developed by Hortonworks has been contributed back to the respective Apache projects.
Version 1.0 of Hortonworks Data Platform includes Apache Hadoop-1.0.3, the latest stable line of Hadoop as defined by the Apache Hadoop community. In addition to the core Hadoop components (including MapReduce and HDFS), we have included the latest stable releases of essential projects including HBase 0.92.1, Hive 0.9.0, Pig 0.9.2, Sqoop 1.4.1, Oozie 3.1.3 and Zookeeper 3.3.4. All of the components have been tested and certified to work together. We have also added tools that simplify the installation and configuration steps in order to improve the experience of getting started with Apache Hadoop.
I’m a member of the general public! And you probably are too! 😉
See the rest of the post for more goodies that are included with this release.
Comments Off on Hortonworks Data Platform v1.0 Download Now Available
June 5, 2012
The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE by Russell Jurney.
From the post:
Series Introduction
This is part two of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.
Part one of this series is available here.
Code examples for this post are available here: https://github.com/rjurney/enron-hive.
In the last post, we used Pig to Extract-Transform-Load a MySQL database of the Enron emails to document format and serialize them in Avro. Now that we’ve done this, we’re ready to get to the business of data science: extracting new and interesting properties from our data for consumption by analysts and users. We’re also going to use Amazon EC2, as HIVE local mode requires Hadoop local mode, which can be tricky to get working.
Continues the high standard set in part one for walking through an entire data lifecycle in the Hadoop ecosystem.
Comments Off on The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE
May 17, 2012
New Features in Apache Pig 0.10 by Daniel Dai.
This is a useful summary of new features.
Daniel covers each new feature, gives an example and when necessary, a pointer to additional documentation. How cool is that?
Just to whet your appetite, Daniel covers (a small JRuby UDF sketch follows the list):
- Boolean Data Type
- Nested Cross/Foreach
- JRuby UDF
- Hadoop 0.23 (a.k.a. Hadoop 2.0) Support
and more.
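The promised JRuby sketch (the Ruby file and function are invented for illustration):

    REGISTER 'string_udfs.rb' USING jruby AS str;        -- hypothetical Ruby UDF file
    emails  = LOAD 'emails' AS (sender:chararray, subject:chararray);
    cleaned = FOREACH emails GENERATE sender, str.normalize(subject);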
Definitely worth your time to read and to emulate when you write blog posts about new features.
Comments Off on New Features in Apache Pig 0.10