With YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it simultaneously in different ways. Apache Tez supports YARN-based, high performance batch and interactive data processing applications in Hadoop that need to handle datasets scaling to terabytes or petabytes.
The Apache community just released Apache Pig 0.14.0, and the main feature is Pig on Tez. In this release, we closed 334 Jira tickets from 35 Pig contributors. Specific credit goes to the virtual team of Cheolsoo Park, Rohini Palaniswamy, Olga Natkovich, Mark Wagner, and Alex Bain, who were instrumental in getting Pig on Tez working!
This blog gives a brief overview of Pig on Tez and other new features included in the release.
Pig on Tez
Apache Tez is an alternative execution engine focusing on performance. It offers a more flexible interface so Pig can compile into a better execution plan than is possible with MapReduce. The result is consistent performance improvements in both large and small queries.
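Part of the appeal is that existing scripts need no changes to benefit. As a rough sketch (paths are placeholders), the same word-count script runs on either engine, selected only by a launch flag:

```pig
-- Run with: pig -x mapreduce wordcount.pig   (classic MapReduce)
--       or: pig -x tez wordcount.pig         (Tez engine, Pig 0.14+)
lines  = LOAD '/data/input' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/data/wordcount_out';
```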
Since it is the Thanksgiving holiday this week in the United States, this release reminds me to ask why is turkey the traditional Thanksgiving meal? Everyone likes bacon better. 😉
Posted in Hadoop, Pig | Comments Off on Announcing Apache Pig 0.14.0
Analysts can talk about data insights all day (and night), but the reality is that 70% of all data analyst time goes into data processing and not analysis. At Sigmoid Analytics, we want to streamline this data processing pipeline so that analysts can truly focus on value generation and not data preparation.
We focus our efforts on three simple initiatives:
Make data processing more powerful
Make data processing more simple
Make data processing 100x faster than before
As a data mashing platform, the first key initiative is to combine the power and simplicity of Apache Pig on Apache Spark, making existing ETL pipelines 100x faster than before. We do that via a unique mix of our operator toolkit, called DataDoctor, and Spark.
DataDoctor is a high-level operator DSL on top of Spark. It has frameworks for asymmetrical joins, sorting, grouping, and embedding native Spark functions. It hides a lot of complexity and makes it simple to implement data operators used in applications like Pig and Apache Hive on Spark.
For the uninitiated, Spark is open source Big Data infrastructure that enables distributed fault-tolerant in-memory computation. As the kernel for the distributed computation, it empowers developers to write testable, readable, and powerful Big Data applications in a number of languages including Python, Java, and Scala.
An introduction to Spork (Pig-on-Spark) and how to get started using it.
I know, more proof that Phil Karlton was correct in saying:
There are only two hard things in Computer Science: cache invalidation and naming things.
Posted in Pig, Spark | Comments Off on Pig is Flying: Apache Pig on Apache Spark (aka “Spork”)
This Hadoop tutorial is from the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download it to run this and other tutorials in the series. The tutorials presented here are for Sandbox v2.0.
The tutorials are presented in sections as listed below.
The Apache Pig community released Pig 0.13.0 earlier this month. Pig uses a simple scripting language to perform complex transformations on data stored in Apache Hadoop. The Pig community has been working diligently to prepare Pig to take advantage of the DAG processing capabilities in Apache Tez. We also improved usability and performance.
This blog post summarizes the progress we’ve made.
If you missed the Pig 0.13.0 release (I did), here’s a chance to catch up on the latest improvements.
Posted in Pig | Comments Off on Announcing Apache Pig 0.13.0
The CDH software stack lets you use your tool of choice with the Parquet file format, offering the benefits of columnar storage at each phase of data processing.
An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:
Columnar storage layout: A query can examine and perform calculations on all values for a column while reading only a small fraction of the data from a data file or table.
Flexible compression options: The data can be compressed with any of several codecs. Different data files can be compressed differently. The compression is transparent to applications that read the data files.
Innovative encoding schemes: Sequences of identical, similar, or related data values can be represented in ways that save disk space and memory, yet require little effort to decode. The encoding schemes provide an extra level of space savings beyond the overall compression for each data file.
Large file size: The layout of Parquet data files is optimized for queries that process large volumes of data, with individual files in the multi-megabyte or even gigabyte range.
Impala can create Parquet tables, insert data into them, convert data from other file formats to Parquet, and then perform SQL queries on the resulting data files. Parquet tables created by Impala can be accessed by Apache Hive, and vice versa.
That said, the CDH software stack lets you use the tool of your choice with the Parquet file format, for each phase of data processing. For example, you can read and write Parquet files using Apache Pig and MapReduce jobs. You can convert, transform, and query Parquet tables through Impala and Hive. And you can interchange data files between all of those components — including ones external to CDH, such as Cascading and Apache Tajo.
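As one hedged illustration of the Pig side of that interchange (paths, field names, and the jar name are placeholders; the parquet-pig loader and storer classes of this era lived under `parquet.pig`), reading and writing Parquet from a Pig script looks like this:

```pig
-- Assumes the parquet-pig bundle jar is available to the script.
REGISTER parquet-pig-bundle.jar;

-- Read an existing Parquet data set; the schema comes from the file footer.
raw = LOAD '/user/demo/events' USING parquet.pig.ParquetLoader();

-- Transform as usual, then write the result back out as Parquet.
clicks = FILTER raw BY event_type == 'click';
STORE clicks INTO '/user/demo/clicks_parquet' USING parquet.pig.ParquetStorer();
```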
In this blog post, you will learn the most important principles involved.
Since I mentioned ROOT files yesterday, I am curious what you make of the use of Thrift metadata definitions to read Parquet files?
It’s great that data can be documented for reading, but reading doesn’t imply to me that its semantics have been captured.
A wide variety of products read data, less certain they can document data semantics.
We love Apache Pig for data processing—it’s easy to learn, it works with all kinds of data, and it plays well with Python, Java, and other popular languages. And, of course, Pig runs on Hadoop, so it’s built for high-scale data science.
Whether you’re just getting started with Pig or you’ve already written a variety of Pig scripts, this compact reference gathers in one place many of the tools you’ll need to make the most of your data using Pig 0.12.
Easier on the eyes than a one pager!
Not to mention being a good example of how to write and format a cheat sheet.
I recently found two incredible functions in Apache Pig called CUBE and ROLLUP that every data scientist should know. These functions can be used to compute multi-level aggregations of a data set. I found the documentation for these functions to be confusing, so I will work through a simple example to explain how they work.
Joshua starts his post with a demonstration of using GROUP BY in Pig for simple aggregations. That sets the stage for demonstrating how important CUBE and ROLLUP can be for data aggregations in Pig.
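To see the shape of these operators without working through the full post, here is a minimal sketch on hypothetical sales data (the file and field names are made up; note that Pig names the output bag `cube` for both operators):

```pig
-- Hypothetical input: one record per sale.
sales = LOAD 'sales.tsv' AS (region:chararray, product:chararray, amount:double);

-- CUBE: totals for every combination of the dimensions,
-- including NULL ("all") for each dimension.
by_cube  = CUBE sales BY CUBE(region, product);
cube_out = FOREACH by_cube GENERATE
               FLATTEN(group) AS (region, product),
               SUM(cube.amount) AS total;

-- ROLLUP: hierarchical totals only:
-- (region, product), (region, all), (all, all).
by_rollup  = CUBE sales BY ROLLUP(region, product);
rollup_out = FOREACH by_rollup GENERATE
                 FLATTEN(group) AS (region, product),
                 SUM(cube.amount) AS total;
```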
Interesting possibilities suggest themselves by the time you finish Joshua’s posting.
Dave Fauth has written a great three-part series on extracting “insights” from large amounts of data.
From the third post in the series:
Earlier this year, the Sunlight Foundation filed a lawsuit under the Freedom of Information Act. The lawsuit requested solicitation and award notices from FBO.gov. In November, Sunlight received over a decade’s worth of information and posted the information on-line for public downloading. I want to say a big thanks to Ginger McCall and Kaitlin Devine for the work that went into making this data available.
In the first part of this series, I looked at the data and munged the data into a workable set. Once I had the data in a workable set, I created some heatmap charts of the data looking at agencies and who they awarded contracts to. In part two of this series, I created some bubble charts looking at awards by Agency and also the most popular Awardees.
In the third part of the series, I am going to look at awards by date and then display that information in a calendar view. Then we will look at the types of awards.
For the date analysis, we are going to use all of the data going back to 2000. We have six data files that we will join together, filter on the ‘Notice Type’ field, and then calculate the counts by date for the awards. The goal is to see when awards are being made.
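A hedged sketch of that union-filter-count pipeline in Pig Latin (file names, field names, and the notice-type value are placeholders, and only two of the six loads are shown):

```pig
-- Load each yearly extract; repeat for the remaining files.
y1 = LOAD 'fbo_2000_2004.csv' USING PigStorage(',')
     AS (notice_type:chararray, posted_date:chararray);
y2 = LOAD 'fbo_2005_2009.csv' USING PigStorage(',')
     AS (notice_type:chararray, posted_date:chararray);

all_notices = UNION y1, y2;

-- Keep only award notices, then count awards per date.
awards  = FILTER all_notices BY notice_type == 'Award Notice';
by_date = GROUP awards BY posted_date;
counts  = FOREACH by_date GENERATE group AS award_date, COUNT(awards) AS n;

STORE counts INTO 'award_counts_by_date';
```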
The most compelling lesson from this series is that data doesn’t always easily give up its secrets.
If you make it to the end of the series, you will find the government, on occasion, does the right thing. I’ll admit it, I was very surprised. 😉
Over the past three years, Endgame received 40 million samples of malware equating to roughly 19TB of binary data. In this, we’re not alone. McAfee reports that it currently receives roughly 100,000 malware samples per day and received roughly 10 million samples in the last quarter of 2012. Its total corpus is estimated to be about 100 million samples. VirusTotal receives between 300,000 and 600,000 unique files per day, and of those roughly one-third to half are positively identified as malware (as of April 9, 2013).
This huge volume of malware offers both challenges and opportunities for security research, especially applied machine learning. Endgame performs static analysis on malware in order to extract feature sets used for performing large-scale machine learning. Since malware research has traditionally been the domain of reverse engineers, most existing malware analysis tools were designed to process single binaries or multiple binaries on a single computer and are unprepared to confront terabytes of malware simultaneously. There is no easy way for security researchers to apply static analysis techniques at scale; companies and individuals that want to pursue this path are forced to create their own solutions.
Our early attempts to process this data did not scale well with the increasing flood of samples. As the size of our malware collection increased, the system became unwieldy and hard to manage, especially in the face of hardware failures. Over the past two years we refined this system into a dedicated framework based on Hadoop so that our large-scale studies are easier to perform and are more repeatable over an expanding dataset.
To address this problem, we created an open source framework, BinaryPig, built on Hadoop and Apache Pig (utilizing CDH, Cloudera’s distribution of Hadoop and related projects) and Python. It addresses many issues of scalable malware processing, including dealing with increasingly large data sizes, improving workflow development speed, and enabling parallel processing of binary files with most pre-existing tools. It is also modular and extensible, in the hope that it will aid security researchers and academics in handling ever-larger amounts of malware.
For more details about BinaryPig’s architecture and design, read our paper from Black Hat USA 2013 or check out our presentation slides. BinaryPig is an open source project under the Apache 2.0 License, and all code is available on Github.
You may have heard the rumor that storing more than seven (7) days of food marks you as a terrorist in the United States.
Be forewarned: Doing Massive Malware Analysis May Make You A Terrorist Suspect.
The “storing more than seven (7) days of food” rumor originated with Rand Paul (R-Kentucky).
The Community Against Terrorism FBI flyer, assuming the pointers I found are accurate, says nothing about how many days of food you have on hand.
Rather it says:
Make bulk purchases of items to include:
Meals Ready to Eat
That’s an example of using small data analysis to disprove a rumor.
Unless you are an anthropologist, I would not rely on data from C-SPAN2.
There are no two ways about it: Hadoop development iterations are slow. Traditional programmers have always had the benefit of re-compiling their app, running it, and seeing the results within seconds. They have near instant validation that what they’re building is actually working. When you’re working with Hadoop, dealing with gigabytes of data, your development iteration time is more like hours.
Inspired by the amazing real-time feedback experience of Light Table, we’ve built Mortar Watchtower to bring back that almost instant iteration cycle developers are used to. Not only that, Watchtower also helps surface the semantics of your Pig scripts, to give you insight into how your scripts are working, not just that they are working.
Watchtower is a daemon that sits in the background, continuously flowing a sample of your data through your script while you work. It captures what your data looks like, and shows how it mutates at each step, directly inline with your script.
I am not sure about the “…helps surface the semantics of your Pig scripts…,” but just checking scripts against data is a real boon.
I continue to puzzle over how the semantics of data and operations in Pig scripts should be documented.
Old style C comments seem out of place in 21st century programming.
This installment of the Hue demo series is about accessing the Hive Metastore from Hue, as well as using HCatalog with Hue. (Hue, of course, is the open source Web UI that makes Apache Hadoop easier to use.)
What is HCatalog?
HCatalog is a module in Apache Hive that enables non-Hive scripts to access Hive tables. You can then directly load tables with Apache Pig or MapReduce without having to re-define the input schemas, or worry about tracking or duplicating the data’s location.
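As a hedged sketch of what that buys you (the table name matches the Hive example tables shipped with the sandbox, and the HCatLoader class path reflects the pre-Hive-merge packaging of this era):

```pig
-- Run with: pig -useHCatalog script.pig
-- The schema comes from the Hive metastore, so no AS clause is needed.
sample = LOAD 'sample_07' USING org.apache.hcatalog.pig.HCatLoader();

-- Columns like 'salary' are immediately usable by name.
high_pay = FILTER sample BY salary > 100000;
DUMP high_pay;
```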
Hue contains a Web application for accessing the Hive metastore called Metastore Browser, which lets you explore, create, or delete databases and tables using wizards. (You can see a demo of these wizards in a previous tutorial about how to analyze Yelp data.) However, Hue uses HiveServer2 for accessing the metastore instead of HCatalog. This is because HiveServer2 is the new secure and concurrent server for Hive and it includes a fast Hive Metastore API.
HCatalog connectors are still useful for accessing Hive data through Pig, though. Here is a demo about accessing the Hive example tables from the Pig Editor:
Even prior to the semantics of data is access to the data! 😉
Plus mentions of what’s coming in Hue 3.0. (go read the post)
Posted in Hive, Hue, Pig | Comments Off on Using Hue to Access Hive Data Through Pig
Are you struggling with the basic ‘WordCount’ demo, or which Mahout algorithm you should be using? Forget hand-coding and see what you can do with Talend Studio.
In this on-demand webinar we demonstrate how you could become MUCH more productive with Hadoop and NoSQL. Talend Big Data allows you to develop in Eclipse and run your data jobs 100% natively on Hadoop… and become a big data guru overnight. Rémy Dubois, big data specialist and Talend Lead developer, shows you in real time:
How to visually create the ‘WordCount’ example in under 5 minutes
How to graphically build a big data job to perform sentiment analysis
How to archive NoSQL and optimize data warehouse usage
A content filled webinar! Who knew?
Be forewarned that the demos presume familiarity with the Talend interface and the demo presenter is difficult to understand.
From what I got out of the earlier parts of the webinar, very much a step in the right direction to empower users with big data.
Think of the distance between stacks of punch cards (Hadoop/MapReduce a few years ago) and the personal computer (Talend and others).
That was a big shift. This one is likely to be as well.
Looks like I need to spend some serious time with the latest Talend release!
Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: the very first release of a fully integrated Big Data management distribution built on the most advanced Hadoop 2.x release to date, Hadoop 2.0.5-alpha.
Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which utilizes packaging code from Bigtop — Cloudera takes pride in developing open source packaging code and contributing the same back to the community.
The very astute readers of this blog will notice that given our quarterly release schedule, Bigtop 0.6.0 should have been called Bigtop 0.7.0. It is true that we skipped a quarter. Our excuse is that we spent all this extra time helping the Hadoop community stabilize the Hadoop 2.x code line and making it a robust kernel for all the applications that are now part of the Bigtop distribution.
And speaking of applications, we haven’t forgotten to grow the Bigtop family: Bigtop 0.6.0 adds Apache HCatalog and Apache Giraph to the mix. The full list of Hadoop applications available as part of the Bigtop 0.6.0 release is:
Apache ZooKeeper 3.4.5
Apache Flume 1.3.1
Apache HBase 0.94.5
Apache Pig 0.11.1
Apache Hive 0.10.0
Apache Sqoop 2 (AKA 1.99.2)
Apache Oozie 3.3.2
Apache Whirr 0.8.2
Apache Mahout 0.7
Apache Solr (SolrCloud) 4.2.1
Apache Crunch (incubating) 0.5.0
Apache HCatalog 0.5.0
Apache Giraph 1.0.0
LinkedIn DataFu 0.0.6
Cloudera Hue 2.3.0
And we were just talking about YARN and applications weren’t we? 😉
(Participate if you can but at least send a note of appreciation to Cloudera.)
Complementing the editors for Hive and Cloudera Impala, the Pig editor provides a great starting point for exploration and real-time interaction with Hadoop. This new application lets you edit and run Pig scripts interactively in an editor tailored for a great user experience. Features include:
UDFs and parameters (with default value) support
Autocompletion of Pig keywords, aliases, and HDFS paths
One-click script submission
Progress, result, and logs display
Interactive single-page application
Here’s a short video demoing its capabilities and ease of use:
How are you editing your Pig scripts now?
How are you documenting the semantics of your Pig scripts?
For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.
As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, often the amounts of data are so large that working with them on a traditional SQL database is so non-performant as to be impractical.
Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)
This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.
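A hedged side-by-side example captures the spirit of that mapping (table, field names, and the delimiter are placeholders): where SQL states the whole query at once, Pig names each intermediate step.

```pig
-- SQL:  SELECT dept, COUNT(*), AVG(salary)
--       FROM employees
--       WHERE salary > 50000
--       GROUP BY dept
--       ORDER BY dept;
employees = LOAD 'employees' USING PigStorage('\t')
            AS (name:chararray, dept:chararray, salary:double);
well_paid = FILTER employees BY salary > 50000.0;
by_dept   = GROUP well_paid BY dept;
stats     = FOREACH by_dept GENERATE
                group AS dept,
                COUNT(well_paid) AS headcount,
                AVG(well_paid.salary) AS avg_salary;
ordered   = ORDER stats BY dept;
DUMP ordered;
```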
After months of work, we are happy to announce the 0.11 release of Apache Pig. In this blog post, we highlight some of the major new features and performance improvements that were contributed to this release. A large chunk of the new features was created by Google Summer of Code (GSoC) students with supervision from the Apache Pig PMC, while the core Pig team focused on performance improvements, usability issues, and bug fixes. We encourage CS students to consider applying for GSoC in 2013 — it’s a great way to contribute to open source software.
This blog post hits some of the highlights of the release. Pig users may also find a presentation by Daniel Dai, which includes code and output samples for the new operators, helpful.
Pig is a SQL-like scripting language for use with Hadoop. We review a simple Pig script line by line to help you understand how Pig works, and cover regular expressions to help parse data. If you want a copy of the slide presentation, it is over on SlideShare: http://www.slideshare.net/rmorrill.
Very good intro to Pig!
Mentions a couple of resources you need to bookmark:
Input Validation Cheat Sheet (The Open Web Application Security Project – OWASP) – regexes to re-use in Pig scripts. Lots of other regex cheat sheet pointers. (Being mindful that “\” must be escaped in Pig.)
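That escaping caveat is worth a concrete sketch (file name and pattern are illustrative): because Pig string literals consume one level of backslashes, `\d` in a regex must be written `\\d`.

```pig
-- Note the doubled backslashes: "\d" must be written "\\d" in Pig strings.
logs = LOAD 'access_log' AS (line:chararray);

-- Pull the leading IPv4 address out of each log line.
ips = FOREACH logs GENERATE
          REGEX_EXTRACT(line,
              '^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})', 1) AS ip;

-- REGEX_EXTRACT returns NULL on non-matching lines; drop those.
valid = FILTER ips BY ip IS NOT NULL;
```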
According to the Transaction Processing Performance Council, TPC-H is:
The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.
TPC-H was implemented for Hive in HIVE-600 and for Pig in PIG-2397 by Hortonworks intern Jie Li. In going over this work, I was struck by how it outlined differences between Pig and SQL.
There seems to be a tendency for simple SQL to provide greater clarity than Pig. At some point, as the TPC-H queries become more demanding, complex SQL seems to have less clarity than the comparable Pig. Let’s take a look.
(emphasis in original)
A refresher in the lesson that which solution you need, in this case Hive or Pig, depends upon your requirements.
Use either one blindly at the risk of poor performance or failing to meet other requirements.
Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode two dimensions, such as date and counter type.
How do you solve this mismatch? If you’re in the early stages of designing a schema, you could reconsider a more row-based approach. If you have to work with an existing schema, however, you can solve it with the help of Pig UDFs.
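A minimal sketch of the shape such a solution takes (the table, column-family layout, and the `MapToBag` UDF are all hypothetical stand-ins; the HBaseStorage loader itself is built into Pig):

```pig
-- Hypothetical HBase table whose column names encode date_countertype,
-- e.g. 'd:20130501_clicks'. Load the whole column family as a map.
raw = LOAD 'hbase://metrics'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('d:*', '-loadKey true')
      AS (rowkey:chararray, cols:map[]);

-- A real solution typically uses a custom UDF to turn the map into a
-- bag of (name, value) pairs; MapToBag stands in for such a UDF here.
pairs = FOREACH raw GENERATE rowkey,
            FLATTEN(MapToBag(cols)) AS (colname:chararray, value:long);

-- Split the composite column name back into its two dimensions.
dims = FOREACH pairs GENERATE
           rowkey,
           FLATTEN(STRSPLIT(colname, '_')) AS (day:chararray, counter:chararray),
           value;
```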
Now there’s an ugly problem.
You can split the label from the data as shown, but that doesn’t help when the label/data is still in situ.
Saying: “Don’t do that!” doesn’t help because it is already being done.
If anything, topic maps need to take subjects as they are found, not as we might wish for them to be.
Curious, would you write an identifier as a regex that parses such a mix of label and data, assigning each to further processing?