Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 16, 2013

ST_Geometry Aggregate Functions for Hive…

Filed under: Geographic Data,Geographic Information Retrieval,Hadoop,Hive — Patrick Durusau @ 4:00 pm

ST_Geometry Aggregate Functions for Hive in Spatial Framework for Hadoop by Jonathan Murphy.

From the post:

We are pleased to announce that the ST_Geometry aggregate functions are now available for Hive, in the Spatial Framework for Hadoop. The aggregate functions can be used to perform a convex-hull, intersection, or union operation on geometries from multiple records of a dataset.

While the non-aggregate ST_ConvexHull function returns the convex hull of the geometries passed to a single function call, the ST_Aggr_ConvexHull function accumulates the geometries from the rows selected by a query, and performs a convex hull operation over those geometries. Likewise, ST_Aggr_Intersection and ST_Aggr_Union aggregate the geometries from multiple selected rows, to perform intersection and union operations, respectively.

The example given covers earthquake data and California-county data.
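
For a rough sense of how the aggregate functions read, here is a minimal HiveQL sketch (the earthquakes table and its columns are my own invention, not taken from the post):

    -- One convex hull over every earthquake epicenter in the table
    SELECT ST_Aggr_ConvexHull(shape) FROM earthquakes;

    -- Or one hull per magnitude class, treating each class as its own group
    SELECT magnitude_class, ST_Aggr_ConvexHull(shape)
    FROM earthquakes
    GROUP BY magnitude_class;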

I have a weakness for aggregating functions as you know. 😉

The other point these aggregate functions illustrate is that sometimes you want subjects to be treated as independent of each other and sometimes you want to treat them as a group.

Depends upon your requirements.

There really isn’t a one size fits all granularity of subject identity for all situations.

August 9, 2013

Using Hue to Access Hive Data Through Pig

Filed under: Hive,Hue,Pig — Patrick Durusau @ 2:39 pm

Demo: Using Hue to Access Hive Data Through Pig by Hue Team.

From the post:

This installment of the Hue demo series is about accessing the Hive Metastore from Hue, as well as using HCatalog with Hue. (Hue, of course, is the open source Web UI that makes Apache Hadoop easier to use.)

What is HCatalog?

HCatalog is a module in Apache Hive that enables non-Hive scripts to access Hive tables. You can then directly load tables with Apache Pig or MapReduce without having to worry about re-defining the input schemas, or caring about or duplicating the data’s location.

Hue contains a Web application for accessing the Hive metastore called Metastore Browser, which lets you explore, create, or delete databases and tables using wizards. (You can see a demo of these wizards in a previous tutorial about how to analyze Yelp data.) However, Hue uses HiveServer2 for accessing the metastore instead of HCatalog. This is because HiveServer2 is the new secure and concurrent server for Hive and it includes a fast Hive Metastore API.

HCatalog connectors are still useful for accessing Hive data through Pig, though. Here is a demo about accessing the Hive example tables from the Pig Editor:

Even prior to the semantics of data is access to the data! 😉

Plus mentions of what’s coming in Hue 3.0. (go read the post)
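
For a hedged feel of the Hive side of that handshake, a table defined once in the metastore (names below are invented) is exposed by HCatalog to Pig and MapReduce with the same schema and location:

    -- Define the table once in the Hive metastore; HCatalog exposes the
    -- same schema and storage location to Pig (via HCatLoader) and MapReduce.
    CREATE TABLE web_logs (
      ip        STRING,
      hit_time  TIMESTAMP,
      url       STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS RCFILE;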

July 27, 2013

…Spatial Analytics with Hive and Hadoop

Filed under: Geo Analytics,Hadoop,Hive — Patrick Durusau @ 5:54 pm

How To Perform Spatial Analytics with Hive and Hadoop by Carter Shanklin.

From the post:

One of the big opportunities that Hadoop provides is the processing power to unlock value in big datasets of varying types from the ‘old’ such as web clickstream and server logs, to the new such as sensor data and geolocation data.

The explosion of smart phones in the consumer space (and smart devices of all kinds more generally) has continued to accelerate the next generation of apps such as Foursquare and Uber which depend on the processing of and insight from huge volumes of incoming data.

In the slides below we look at a sample, anonymized data set from Uber that is available on Infochimps. We step through the basics of analyzing the data in Hive and learn how spatial analysis can be used to decide whether a new product offering is viable or not.

Great tutorial and slides!

My only reservation is the use of geo-location data to make a judgement about the potential for a new ride service.

Geo-location data is only one way to determine the potential for a ride service. Surveying potential riders would be another.

Or to put it another way, having data to crunch doesn’t mean crunching data will lead to the best answer.
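
For readers who want a feel for the spatial queries the slides build toward, here is a minimal sketch in the tutorial’s spirit (the uber_trips and districts tables and their columns are my own invention, and I’m assuming the framework’s ST_Point and ST_Contains functions):

    -- Count pickups per district by testing each GPS point against the
    -- district polygons. The spatial predicate goes in WHERE because
    -- Hive's JOIN ... ON clause expects equality conditions.
    SELECT d.district_name, COUNT(*) AS pickups
    FROM uber_trips t JOIN districts d
    WHERE ST_Contains(d.shape, ST_Point(t.longitude, t.latitude))
    GROUP BY d.district_name
    ORDER BY pickups DESC;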

July 24, 2013

…Sentry: Fine-Grained Authorization for Impala and Apache Hive

Filed under: BigData,Cloudera,Cybersecurity,Hive,Impala,Security — Patrick Durusau @ 2:19 pm

…Sentry: Fine-Grained Authorization for Impala and Apache Hive

From the post:

Cloudera, the leader in enterprise analytic data management powered by Apache Hadoop™, today unveiled the next step in the evolution of enterprise-class big data security, introducing Sentry: a new Apache licensed open source project that delivers the industry’s first fine-grained authorization framework for Hadoop. An independent security module that integrates with open source SQL query engines Apache Hive and Cloudera Impala, Sentry delivers advanced authorization controls to enable multi-user applications and cross-functional processes for enterprise datasets. This level of granular control, available for the first time in Hadoop, is imperative to meet enterprise Role Based Access Control (RBAC) requirements of highly regulated industries, like healthcare, financial services and government. Sentry alleviates the security concerns that have prevented some organizations from opening Hadoop data systems to a more diverse set of users, extending the power of Hadoop and making it suitable for new industries, organizations and enterprise use cases. Concurrently, the company confirmed it plans to submit the Sentry security module to the Apache Incubator at the Apache Software Foundation later this year.

Welcome news but I could not bring myself to include all the noise words in the press release title. 😉

For technical details, see: http://cloudera.com/content/cloudera/en/Campaign/introducing-sentry.html.

Just a word of advice: This doesn’t “solve” big data security issues. It is one aspect of big data security.

Another aspect of big data security is not allowing people to bring magnetic media into your facility or to leave with it. Ever.

Not to mention using glue to permanently close all USB ports and CD/DVD drives.

There is always tension between how much security you need versus the cost and inconvenience.

Another form of security: Have your supervisor’s approval in writing for deviations from known “good” security practices.

June 26, 2013

Apache Bigtop: The “Fedora of Hadoop”…

Filed under: Bigtop,Crunch,DataFu,Flume,Giraph,HBase,HCatalog,Hive,Hue,Mahout,Oozie,Pig,Solr,Sqoop,Zookeeper — Patrick Durusau @ 10:45 am

Apache Bigtop: The “Fedora of Hadoop” is Now Built on Hadoop 2.x by Roman Shaposhnik.

From the post:

Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: The very first release of a fully integrated Big Data management distribution built on the currently most advanced Hadoop 2.x, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which utilizes its packaging code from Bigtop — Cloudera takes pride in developing open source packaging code and contributing the same back to the community.

The very astute readers of this blog will notice that given our quarterly release schedule, Bigtop 0.6.0 should have been called Bigtop 0.7.0. It is true that we skipped a quarter. Our excuse is that we spent all this extra time helping the Hadoop community stabilize the Hadoop 2.x code line and making it a robust kernel for all the applications that are now part of the Bigtop distribution.

And speaking of applications, we haven’t forgotten to grow the Bigtop family: Bigtop 0.6.0 adds Apache HCatalog and Apache Giraph to the mix. The full list of Hadoop applications available as part of the Bigtop 0.6.0 release is:

  • Apache Zookeeper 3.4.5
  • Apache Flume 1.3.1
  • Apache HBase 0.94.5
  • Apache Pig 0.11.1
  • Apache Hive 0.10.0
  • Apache Sqoop 2 (AKA 1.99.2)
  • Apache Oozie 3.3.2
  • Apache Whirr 0.8.2
  • Apache Mahout 0.7
  • Apache Solr (SolrCloud) 4.2.1
  • Apache Crunch (incubating) 0.5.0
  • Apache HCatalog 0.5.0
  • Apache Giraph 1.0.0
  • LinkedIn DataFu 0.0.6
  • Cloudera Hue 2.3.0

And we were just talking about YARN and applications, weren’t we? 😉

Enjoy!

(Participate if you can but at least send a note of appreciation to Cloudera.)

June 9, 2013

Presto is Coming!

Filed under: Facebook,Hive,Presto — Patrick Durusau @ 5:34 pm

Facebook unveils Presto engine for querying 250 PB data warehouse by Jordan Novet.

From the post:

At a conference for developers at Facebook headquarters on Thursday, engineers working for the social networking giant revealed that it’s using a new homemade query engine called Presto to do fast interactive analysis on its already enormous 250-petabyte-and-growing data warehouse.

More than 850 Facebook employees use Presto every day, scanning 320 TB each day, engineer Martin Traverso said.

“Historically, our data scientists and analysts have relied on Hive for data analysis,” Traverso said. “The problem with Hive is it’s designed for batch processing. We have other tools that are faster than Hive, but they’re either too limited in functionality or too simple to operate against our huge data warehouse. Over the past few months, we’ve been working on Presto to basically fill this gap.”

Facebook created Hive several years ago to give Hadoop some data warehouse and SQL-like capabilities, but it is showing its age in terms of speed because it relies on MapReduce. Scanning over an entire dataset could take many minutes to hours, which isn’t ideal if you’re trying to ask and answer questions in a hurry.

With Presto, however, simple queries can run in a few hundred milliseconds, while more complex ones will run in a few minutes, Traverso said. It runs in memory and never writes to disk, Traverso said.

Traverso goes on to say that Facebook will open source Presto this coming Fall.

See my prior post for a more technical description of Presto: Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.

Bear in mind that getting an answer from 250 PB of data quickly isn’t the same thing as getting a useful answer quickly.

May 18, 2013

Apache Hive 0.11: Stinger Phase 1 Delivered

Filed under: Hadoop,Hive,SQL,STINGER — Patrick Durusau @ 3:47 pm

Apache Hive 0.11: Stinger Phase 1 Delivered by Owen O’Malley.

From the post:

In February, we announced the Stinger Initiative, which outlined an approach to bring interactive SQL-query into Hadoop. Simply put, our choice was to double down on Hive to extend it so that it could address human-time use cases (i.e. queries in the 5-30 second range). So, with input and participation from the broader community we established a fairly audacious goal of 100X performance improvement and SQL compatibility.

Introducing Apache Hive 0.11 – 386 JIRA tickets closed

As representatives of this open, community led effort we are very proud to announce the first release of the new and improved Apache Hive, version 0.11. This substantial release embodies the work of a wide group of people from Microsoft, Facebook, Yahoo, SAP and others. Together we have addressed 386 JIRA tickets, of which there were 28 new features and 276 bug fixes. There were FIFTY-FIVE developers involved in this and I would like to thank every one of them. See below for a full list.

Delivering on the promise of Stinger Phase 1

As promised we have delivered phase 1 of the Stinger Initiative in late spring. This release is another proof point that the open community can innovate at a rate unequaled by any proprietary vendor. As part of phase 1 we promised windowing, new data types, the optimized RC (ORC) file and base optimizations to the Hive Query engine and the community has delivered these key features.


Welcome news for the Hive and SQL communities alike!
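
A hedged taste of the two headline features, windowing and the ORC file format (the sales table is my own invention):

    -- ORC storage, new in Hive 0.11
    CREATE TABLE sales_orc (customer STRING, amount DOUBLE, sale_date STRING)
    STORED AS ORC;

    -- Windowing, also new in 0.11: rank each customer's purchases by size
    SELECT customer, amount,
           RANK() OVER (PARTITION BY customer ORDER BY amount DESC) AS rnk
    FROM sales_orc;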

April 28, 2013

What’s New in Hue 2.3

Filed under: Hadoop,Hive,Hue,Impala — Patrick Durusau @ 3:43 pm

What’s New in Hue 2.3

From the post:

We’re very happy to announce the 2.3 release of Hue, the open source Web UI that makes Apache Hadoop easier to use.

Hue 2.3 comes only two months after 2.2 but contains more than 100 improvements and fixes. In particular, two new apps were added (including an Apache Pig editor) and the query editors are now easier to use.

Here’s the new features list:

  • Pig Editor: new application for editing and running Apache Pig scripts with UDFs and parameters
  • Table Browser: new application for managing Apache Hive databases, viewing table schemas and sample of content
  • Apache Oozie Bundles are now supported
  • SQL highlighting and auto-completion for Hive/Impala apps
  • Multi-query and highlight/run a portion of a query
  • Job Designer was totally restyled and now supports all Oozie actions
  • Oracle databases (11.2 and later) are now supported

Time to upgrade!

April 19, 2013

Analyzing Data with Hue and Hive

Filed under: Hadoop,Hive,Hue — Patrick Durusau @ 2:06 pm

Analyzing Data with Hue and Hive by Romain Rigaux.

From the post:

In the first installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how file operations are simplified via the File Browser application. In this installment, we’ll focus on analyzing data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications (based on Hue 2.3 and later).

The Yelp Dataset Challenge provides a good use case. This post explains, through a video and tutorial, how you can get started doing some analysis and exploration of Yelp data with Hue. The goal is to find the coolest restaurants in Phoenix!

I think the demo would be more effective if a city known for good food, New Orleans, for example, had been chosen for the challenge.

But given the complexity of the cuisine, that would be a stress test for human experts.

What chance would Apache Hadoop have? 😉
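
Back in Phoenix, the kind of Beeswax query the demo runs looks roughly like this (a sketch; the column names are guesses at the Yelp schema, not taken from the demo):

    -- Highest-rated, well-reviewed businesses in Phoenix
    SELECT b.name, b.stars, b.review_count
    FROM business b
    WHERE b.city = 'Phoenix'
      AND b.review_count >= 100
    ORDER BY b.stars DESC, b.review_count DESC
    LIMIT 20;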

March 26, 2013

Analyzing Twitter Data with Apache Hadoop, Part 3:…

Filed under: Hadoop,Hive,Tweets — Patrick Durusau @ 12:52 pm

Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive by Jon Natkins.

From the post:

This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.

In the previous article in this series, we saw how Flume can be utilized to ingest data into Hadoop. However, that data is useless without some way to analyze the data. Personally, I come from the relational world, and SQL is a language that I speak fluently. Apache Hive provides an interface that allows users to easily access data in Hadoop via SQL. Hive compiles SQL statements into MapReduce jobs, and then executes them across a Hadoop cluster.

In this article, we’ll learn more about Hive, its strengths and weaknesses, and why Hive is the right choice for analyzing tweets in this application.

I didn’t realize I had missed this part of the Hive series until I saw it mentioned in the Hue post.

Good introduction to Hive.
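
To make the “semi-structured” part concrete, here is a hedged sketch of the kind of query the article builds toward, assuming a tweets table whose nested JSON fields (entities.hashtags and the like) are exposed through a JSON SerDe as Hive structs and arrays:

    -- Most-used hashtags, pulled out of a nested array column
    SELECT LOWER(ht.text) AS hashtag, COUNT(*) AS uses
    FROM tweets
    LATERAL VIEW EXPLODE(entities.hashtags) hashtag_table AS ht
    GROUP BY LOWER(ht.text)
    ORDER BY uses DESC
    LIMIT 15;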

BTW, is Twitter data becoming the “hello world” of data mining?

How-to: Analyze Twitter Data with Hue

Filed under: Hadoop,Hive,Hue,Tweets — Patrick Durusau @ 12:46 pm

How-to: Analyze Twitter Data with Hue by Romain Rigaux.

From the post:

Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.

This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.

The Hadoop ecosystem continues to improve!

Question: Is anyone keeping a current listing/map of the various components in the Hadoop ecosystem?

March 25, 2013

HOWTO use Hive to SQLize your own Tweets…

Filed under: Hive,SQL,Tweets — Patrick Durusau @ 2:59 am

HOWTO use Hive to SQLize your own Tweets – Part One: ETL and Schema Discovery by Russell Jurney.

HOWTO use Hive to SQLize your own Tweets – Part Two: Loading Hive, SQL Queries

Russell walks you through extracting your tweets, discovering their schema, loading them into Hive and querying the result.

I just requested my tweets on Friday so I expect to see them tomorrow or Tuesday.

Will be a bit more complicated than Russell’s example because I re-post tweets about older posts on my blog.

I will have to delete those, although I may want to know when a particular tweet appeared, which means I will need to capture the date(s) before deleting.
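
Something along these lines should handle it, assuming the tweets end up in a Hive table with text and created_at columns (a sketch, not Russell’s code):

    -- One row per distinct tweet text, with every date it was (re)posted
    SELECT text,
           COLLECT_SET(created_at) AS posted_on,
           COUNT(*)                AS times_posted
    FROM my_tweets
    GROUP BY text;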

BTW, if you do obtain your tweet archive, consider donating it to #Tweets4Science.

February 20, 2013

The Stinger Initiative: Making Apache Hive 100 Times Faster

Filed under: Hive,Hortonworks — Patrick Durusau @ 9:23 pm

The Stinger Initiative: Making Apache Hive 100 Times Faster by Alan Gates.

From the post:

Introduced by Facebook in 2007, Apache Hive and its HiveQL interface has become the de facto SQL interface for Hadoop. Today, companies of all types and sizes use Hive to access Hadoop data in a familiar way and to extend value to their organization or customers either directly or through a broad ecosystem of existing BI tools that rely on this key proven interface. The who’s who of business analytics have already adopted Hive.

Hive was originally built for large-scale operational batch processing and it is very effective with reporting, data mining and data preparation use cases. These usage patterns remain very important but with widespread adoption of Hadoop, the enterprise requirement for Hadoop to become more real time or interactive has increased in importance as well. At Hortonworks, we believe in the power of the open source community to innovate faster than any proprietary offering and the Stinger initiative is proof of this once again as we collaborate with others to improve Hive performance.

So, What is Stinger?

Enabling Hive to answer human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.

To this end, we have launched the Stinger Initiative, with input and participation from the broader community, to enhance Hive with more SQL and better performance for these human-time use cases. All the while, HiveQL remains the same before and after these advancements so it just gets better. And in keeping with the ecosystem of existing tools, it is complementary to best-of-breed data warehouses and analytic platforms.

Leveraging existing skills and infrastructure.

Who knows? Hortonworks may be about to start a trend!

February 18, 2013

Writing Hive UDFs – a tutorial

Filed under: Hive,HiveQL — Patrick Durusau @ 9:44 am

Writing Hive UDFs – a tutorial by Alexander Dean.

Synopsis:

In this article you will learn how to write a user-defined function (“UDF”) to work with the Apache Hive platform. We will start gently with an introduction to Hive, then move on to developing the UDF and writing tests for it. We will write our UDF in Java, but use Scala’s SBT as our build tool and write our tests in Scala with Specs2.

In order to get the most out of this article, you should be comfortable programming in Java. You do not need to have any experience with Apache Hive, HiveQL (the Hive query language) or indeed Hive UDFs – I will introduce all of these concepts from first principles. Experience with Scala is advantageous, but not necessary.

The example UDF isn’t impressive, so more interesting ones are left as an exercise for the reader. 😉

Also of interest:

Hive User Defined Functions (at the Apache Hive wiki).

Which you should compare to:

What are the biggest feature gaps between HiveQL and SQL? (at Quora)

There are plenty of opportunities for new UDFs, including those addressing semantic integration.
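
Whatever the UDF does, the HiveQL side of wiring it in is short. A minimal sketch, with a placeholder jar path and class name:

    -- Make the compiled UDF visible to the session, then call it like a
    -- built-in function
    ADD JAR /tmp/my-udfs.jar;
    CREATE TEMPORARY FUNCTION normalize_name AS 'com.example.hive.NormalizeName';

    SELECT normalize_name(author), COUNT(*)
    FROM posts
    GROUP BY normalize_name(author);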

I first saw this in NoSQL Weekly, Issue 116.

February 13, 2013

Imperative and Declarative Hadoop: TPC-H in Pig and Hive

Filed under: Hadoop,Hive,MapReduce,Pig,TPC-H — Patrick Durusau @ 11:41 am

Imperative and Declarative Hadoop: TPC-H in Pig and Hive by Russell Jurney.

From the post:

According to the Transaction Processing Council, TPC-H is:

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

TPC-H was implemented for Hive in HIVE-600 and for Pig in PIG-2397 by Hortonworks intern Jie Li. In going over this work, I was struck by how it outlined differences between Pig and SQL.

There seems to be a tendency for simple SQL to provide greater clarity than Pig. At some point, as the TPC-H queries become more demanding, complex SQL seems to have less clarity than the comparable Pig. Let’s take a look.
(emphasis in original)

A refresher on the lesson that which solution you need, in this case Hive or Pig, depends upon your requirements.

Use either one blindly at the risk of poor performance or failing to meet other requirements.

February 7, 2013

A Quick Guide to Hadoop Map-Reduce Frameworks

Filed under: Hadoop,Hive,MapReduce,Pig,Python,Scalding,Scoobi,Scrunch,Spark — Patrick Durusau @ 10:45 am

A Quick Guide to Hadoop Map-Reduce Frameworks by Alex Popescu.

Alex has assembled links to guides to MapReduce frameworks:

Thanks Alex!

January 16, 2013

Apache Hive 0.10.0 is Now Available

Filed under: Hadoop,Hive,MapReduce — Patrick Durusau @ 7:57 pm

Apache Hive 0.10.0 is Now Available by Ashutosh Chauhan.

From the post:

We are pleased to announce the release of Apache Hive version 0.10.0. More than 350 JIRA issues have been fixed with this release. A few of the most important fixes include:

Cube and Rollup: Hive now has support for creating cubes with rollups. Thanks to Namit!

List Bucketing: This is an optimization that lets you better handle skew in your tables. Thanks to Gang!

Better Windows Support: Several Hive 0.10.0 fixes support running Hive natively on Windows. There is no more cygwin dependency. Thanks to Kanna!

‘Explain’ Adds More Info: Now you can do an explain dependency and the explain plan will contain all the tables and partitions touched upon by the query. Thanks to Sambavi!

Improved Authorization: The metastore can now optionally do authorization checks on the server side instead of on the client, providing you with a better security profile. Thanks to Sushanth!

Faster Simple Queries: Some simple queries that don’t require aggregations, and therefore MapReduce jobs, can now run faster. Thanks to Navis!

Better YARN Support: This release contains additional work aimed at making Hive work well with Hadoop YARN. While not all test cases are passing yet, there has been a lot of good progress made with this release. Thanks to Zhenxiao!

Union Optimization: Hive queries with unions will now result in a lower number of MapReduce jobs under certain conditions. Thanks to Namit!

Undo Your Drop Table: While not really truly ‘undo’, you can now reinstate your table after dropping it. Thanks to Andrew!

Show Create Table: This lets you see how you created your table. Thanks to Feng!

Support for Avro Data: Hive now has built-in support for reading/writing Avro data. Thanks to Jakob!

Skewed Joins: Hive’s support for joins involving skewed data is now improved. Thanks to Namit!

Robust Connection Handling at the Metastore Layer: Connection handling between a metastore client and server and also between a metastore server and the database layer has been improved. Thanks to Bhushan and Jean!

More Statistics: It’s now possible to collect and store scalar-valued statistics for your tables and partitions. This will enable better query planning in upcoming releases. Thanks to Shreepadma!

Better-Looking HWI: HWI now uses a Bootstrap JavaScript library. It looks really slick. Thanks to Hugo!

If you are excited about some of these new features, I recommend that you download hive-0.10 from: Hive 0.10 Release.

The full Release Notes are available here: Hive 0.10.0 Release Notes

This release saw contributions from many different people. We have numerous folks reporting bugs, writing patches for new features, fixing bugs, testing patches, helping users on mailing lists etc. We would like to give a big thank you to everyone who made hive-0.10 possible.

-Ashutosh Chauhan

A long quote but it helps to give credit where credit is due.

January 5, 2013

Apache Crunch

Filed under: Cascading,Hive,MapReduce,Pig — Patrick Durusau @ 7:50 am

Apache Crunch: A Java Library for Easier MapReduce Programming by Josh Wills.

From the post:

Apache Crunch (incubating) is a Java library for creating MapReduce pipelines that is based on Google’s FlumeJava library. Like other high-level tools for creating MapReduce jobs, such as Apache Hive, Apache Pig, and Cascading, Crunch provides a library of patterns to implement common tasks like joining data, performing aggregations, and sorting records. Unlike those other tools, Crunch does not impose a single data type that all of its inputs must conform to. Instead, Crunch uses a customizable type system that is flexible enough to work directly with complex data such as time series, HDF5 files, Apache HBase tables, and serialized objects like protocol buffers or Avro records.

Crunch does not try to discourage developers from thinking in MapReduce, but it does try to make thinking in MapReduce easier to do. MapReduce, for all of its virtues, is the wrong level of abstraction for many problems: most interesting computations are made up of multiple MapReduce jobs, and it is often the case that we need to compose logically independent operations (e.g., data filtering, data projection, data transformation) into a single physical MapReduce job for performance reasons.

Essentially, Crunch is designed to be a thin veneer on top of MapReduce — with the intention being not to diminish MapReduce’s power (or the developer’s access to the MapReduce APIs) but rather to make it easy to work at the right level of abstraction for the problem at hand.

Although Crunch is reminiscent of the venerable Cascading API, their respective data models are very different: one simple common-sense summary would be that folks who think about problems as data flows prefer Crunch and Pig, and people who think in terms of SQL-style joins prefer Cascading and Hive.

Brief overview of Crunch and an example (word count) application.

Definitely a candidate for your “big data” tool belt.

December 21, 2012

How-to: Use a SerDe in Apache Hive

Filed under: Hive — Patrick Durusau @ 4:04 pm

How-to: Use a SerDe in Apache Hive by Jonathan Natkins.

From the post:

Apache Hive is a fantastic tool for performing SQL-style queries across data that is often not appropriate for a relational database. For example, semistructured and unstructured data can be queried gracefully via Hive, due to two core features: The first is Hive’s support of complex data types, such as structs, arrays, and unions, in addition to many of the common data types found in most relational databases. The second feature is the SerDe.

What is a SerDe?

The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.

In this article, we will examine a SerDe for processing JSON data, which can be used to transform a JSON record into something that Hive can process.

You may be too busy to notice if you have any presents under the tree. 😉
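
The DDL side of the article boils down to something like this (a sketch; the SerDe class name is a placeholder for whichever JSON SerDe you actually deploy):

    -- Tell Hive how to deserialize each JSON record into columns
    CREATE EXTERNAL TABLE raw_events (
      event_type STRING,
      user_id    STRING,
      payload    MAP<STRING, STRING>
    )
    ROW FORMAT SERDE 'com.example.hive.serde.JsonSerDe'
    LOCATION '/data/raw_events';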

October 3, 2012

CDH4.1 Now Released!

Filed under: Cloudera,Flume,Hadoop,HBase,HDFS,Hive,Pig — Patrick Durusau @ 8:28 pm

CDH4.1 Now Released! by Charles Zedlewski.

From the post:

We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:

  • Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
  • Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
  • Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
  • Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
  • FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink and metrics for monitoring, as well as a number of performance improvements.
  • Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
  • Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.

It’s releases like this that make me wish I spent more time writing documentation for software. To try out all the cool features with no real goal other than trying them out.

Enjoy!

September 19, 2012

Analyzing Twitter Data with Hadoop [Hiding in a Public Data Stream]

Filed under: Cloudera,Flume,Hadoop,HDFS,Hive,Oozie,Tweets — Patrick Durusau @ 10:46 am

Analyzing Twitter Data with Hadoop by Jon Natkins

From the post:

Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.

In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline are available on the Cloudera Github.

Looking forward to more posts in this series!

Social media is a focus for marketing teams for obvious reasons.

Analysis of snaps, crackles and pops en masse.

What if you wanted to communicate securely with others using social media?

Thinking of something more robust and larger than two (or three) lovers agreeing on code words.

How would you hide in a public data stream?

Or the converse, how would you hunt for someone in a public data stream?

How would you use topic maps to manage the semantic side of such a process?

September 3, 2012

Small Data (200 MB up to 10 GB) [MySQL, MapReduce and Hive by the Numbers]

Filed under: Hive,MapReduce,MySQL — Patrick Durusau @ 1:09 pm

Study Stacks MySQL, MapReduce and Hive

From the post:

Many small and medium sized businesses would like to get in on the big data game but do not have the resources to implement parallel database management systems. That being the case, which relational database management system would provide small businesses the highest performance?

This question was asked and answered by Marissa Hollingsworth of Boise State University in a graduate case study that compared the performance rates of MySQL, Hadoop MapReduce, and Hive at scales no larger than nine gigabytes.

Hollingsworth also used only relational data, such as payment information, which stands to reason since anything more would require a parallel system. “This experiment,” said Hollingsworth, “involved a payment history analysis which considers customer, account, and transaction data for predictive analytics.”

The case study, the full text of which can be found here, concluded that MapReduce would beat out MySQL and Hive for datasets larger than one gigabyte. As Hollingsworth wrote, “The results show that the single server MySQL solution performs best for trial sizes ranging from 200MB to 1GB, but does not scale well beyond that. MapReduce outperforms MySQL on data sets larger than 1GB and Hive outperforms MySQL on sets larger than 2GB.”

Although your friends may not admit it, some of them have small data. Or interact with clients with small data.

You print this post out and put it in their inbox. Anonymously. They will appreciate it even if they can’t acknowledge having seen it.

When thinking about data and data storage, you might want to keep the comparisons you will find at: How much is 1 byte, kilobyte, megabyte, gigabyte, etc.? in mind.

Roughly speaking, 1 GB is the equivalent of 4,473 books.

The 10 GB limit in this study is roughly 44,730 books.

Sometimes all you need is small data.

August 13, 2012

CDH3 update 5 is now available

Filed under: Avro,Cloudera,Flume,Hadoop,HDFS,Hive — Patrick Durusau @ 4:17 pm

CDH3 update 5 is now available by Arvind Prabhakar

From the post:

We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:

  • Flume 1.2.0 – Provides a durable file channel and many more features over the previous release.
  • Hive AvroSerDe – Replaces the Haivvreo SerDe and provides robust support for Avro data format.
  • WebHDFS – A full read/write REST API to HDFS.

Maintenance release. Installation is good practice before major releases.

August 5, 2012

More Fun with Hadoop In Action Exercises (Pig and Hive)

Filed under: Hadoop,Hive,MapReduce,Pig — Patrick Durusau @ 3:50 pm

More Fun with Hadoop In Action Exercises (Pig and Hive) by Sujit Pal.

From the post:

In my last post, I described a few Java based Hadoop Map-Reduce solutions from the Hadoop in Action (HIA) book. According to the Hadoop Fundamentals I course from Big Data University, part of being a Hadoop practitioner also includes knowing about the many tools that are part of the Hadoop ecosystem. The course briefly touches on the following four tools – Pig, Hive, Jaql and Flume.

Of these, I decided to focus (at least for the time being) on Pig and Hive (for the somewhat stupid reason that the HIA book covers these too). Both of these are high level DSLs that produce sequences of Map-Reduce jobs. Pig provides a data flow language called PigLatin, and Hive provides a SQL-like language called HiveQL. Both tools provide a REPL shell, and both can be extended with UDFs (User Defined Functions). The reason they coexist in spite of so much overlap is because they are aimed at different users – Pig appears to be aimed at the programmer types and Hive at the analyst types.

The appeal of both Pig and Hive lies in the productivity gains – writing Map-Reduce jobs by hand gives you control, but it takes time to write. Once you master Pig and/or Hive, it is much faster to generate sequences of Map-Reduce jobs. In this post, I will describe three use cases (the first of which comes from the HIA book, and the other two I dreamed up).

More useful Hadoop exercise examples.

August 3, 2012

Column Statistics in Hive

Filed under: Cloudera,Hive,Merging,Statistics — Patrick Durusau @ 2:48 pm

Column Statistics in Hive by Shreepadma Venugopalan.

From the post:

Over the last couple of months the Hive team at Cloudera has been working hard to bring a bunch of exciting new features to Hive. In this blog post, I’m going to talk about one such feature – Column Statistics in Hive – and how Hive’s query processing engine can benefit from it. The feature is currently a work in progress but we expect it to be available for review imminently.

Motivation

While there are many possible execution plans for a query, some plans are more optimal than others. The query optimizer is responsible for generating an efficient execution plan for a given SQL query from the space of all possible plans. Currently, Hive’s query optimizer uses rules of thumb to generate an efficient execution plan for a query. While such rule-of-thumb optimizations transform the query plan into a more efficient one, the resulting plan is not always the most efficient execution plan.

In contrast, the query optimizer in a traditional RDBMS is cost based; it uses the statistical properties of the input column values to estimate the cost of alternative query plans and chooses the plan with the lowest cost. The cost model for query plans assigns an estimated execution cost to the plans. The cost model is based on the CPU and I/O costs of query execution for every operator in the query plan. As an example consider a query that represents a join among {A, B, C} with the predicate {A.x == B.x == C.x}. Assume table A has a total of 500 records, table B has a total of 6000 records, table C has a total of 1000 records. In the absence of cost based query optimization, the system picks the join order specified by the user. In our example, let us further assume that the result of joining A and B yields 2000 records and the result of joining A and C yields 50 records. Hence the cost of performing the join between A, B and C, without join reordering, is the cost of joining A and B + the cost of joining the output of A Join B with C. In our example this would result in a cost of (500 * 6000) + (2000 * 1000). On the other hand, a cost based optimizer (CBO) in an RDBMS would pick the more optimal alternate order [(A Join C) Join B], thus resulting in a cost of (500 * 1000) + (50 * 6000). However, in order to pick the more optimal join order the CBO needs cardinality estimates on the join column.

Today, Hive supports statistics at the table and partition level – count of files, raw data size, count of rows etc, but doesn’t support statistics on column values. These table and partition level statistics are insufficient for the purpose of building a CBO because they don’t provide any information about the individual column values. Hence obtaining the statistical summary of the column values is the first step towards building a CBO for Hive.

In addition to join reordering, Hive’s query optimizer will be able to take advantage of column statistics to decide whether to perform a map side aggregation as well as estimate the cardinality of operators in the execution plan better.
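
Working the example’s numbers through makes the payoff concrete:

    (A Join B) then C:  (500 * 6000) + (2000 * 1000) = 3,000,000 + 2,000,000 = 5,000,000
    (A Join C) then B:  (500 * 1000) + (50 * 6000)   =   500,000 +   300,000 =   800,000

The reordered plan is roughly six times cheaper, and the only extra ingredient is a cardinality estimate on the join column.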

Some days I wonder where improvements to algorithms and data structures are going to lead?

Other days, I just enjoy the news.

Today is one of the latter.

PS: What would a cost based optimizer (CBO) look like for merging operations? Or perhaps better, a merge cost estimator (MCE)? Metered merging, anyone?

June 27, 2012

Booting HCatalog on Elastic MapReduce [periodic discovery audits?]

Filed under: Amazon Web Services AWS,HCatalog,Hive,Pig — Patrick Durusau @ 8:06 am

The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce by Russell Jurney.

From the post:

Series Introduction

This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re exploring the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

  • Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.
  • Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here: https://github.com/rjurney/enron-hcatalog.
  • Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.

Russell continues walking the Enron Emails through a full data lifecycle in the Hadoop ecosystem.

Given the current use and foreseeable use of email, these are important lessons for more than one reason.

What about periodic discovery audits on enterprise email archives?

To see what others may find, or to identify poor wording/disclosure practices?

June 21, 2012

Hortonworks Data Platform v1.0 Download Now Available

Filed under: Hadoop,HBase,HDFS,Hive,MapReduce,Oozie,Pig,Sqoop,Zookeeper — Patrick Durusau @ 3:36 pm

Hortonworks Data Platform v1.0 Download Now Available

From the post:

If you haven’t yet noticed, we have made Hortonworks Data Platform v1.0 available for download from our website. Previously, Hortonworks Data Platform was only available for evaluation for members of the Technology Preview Program or via our Virtual Sandbox (hosted on Amazon Web Services). Moving forward and effective immediately, Hortonworks Data Platform is available to the general public.

Hortonworks Data Platform is a 100% open source data management platform, built on Apache Hadoop. As we have stated on many occasions, we are absolutely committed to the Apache Hadoop community and the Apache development process. As such, all code developed by Hortonworks has been contributed back to the respective Apache projects.

Version 1.0 of Hortonworks Data Platform includes Apache Hadoop-1.0.3, the latest stable line of Hadoop as defined by the Apache Hadoop community. In addition to the core Hadoop components (including MapReduce and HDFS), we have included the latest stable releases of essential projects including HBase 0.92.1, Hive 0.9.0, Pig 0.9.2, Sqoop 1.4.1, Oozie 3.1.3 and Zookeeper 3.3.4. All of the components have been tested and certified to work together. We have also added tools that simplify the installation and configuration steps in order to improve the experience of getting started with Apache Hadoop.

I’m a member of the general public! And you probably are too! 😉

See the rest of the post for more goodies that are included with this release.

June 5, 2012

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE

Filed under: Avro,Hive,Pig — Patrick Durusau @ 7:58 pm

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE by Russell Jurney.

From the post:

Series Introduction

This is part two of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

Part one of this series is available here.

Code examples for this post are available here: https://github.com/rjurney/enron-hive.

In the last post, we used Pig to Extract-Transform-Load a MySQL database of the Enron emails to document format and serialize them in Avro. Now that we’ve done this, we’re ready to get to the business of data science: extracting new and interesting properties from our data for consumption by analysts and users. We’re also going to use Amazon EC2, as HIVE local mode requires Hadoop local mode, which can be tricky to get working.

Continues the high standard set in part one for walking through an entire data lifecycle in the Hadoop ecosystem.

May 5, 2012

Announcing Apache Hive 0.9.0

Filed under: Hive,NoSQL — Patrick Durusau @ 6:55 pm

Announcing Apache Hive 0.9.0 by Carl Steinbach.

From the post:

This past Monday marked the official release of Apache Hive 0.9.0. Users interested in taking this release of Hive for a spin can download a copy from the Apache archive site. The following post is a quick summary of new features and improvements users can expect to find in this update of the popular data warehousing system for Hadoop.

The 0.9.0 release continues the trend of extending Hive’s SQL support. Hive now understands the BETWEEN operator and the NULL-safe equality operator, plus several new user defined functions (UDF) have now been added. New UDFs include printf(), sort_array(), and java_method(). Also, the concat_ws() function has been modified to support input parameters consisting of arrays of strings.

This Hive release also includes several significant improvements to the query compiler and execution engine. HIVE-2642 improved Hive’s ability to optimize UNION queries, HIVE-2881 made the map-side JOIN algorithm more efficient, and Hive’s ability to generate optimized execution plans for queries that contain multiple GROUP BY clauses was significantly improved in HIVE-2621.

The database world just keeps getting better!
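
A quick, hedged taste of the new surface area (the table and columns are invented):

    -- BETWEEN and the NULL-safe equality operator (<=>), new in 0.9.0
    SELECT id, printf('%s scored %d', player, score) AS summary
    FROM game_results
    WHERE score BETWEEN 80 AND 100
      AND team <=> opponent_team;   -- true even when both sides are NULL

    -- sort_array() and concat_ws() over an array of strings
    SELECT concat_ws(',', sort_array(tags)) FROM game_results;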

April 4, 2012

Apache Bigtop 0.3.0 (incubating) has been released

Filed under: Bigtop,Flume,Hadoop,HBase,Hive,Mahout,Oozie,Sqoop,Zookeeper — Patrick Durusau @ 2:33 pm

Apache Bigtop 0.3.0 (incubating) has been released by Roman Shaposhnik.

From the post:

Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:

  • Apache Hadoop 1.0.1
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Thoughts on what is missing from this ecosystem?

What if you moved from the company where you wrote the scripts? And they needed new scripts?

Re-write? On what basis?

Is your “big data” big enough to need “big documentation?”

