Archive for the ‘Hive’ Category

Operationalizing a Hadoop Eco-System

Monday, March 2nd, 2015

(Part 1: Installing & Configuring a 3-node Cluster) by Louis Frolio.

From the post:

The objective of DataTechBlog is to bring the many facets of data, data tools, and the theory of data to those curious about data science and big data. The relationship between these disciplines and data can be complex. However, if careful consideration is given to a tutorial, it is a practical expectation that the layman can be brought online quickly. With that said, I am extremely excited to bring this tutorial on the Hadoop Eco-system.

Hadoop & MapReduce (at a high level) are not complicated ideas. Basically, you take a large volume of data and spread it across many servers (HDFS). Once at rest, the data can be acted upon by the many CPUs in the cluster (MapReduce). What makes this so cool is that the traditional approach to processing data (bring data to CPU) is flipped. With MapReduce, CPU is brought to the data. This “divide-and-conquer” approach makes Hadoop and MapReduce indispensable when processing massive volumes of data.

In part 1 of this multi-part series, I am going to demonstrate how to install, configure and run a 3-node Hadoop cluster. Finally, at the end I will run a simple MapReduce job to perform a unique word count of Shakespeare’s Hamlet. Future installments of this series will include topics such as: 1. Creating an advanced word count with MapReduce, 2. Installing and running Hive, 3. Installing and running Pig, 4. Using Sqoop to extract and import structured data into HDFS. The goal is to illuminate all the popular and useful tools that support Hadoop.
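For readers who want the shape of a word count before installing anything, here is a minimal single-process sketch in Python. It is not code from Louis's tutorial and does not use Hadoop; the function names and the two sample text chunks are made up for illustration. It only imitates the map, shuffle, and reduce steps that a real job would distribute across nodes.

```python
from collections import defaultdict
import re


def map_phase(chunk):
    """Emit (word, 1) pairs for one chunk of text, as a mapper would."""
    return [(word, 1) for word in re.findall(r"[a-z']+", chunk.lower())]


def shuffle(pairs):
    """Group emitted values by key, standing in for the shuffle/sort step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts for each word, as a reducer would."""
    return {word: sum(values) for word, values in grouped.items()}


def word_count(chunks):
    """Map every chunk independently, then shuffle and reduce the results."""
    pairs = []
    for chunk in chunks:  # on a cluster, each chunk could sit on a different node
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


# Two hypothetical chunks of text, standing in for HDFS blocks.
chunks = ["To be, or not to be", "that is the question"]
counts = word_count(chunks)
```

Because `map_phase` looks at one chunk at a time, the work can run wherever each chunk is stored, which is the "CPU is brought to the data" idea in miniature.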

Operationalizing a Hadoop Eco-System (Part 2: Customizing Map Reduce)

Operationalizing a Hadoop Eco-System (Part 3: Installing and using Hive)

Be forewarned that Louis suggests hosting three Linux VMs on a fairly robust machine. He worked on a Windows 7 x64 machine with 1 TB of storage and 24 GB of RAM. (How much of that was used by Windows and Office he doesn’t say. 😉 )

The last post in this series was in April 2014 so you may have to look elsewhere for tutorials on Pig and Sqoop.


Download the Hive-on-Spark Beta

Wednesday, February 25th, 2015

Download the Hive-on-Spark Beta by Xuefu Zhang.

From the post:

The Hive-on-Spark project (HIVE-7292) is one of the most watched projects in Apache Hive history. It has attracted developers from across the ecosystem, including from organizations such as Intel, MapR, IBM, and Cloudera, and gained critical help from the Spark community.

Many anxious users have inquired about its availability in the last few months. Some users even built Hive-on-Spark from the branch code and tried it in their testing environments, and then provided us valuable feedback. The team is thrilled to see this level of excitement and early adoption, and has been working around the clock to deliver the product at an accelerated pace.

Thanks to this hard work, significant progress has been made in the last six months. (The project is currently incubating in Cloudera Labs.) All major functionality is now in place, including different flavors of joins and integration with Spark, HiveServer2, and YARN, and the team has made initial but important investments in performance optimization, including split generation and grouping, supporting vectorization and cost-based optimization, and more. We are currently focused on running benchmarks, identifying and prototyping optimization areas such as dynamic partition pruning and table caching, and creating a roadmap for further performance enhancements for the near future.

Two months ago, we announced the availability of an Amazon Machine Image (AMI) for a hands-on experience. Today, we even more proudly present you a Hive-on-Spark beta via CDH parcel. You can download that parcel here. (Please note that in this beta release only HDFS, YARN, Apache ZooKeeper, and Hive are supported. Other components, such as Apache Pig, Apache Oozie, and Impala, might not work as expected.) The “Getting Started” guide will help you get your Hive queries up and running on the Spark engine without much trouble.

We welcome your feedback. For assistance, please use or the Cloudera Labs discussion board.

We will update you again when GA is available. Stay tuned!

If you are snowbound this week, this may be what you have been looking for!

I have listed this under both Hive and Spark separately but am confident enough of its success that I created Hive-on-Spark as well.


Announcing Apache Hive 0.14

Monday, November 24th, 2014

Announcing Apache Hive 0.14 by Gunther Hagleitner.

From the post:

While YARN has allowed new engines to emerge for Hadoop, the most popular integration point with Hadoop continues to be SQL and Apache Hive is still the de facto standard. Although many SQL engines for Hadoop have emerged, their differentiation is being rendered obsolete as the open source community surrounds and advances this key engine at an accelerated rate.

Last week, the Apache Hive community released Apache Hive 0.14, which includes the results of the first phase in the initiative and takes Hive beyond its read-only roots and extends it with ACID transactions. Thirty developers collaborated on this version and resolved more than 1,015 JIRA issues.

Although there are many new features in Hive 0.14, there are a few highlights we’d like to highlight. For the complete list of features, improvements, and bug fixes, see the release notes.

If you have been watching the work on Spark + Hive: Apache Hive on Apache Spark: The First Demo, then you know how important Hive is to the Hadoop ecosystem.

The highlights:

Transactions with ACID semantics (HIVE-5317)

Allows users to modify data using insert, update and delete SQL statements. This provides snapshot isolation and uses locking for writes. Now users can make corrections to fact tables and changes to dimension tables.

Cost-Based Optimizer (CBO) (HIVE-5775)

Now the query compiler uses a more sophisticated cost-based optimizer that generates query plans based on statistics on data distribution. This works really well with complex joins and joins with multiple large fact tables. The CBO generates bushy plans that execute much faster.
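To see why plan shape matters when several large fact tables are involved, here is a toy cost comparison in Python. It is emphatically not Hive's optimizer or its cost model: the table names, row counts, and join statistics are invented, and "cost" here is simply the sum of estimated intermediate result sizes. A left-deep plan always joins the running result with one more base table; a bushy plan lets both sides of a join be joins themselves.

```python
from itertools import product

# Hypothetical table sizes: two large fact tables, two filtered dimensions.
ROWS = {"F1": 1_000_000, "D1": 100, "F2": 1_000_000, "D2": 100}

# Hypothetical statistics: one matching pair per this many candidate pairs.
JOIN_DIVISOR = {
    frozenset({"F1", "D1"}): 10_000,
    frozenset({"F2", "D2"}): 10_000,
    frozenset({"F1", "F2"}): 100_000,
}


def estimate(plan):
    """Return (tables, estimated_rows, cost) for a nested-tuple join plan."""
    if isinstance(plan, str):
        return {plan}, ROWS[plan], 0  # a base table has no join cost
    left_tables, left_rows, left_cost = estimate(plan[0])
    right_tables, right_rows, right_cost = estimate(plan[1])
    rows = left_rows * right_rows
    # Apply every join predicate that connects the two sides.
    for a, b in product(left_tables, right_tables):
        rows //= JOIN_DIVISOR.get(frozenset({a, b}), 1)
    # Cost is the total size of all intermediate results produced so far.
    return left_tables | right_tables, rows, left_cost + right_cost + rows


left_deep = ((("F1", "D1"), "F2"), "D2")
bushy = (("F1", "D1"), ("F2", "D2"))

_, left_deep_rows, left_deep_cost = estimate(left_deep)
_, bushy_rows, bushy_cost = estimate(bushy)
```

With these made-up numbers both plans return the same estimated row count, but the bushy plan filters each fact table against its dimension before the fact-to-fact join, so its intermediate results are much smaller.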

SQL Temporary Tables (HIVE-7090)

Temporary tables exist in scratch space that goes away when the user session disconnects. This allows users and BI tools to store temporary results and further process that data with multiple queries.
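The session-scoped behavior is easy to try out locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in, not Hive; SQLite's temporary tables are likewise tied to the connection that created them. The `sales` table, its rows, and the `region_totals` name are all hypothetical.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

# First session: build a temporary result and query it more than once.
session = sqlite3.connect(db_path)
session.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
session.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 5), ("west", 7)],
)
session.execute(
    "CREATE TEMPORARY TABLE region_totals AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)
top = session.execute(
    "SELECT region, total FROM region_totals ORDER BY total DESC LIMIT 1"
).fetchone()
count = session.execute("SELECT COUNT(*) FROM region_totals").fetchone()[0]
session.commit()
session.close()

# Second session: the permanent table is still there, the temporary one is not.
new_session = sqlite3.connect(db_path)
sales_rows = new_session.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
try:
    new_session.execute("SELECT * FROM region_totals")
    temp_table_survived = True
except sqlite3.OperationalError:
    temp_table_survived = False
new_session.close()
```

The pattern is the one described above: store an intermediate result once, run several follow-up queries against it, and let it disappear when the session ends.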

Coming Next in Sub-Second Queries

After Hive 0.14, we’re planning on working with the community to deliver sub-second queries and SQL:2011 Analytics coverage in Hive. We also plan to work on Hive-Spark integration for machine learning and operational reporting with Hive streaming ingest and transactions.

Hive is an example of how an open source project should be supported.

Apache Hive on Apache Spark: The First Demo

Friday, November 21st, 2014

Apache Hive on Apache Spark: The First Demo by Brock Noland.

From the post:

Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.

Two months ago Cloudera, Databricks, IBM, Intel, MapR, and others came together to port Apache Hive and the other batch processing engines to Spark. In October at Strata + Hadoop World New York, the Hive on Spark project lead Xuefu Zhang shared the project status and provided a demo of our work. The same week at the Bay Area Hadoop User Group, Szehon Ho discussed the project and demo’ed the work completed. Additionally, Xuefu and Suhas Satish will be speaking about Hive on Spark at the Bay Area Hive User Group on Dec. 3.

The community has committed more than 140 changes to the Spark branch as part of HIVE-7292 – Hive on Spark. We are proud to say that queries are now functionally able to run, as you can see in the demo below of a multi-node Hive-on-Spark query (query 28 from TPC-DS with a scale factor of 20 on a TPC-DS derived dataset).

After seeing the demo, you will want to move Spark up on your technologies-to-master list!

Avoiding “Hive” Confusion

Thursday, October 23rd, 2014

Depending on your community, when you hear “Hive,” you think “Apache Hive:”

The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

But, there is another “Hive,” which handles large datasets:

High-performance Integrated Virtual Environment (HIVE) is a specialized platform being developed/implemented by Dr. Simonyan’s group at FDA and Dr. Mazumder’s group at GWU where the storage library and computational powerhouse are linked seamlessly. This environment provides web access for authorized users to deposit, retrieve, annotate and compute on HTS data and analyze the outcomes using web-interface visual environments appropriately built in collaboration with research scientists and regulatory personnel.

I ran across this potential source of confusion earlier today and haven’t run it completely to ground but wanted to share some of what I have found so far.

Inside the HIVE, the FDA’s Multi-Omics Compute Architecture by Aaron Krol.

From the post:

“HIVE is not just a conventional virtual cloud environment,” says Simonyan. “It’s a different system that virtualizes the services.” Most cloud systems store data on multiple servers or compute units until users want to run a specific application. At that point, the relevant data is moved to a server that acts as a node for that computation. By contrast, HIVE recognizes which storage nodes contain data selected for analysis, then transfers executable code to those nodes, a relatively small task that allows computation to be performed wherever the data is stored. “We make the computations on exactly the machines where the data is,” says Simonyan. “So we’re not moving the data to the computational unit, we are moving computation to the data.”

When working with very large packets of data, cloud computing environments can sometimes spend more time on data transfer than on running code, making this “virtualized services” model much more efficient. To function, however, it relies on granular and readily-accessed metadata, so that searching for and collecting together relevant data doesn’t consume large quantities of compute time.

HIVE’s solution is the honeycomb data model, which stores raw NGS data and metadata together on the same network. The metadata — information like the sample, experiment, and run conditions that produced a set of NGS reads — is stored in its own tables that can be extended with as many values as users need to record. “The honeycomb data model allows you to put the entire database schema, regardless of how complex it is, into a single table,” says Simonyan. The metadata can then be searched through an object-oriented API that treats all data, regardless of type, the same way when executing search queries. The aim of the honeycomb model is to make it easy for users to add new data types and metadata fields, without compromising search and retrieval.
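I haven't seen the honeycomb implementation, so take the following only as a guess at the flavor of "the entire schema in a single table." It is a generic entity-attribute-value sketch in Python, not HIVE's actual data model or API; the object ids, field names, and the `find_objects` helper are all invented for illustration.

```python
from collections import defaultdict

# Each row is (object_id, field, value); adding a new field needs no schema change.
rows = [
    ("run-1", "type", "ngs_run"),
    ("run-1", "sample", "S-001"),
    ("run-1", "read_length", 150),
    ("img-7", "type", "confocal_image"),
    ("img-7", "sample", "S-001"),
]


def find_objects(rows, **criteria):
    """Return sorted ids of objects whose fields match every criterion."""
    objects = defaultdict(dict)
    for object_id, field, value in rows:
        objects[object_id][field] = value
    return sorted(
        object_id
        for object_id, fields in objects.items()
        if all(fields.get(name) == wanted for name, wanted in criteria.items())
    )


same_sample = find_objects(rows, sample="S-001")
runs_only = find_objects(rows, sample="S-001", type="ngs_run")
```

The same search function handles a sequencing run and a microscopy image, which is the property the quote emphasizes: all data, regardless of type, is treated the same way when searching.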

That is a popular consumption piece, so next you may want to visit the HIVE site proper.

From the webpage:

HIVE is a cloud-based environment optimized for the storage and analysis of extra-large data, like Next Generation Sequencing data, Mass Spectroscopy files, Confocal Microscopy Images and others.

HIVE uses a variety of advanced scientific and computational visualization graphics. To get the MOST from your HIVE experience you must use a supported browser. These include Internet Explorer 8.0 or higher (Internet Explorer 9.0 is recommended), Google Chrome, Mozilla Firefox and Safari.

A few exemplary analytical outputs are displayed below for your enjoyment. But before you can take advantage of all that HIVE has to offer and create these objects for yourself, you’ll need to register.

With A framework for organizing cancer-related variations from existing databases, publications and NGS data using a High-performance Integrated Virtual Environment (HIVE) by Tsung-Jung Wu, et al., you are starting to approach the computational issues of interest for data integration.

From the article:

The aforementioned cooperation is difficult because genomics data are large, varied, heterogeneous and widely distributed. Extracting and converting these data into relevant information and comparing results across studies have become an impediment for personalized genomics (11). Additionally, because of the various computational bottlenecks associated with the size and complexity of NGS data, there is an urgent need in the industry for methods to store, analyze, compute and curate genomics data. There is also a need to integrate analysis results from large projects and individual publications with small-scale studies, so that one can compare and contrast results from various studies to evaluate claims about biomarkers.

See also: High-performance Integrated Virtual Environment (Wikipedia) for more leads to the literature.

Heterogeneous data is still at large and people are building solutions. Rather than either/or, what do you think topic maps could bring as a value-add to this project?

I first saw this in a tweet by ChemConnector.

HDP 2.1 Tutorials

Wednesday, August 13th, 2014

HDP 2.1 tutorials from Hortonworks:

  1. Securing your Data Lake Resource & Auditing User Access with HDP Security
  2. Searching Data with Apache Solr
  3. Define and Process Data Pipelines in Hadoop with Apache Falcon
  4. Interactive Query for Hadoop with Apache Hive on Apache Tez
  5. Processing streaming data in Hadoop with Apache Storm
  6. Securing your Hadoop Infrastructure with Apache Knox

The quality you have come to expect from Hortonworks tutorials but the data sets are a bit dull.

What data sets would you suggest to spice up these tutorials?

Hello World! – Hadoop, Hive, Pig

Tuesday, July 29th, 2014

Hello World! – An introduction to Hadoop with Hive and Pig

A set of tutorials to be run on Sandbox v2.0.

From the post:

This Hadoop tutorial is from the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download to run this and other tutorials in the series. The tutorials presented here are for Sandbox v2.0

The tutorials are presented in sections as listed below.

Maybe I have seen too many “Hello World!” examples but I was expecting the tutorials to go through the use of Hadoop, HCatalog, Hive and Pig to say “Hello World!”

You can imagine my disappointment when that wasn’t the case. 😉

A lot of work to say “Hello World!” but on the other hand, tradition is tradition.

Analyzing 1.2 Million Network Packets…

Sunday, June 15th, 2014

Analyzing 1.2 Million Network Packets per Second in Real Time by James Sirota and Sheetal Dolas.

Slides giving an overview of OpenSOC (Open Security Operations Center).

I mention this in case you are not the NSA and simply streaming the backbone of the Internet to storage for later analysis. Some business cases require real time results.

The project is also a good demonstration of building a high throughput system using only open source software.

Not to mention a useful collaboration between Cisco and Hortonworks.

BTW, take a look at slide 18. I would say they are adding information to the representative of a subject, wouldn’t you? While on the surface this looks easy, merging that data with other data, say held by local law enforcement, might not be so easy.

For example, depending on where you are intercepting traffic, you will be told I am about thirty (30) miles from my present physical location or some other answer. 😉 Now, if someone had annotated an earlier packet with that information and it was accessible to you, well, your targeting of my location could be a good deal more precise.

And there is the question of using data annotated by different sources who may have been attacked by the same person or group.

Even at 1.2 million packets per second there is still a role for subject identity and merging.

Yahoo Betting on Apache Hive, Tez, and YARN

Sunday, May 18th, 2014

Yahoo Betting on Apache Hive, Tez, and YARN

With the usual caveats about test results:

On the other hand, Hive 0.13 query execution times were not only significantly better at higher volumes of data (Fig 3 and 4) but also executed successfully without failing. In our comparisons and observations with Shark, we saw most queries fail with the larger (10TB) dataset. These same queries ran successfully and much faster on Hive 0.13, allowing for better scale. This was extremely critical for us, as we needed a single query and BI solution on the Hadoop grid regardless of dataset size. The Hive solution resonates with our users, as they do not have to worry about learning multiple technologies and discerning which solution to use when. A common solution also results in cost and operational efficiencies from having to build, deploy, and maintain a single solution.

Successful 10TB query times and results should be enough to get your attention. Not that many of us have data in that range, today, but tomorrow, who can say?


I first saw this in a tweet by Joshua Lande.

Hive 0.13 and Stinger!

Monday, April 21st, 2014

Announcing Apache Hive 0.13 and Completion of the Stinger Initiative! by Harish Butani.

From the post:

The Apache Hive community has voted on and released version 0.13 today. This is a significant release that represents a major effort from over 70 members who worked diligently to close out over 1080 JIRA tickets.

Hive 0.13 also delivers the third and final phase of the Stinger Initiative, a broad community based initiative to drive the future of Apache Hive, delivering 100x performance improvements at petabyte scale with familiar SQL semantics. These improvements extend Hive beyond its traditional roots and bring true interactive SQL query to Hadoop.

Ultimately, over 145 developers representing 44 companies, from across the Apache Hive community contributed over 390,000 lines of code to the project in just 13 months, nearly doubling the Hive code base.

The three phases of this important project spanned Hive versions 0.11, 0.12 and 0.13. Additionally, the Apache Hive team coordinated this 0.13 release with the simultaneous release of Apache Tez 0.4. Tez’s DAG execution speeds Hive queries run on Tez.

Hive 0.13

Kudos to one and all!

Open source work at its very best!

Cloudera Live (beta)

Thursday, April 17th, 2014

Cloudera Live (beta)

From the webpage:

Try a live demo of Hadoop, right now.

Cloudera Live is a new way to get started with Apache Hadoop, online. No downloads, no installations, no waiting. Watch tutorial videos and work with real-world examples of the complete Hadoop stack included with CDH, Cloudera’s completely open source Hadoop platform, to:

  • Learn Hue, the Hadoop User Interface developed by Cloudera
  • Query data using popular projects like Apache Hive, Apache Pig, Impala, Apache Solr, and Apache Spark (new!)
  • Develop workflows using Apache Oozie

Great news for people interested in Hadoop!

Question: Will this become the default delivery model for test driving software and training?


Hortonworks Data Platform 2.1

Wednesday, April 2nd, 2014

Hortonworks Data Platform 2.1 by Jim Walker.

From the post:

The pace of innovation within the Apache Hadoop community is truly remarkable, enabling us to announce the availability of Hortonworks Data Platform 2.1, incorporating the very latest innovations from the Hadoop community in an integrated, tested, and completely open enterprise data platform.

A VM available now, full releases to follow later in April.

Just grabbing the headings from Jim’s post:

The Stinger Initiative: Apache Hive, Tez and YARN for Interactive Query

Data Governance with Apache Falcon

Security with Apache Knox

Stream Processing with Apache Storm

Searching Hadoop Data with Apache Solr

Advanced Operations with Apache Ambari

See Jim’s post for some of the details and the VM for others.

Use Parquet with Impala, Hive, Pig, and MapReduce

Saturday, March 22nd, 2014

How-to: Use Parquet with Impala, Hive, Pig, and MapReduce by John Russell.

From the post:

The CDH software stack lets you use your tool of choice with the Parquet file format, offering the benefits of columnar storage at each phase of data processing.

An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:

  • Columnar storage layout: A query can examine and perform calculations on all values for a column while reading only a small fraction of the data from a data file or table.
  • Flexible compression options: The data can be compressed with any of several codecs. Different data files can be compressed differently. The compression is transparent to applications that read the data files.
  • Innovative encoding schemes: Sequences of identical, similar, or related data values can be represented in ways that save disk space and memory, yet require little effort to decode. The encoding schemes provide an extra level of space savings beyond the overall compression for each data file.
  • Large file size: The layout of Parquet data files is optimized for queries that process large volumes of data, with individual files in the multi-megabyte or even gigabyte range.
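The first and third bullets can be illustrated in a few lines of Python. This is only a sketch of the ideas, not Parquet's on-disk format or any Parquet library API; the sample rows and helper names are hypothetical, and run-length encoding is used here as one simple example of an encoding scheme.

```python
from itertools import groupby

# Hypothetical rows as a row-oriented store would hold them.
rows = [
    {"country": "US", "year": 2013, "sales": 10},
    {"country": "US", "year": 2013, "sales": 12},
    {"country": "US", "year": 2014, "sales": 9},
    {"country": "DE", "year": 2014, "sales": 7},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of identical values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total_sales = sum(columns["sales"])

# Repeated neighboring values shrink to a handful of pairs.
encoded_country = run_length_encode(columns["country"])
```

Storing values column by column is also what makes the encoding pay off: similar values end up next to each other, so runs are long.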

Impala can create Parquet tables, insert data into them, convert data from other file formats to Parquet, and then perform SQL queries on the resulting data files. Parquet tables created by Impala can be accessed by Apache Hive, and vice versa.

That said, the CDH software stack lets you use the tool of your choice with the Parquet file format, for each phase of data processing. For example, you can read and write Parquet files using Apache Pig and MapReduce jobs. You can convert, transform, and query Parquet tables through Impala and Hive. And you can interchange data files between all of those components — including ones external to CDH, such as Cascading and Apache Tajo.

In this blog post, you will learn the most important principles involved.

Since I mentioned ROOT files yesterday, I am curious what you make of the use of Thrift metadata definitions to read Parquet files?

It’s great that data can be documented for reading, but reading doesn’t imply to me that its semantics have been captured.

A wide variety of products read data; it is less certain that they can document data semantics.


I first saw this in a tweet by Patrick Hunt.

Merge Mahout item based recommendations…

Saturday, March 8th, 2014

Merge Mahout item based recommendations results from different algorithms

From the post:

Apache Mahout is a machine learning library that leverages the power of Hadoop to implement machine learning through the MapReduce paradigm. One of the implemented algorithms is collaborative filtering, the most successful recommendation technique to date. The basic idea behind collaborative filtering is to analyze the actions or opinions of users to recommend items similar to the one the user is interacting with.
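For a feel of what "items similar to the one the user is interacting with" means in practice, here is a small item-based similarity sketch in plain Python. It does not use Mahout or MapReduce; the ratings data and function names are invented, and cosine similarity is just one of several measures that could be plugged in.

```python
from math import sqrt

# Hypothetical user -> {item: rating} data.
ratings = {
    "ann": {"hive": 5, "pig": 4},
    "bob": {"hive": 4, "pig": 5, "spark": 2},
    "cat": {"hive": 1, "spark": 5},
}


def item_vectors(ratings):
    """Invert user ratings into item -> {user: rating} vectors."""
    vectors = {}
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    dot = sum(a[user] * b[user] for user in shared)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(item, ratings):
    """Return the other item whose rating vector is closest to `item`."""
    vectors = item_vectors(ratings)
    scores = {
        other: cosine(vectors[item], vector)
        for other, vector in vectors.items()
        if other != item
    }
    return max(scores, key=scores.get)


closest_to_hive = most_similar("hive", ratings)
```

Swapping `cosine` for another measure changes which items count as "similar," which is why results from different algorithms can differ and need merging in the first place.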

Similarity isn’t restricted to a particular measure or metric.

How similar is enough to be considered the same?

That is a question topic map designers must answer on a case by case basis.

Tutorial 1: Hello World… [Hadoop/Hive/Pig]

Monday, January 27th, 2014

Tutorial 1: Hello World – An Overview of Hadoop with Hive and Pig

Don’t be frightened!

The tutorial really doesn’t use big data tools to quickly say “Hello World” or to even say it quickly, many times. 😉

One of the clearer tutorials on big data tools.

You won’t quite be dangerous by the time you finish this tutorial but you should have a strong enough taste of the tools to want more.


Enron, Email, Kiji, Hive, YARN, Tez (Jan. 7th, DC)

Monday, January 6th, 2014

Exploring Enron Email Dataset with Kiji and Hive; Apache YARN and Apache Tez Hadoop-DC.

Tuesday, January 7, 2014 6:00 PM to 9:30 PM
Neustar (Room: Neuview) 21575 Ridgetop Circle, Sterling, VA

From the webpage:

Exploring Enron Email Dataset with Kiji and Hive

Lee Sheng, WibiData

Apache Hive is a data warehousing system for large volumes of data stored in Hadoop that provides SQL based access for exploring datasets. KijiSchema provides evolvable schemas of primitive and compound types on top of HBase. The integration between these provides the best aspects of both worlds (ad hoc SQL based querying on top of datasets using evolvable schemas containing complex objects). This talk will present examples of queries utilizing this integration to do exploratory analysis of the Enron email corpus. Delving into topics such as email responder pairs and sentiment analysis can expose many of the interesting points in the rise and fall of Enron.

Apache YARN & Apache Tez

Tom McCuch Technical Director, Hortonworks

Apache Hadoop has become synonymous with Big Data and powers large scale data processing across some of the biggest companies in the world. Hadoop 2 is the next generation release of Hadoop and marks a pivotal point in its maturity with YARN – the new Hadoop compute framework. YARN – Yet Another Resource Negotiator – is a complete re-architecture of the Hadoop compute stack with a clean separation between platform and application. This opens up Hadoop data processing to new applications that can be executed IN Hadoop instead of outside Hadoop, thus improving efficiency, performance, data sharing and lowering operation costs. The Big Data ecosystem is already converging on YARN with new applications like Apache Tez being written specifically for YARN. Apache Tez aims to provide high performance and efficiency out of the box, across the spectrum of low latency queries and heavy-weight batch processing. The talk will provide a brief overview of key Hadoop 2 innovations, focusing in on YARN and Tez – covering architecture, motivational use cases and future roadmap. Finally, the impact of YARN on the Hadoop community will be demonstrated through running interactive queries with both Hive on Tez and with Hive on MapReduce, and comparing their performance side-by-side on the same Hadoop 2 cluster.

When I saw the low tomorrow in DC is going to be 16F and the high 21F, I thought I should pass this along.

Does anyone have a very large set of phone metadata that is public?

Thinking rather than grinding over Enron’s stumbles, again, phone metadata could be hands-on training for a variety of careers. 😉

Looking forward to seeing videos of these presentations!

Impala v Hive

Sunday, December 22nd, 2013

Impala v Hive by Mike Olson.

From the post:

We introduced Cloudera Impala more than a year ago. It was a good launch for us — it made our platform better in ways that mattered to our customers, and it’s allowed us to win business that was previously unavailable because earlier products simply couldn’t tackle interactive SQL workloads.

As a side effect, though, that launch ignited fierce competition among vendors for SQL market share in the Apache Hadoop ecosystem, with claims and counter-claims flying. Chest-beating on performance abounds (and we like our numbers pretty well), but I want to approach the matter from a different direction here.

I get asked all the time about Cloudera’s decision to develop Impala from the ground up as a new project, rather than improving the existing Apache Hive project. If there’s existing code, the thinking goes, surely it’s best to start there — right?

Well, no. We thought long and hard about it, and we concluded that the best thing to do was to create a new open source project, designed on different principles from Hive. Impala is that system. Our experiences over the last year increase our conviction on that strategy.

Let me walk you through our thinking.

Mike makes a very good argument for building Impala.

Whether you agree with it or not, it centers on requirements and users.

I won’t preempt his argument here but suffice it to say that Cloudera saw the need for robust SQL support over Hadoop data stores and estimated user demand for a language like SQL versus a newer language like Pig.

Personally I found it refreshing for someone to explicitly consider user habits as opposed to a “…users need to learn the right way (my way) to query/store/annotate data…” type approach.

You know the outcome, now go read the reasons Cloudera made the decisions it did.

Cheat Sheet: Hive for SQL Users

Sunday, December 15th, 2013

Cheat Sheet: Hive for SQL Users

What looks like a very useful quick reference to have on or near your desk.

I say “looks like” because so far I haven’t found a way to capture the file for printing.

On Hortonworks (link above), it displays in a slideshare-like window. You can scroll, mail it to others, etc., but no save-file.

I searched for the title and found another copy at Slideshare.

If you are guessing that means I can save it to my Slideshare folder, right in one. That still doesn’t get it to my local machine.

It does have all the major social networks listed for you to share/embed the slides.

But why would I want to propagate this sort of annoyance?

Better that I ask readers of this blog to ping Hortonworks and ask that the no-download approach in:

not be repeated. (Politely. Hortonworks has done an enormous amount of work on the Hadoop ecosystem, all on its own dime. This is probably just poor judgment on the part of a non-techie in a business office somewhere.)

Using Hive to interact with HBase, Part 2

Tuesday, December 3rd, 2013

Using Hive to interact with HBase, Part 2 by Nick Dimiduk.

From the post:

This is the second of two posts examining the use of Hive for interaction with HBase tables. This is a hands-on exploration so the first post isn’t required reading for consuming this one. Still, it might be good context.

“Nick!” you exclaim, “that first post had too many words and I don’t care about JIRA tickets. Show me how I use this thing!”

This post is exactly that: a concrete, end-to-end example of consuming HBase over Hive. The whole mess was tested to work on a tiny little 5-node cluster running HDP-1.3.2, which means Hive 0.11.0 and HBase

If you learn from concrete examples and then feel your way further out, you will love this post!

Using Hive to interact with HBase, Part 1

Monday, November 11th, 2013

Using Hive to interact with HBase, Part 1 by Nick Dimiduk.

From the post:

This is the first of two posts examining the use of Hive for interaction with HBase tables. Check back later in the week for the concluding article.

One of the things I’m frequently asked about is how to use HBase from Apache Hive. Not just how to do it, but what works, how well it works, and how to make good use of it. I’ve done a bit of research in this area, so hopefully this will be useful to someone besides myself. This is a topic that we did not get to cover in HBase in Action, perhaps these notes will become the basis for the 2nd edition 😉 These notes are applicable to Hive 0.11.x used in conjunction with HBase 0.94.x. They should be largely applicable to 0.12.x + 0.96.x, though I haven’t tested everything yet.

The Hive project includes an optional library for interacting with HBase. This is where the bridge layer between the two systems is implemented. The primary interface you use when accessing HBase from Hive queries is called the HBaseStorageHandler. You can also interact with HBase tables directly via Input and Output formats, but the handler is simpler and works for most uses.
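To give a rough idea of what using the storage handler looks like, the Python helper below renders the kind of Hive DDL that maps Hive columns onto an existing HBase table. The helper itself, the table names, and the column family are hypothetical and not taken from Nick's post; the `STORED BY` class name and the `hbase.columns.mapping` and `hbase.table.name` properties are the ones documented for Hive's HBase integration. Check the DDL against your own Hive and HBase versions before relying on it.

```python
def hbase_table_ddl(hive_table, hbase_table, key_column, columns):
    """Render Hive DDL that maps Hive columns onto an existing HBase table.

    key_column is (hive_name, hive_type) for the HBase row key.
    columns maps hive_name -> (hive_type, "family:qualifier").
    """
    key_name, key_type = key_column
    hive_columns = [f"{key_name} {key_type}"]
    mapping = [":key"]  # the row key always maps to the special :key entry
    for name, (hive_type, qualifier) in columns.items():
        hive_columns.append(f"{name} {hive_type}")
        mapping.append(qualifier)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({', '.join(hive_columns)})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{','.join(mapping)}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


# Hypothetical page-view table with a single column family named "f".
ddl = hbase_table_ddl(
    "pagecounts_hbase",
    "pagecounts",
    ("rowkey", "STRING"),
    {"views": ("BIGINT", "f:views"), "bytes": ("BIGINT", "f:bytes")},
)
```

The entries in the mapping string line up positionally with the Hive columns, so the helper builds both lists in the same loop to keep them in step.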

If you want to be on the edge of Hive/HBase interaction, start here.

Be forewarned that you are entering a place of folklore, JIRA issues, etc., but you will be ahead of the less brave.
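
As a rough illustration of the storage handler mentioned in the excerpt, here is a minimal Python sketch that builds the HiveQL statement used to map a Hive table onto an existing HBase table. The helper name `hbase_table_ddl`, the table names, and the column layout are all hypothetical; only the handler class name and the `hbase.columns.mapping` / `hbase.table.name` properties come from Hive's HBase integration. Treat it as a sketch to adapt, not as the method used in Nick's posts.

```python
# Fully qualified class name of Hive's HBase storage handler.
HBASE_HANDLER = "org.apache.hadoop.hive.hbase.HBaseStorageHandler"


def hbase_table_ddl(hive_table, hbase_table, columns):
    """Build a CREATE EXTERNAL TABLE statement for an existing HBase table.

    columns is a list of (hive_column, hive_type, hbase_mapping) tuples,
    where hbase_mapping is ':key' for the row key or 'family:qualifier'.
    This helper is hypothetical; it only assembles a HiveQL string.
    """
    col_defs = ", ".join(f"{name} {htype}" for name, htype, _ in columns)
    mapping = ",".join(m for _, _, m in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({col_defs})\n"
        f"STORED BY '{HBASE_HANDLER}'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )


# Example with made-up table and column names.
example_ddl = hbase_table_ddl(
    "pagecounts",
    "pagecounts_hbase",
    [("rowkey", "STRING", ":key"), ("views", "STRING", "f:views")],
)
print(example_ddl)
```

The order of entries in the column mapping has to match the order of the Hive columns, which is why the sketch derives both from the same list.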

How to use R … in MapReduce and Hive

Friday, November 8th, 2013

How to use R and other non-Java languages in MapReduce and Hive by Tom Hanlon.

From the post:

I teach for Hortonworks and in class just this week I was asked to provide an example of using the R statistics language with Hadoop and Hive. The good news was that it can easily be done. The even better news is that it is actually possible to use a variety of tools: Python, Ruby, shell scripts and R to perform distributed fault tolerant processing of your data on a Hadoop cluster.

In this blog post I will provide an example of using R, with Hive. I will also provide an introduction to other non-Java MapReduce tools.

If you wanted to follow along and run these examples in the Hortonworks Sandbox you would need to install R.

The Hortonworks Sandbox just keeps getting better!
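
Tom's example uses R; the same streaming idea can be sketched in Python, one of the other languages he lists. Hive's TRANSFORM clause hands rows to an external script as tab-separated lines on standard input and reads tab-separated lines back from standard output. The script name `upper_word.py`, the table `word_counts`, and its two columns are assumptions made up for this sketch.

```python
import sys

# Hypothetical HiveQL that would invoke this script:
#   ADD FILE upper_word.py;
#   SELECT TRANSFORM (word, cnt)
#   USING 'python upper_word.py' AS (word, cnt)
#   FROM word_counts;


def transform(lines):
    """Yield one tab-separated output row per well-formed input row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows rather than fail the whole task
        word, count = fields
        yield f"{word.upper()}\t{count}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream rows from stdin to stdout, as Hive expects of the script."""
    for row in transform(stdin):
        stdout.write(row + "\n")

# Deployed as a script, the file would end with a call to main().
```

Keeping the row logic in `transform` separate from the I/O in `main` makes it possible to test the script locally on a few lines before shipping it to the cluster.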

Facebook’s Presto 10X Hive Speed (mostly)

Friday, November 8th, 2013

Facebook open sources its SQL-on-Hadoop engine, and the web rejoices by Derrick Harris.

From the post:

Facebook has open sourced Presto, the interactive SQL-on-Hadoop engine the company first discussed in June. Presto is Facebook’s take on Cloudera’s Impala or Google’s Dremel, and it already has some big-name fans in Dropbox and Airbnb.

Technologically, Presto and other query engines of its ilk can be viewed as faster versions of Hive, the data warehouse framework for Hadoop that Facebook created several years ago. Facebook and many other Hadoop users still rely heavily on Hive for batch-processing jobs such as regular reporting, but there has been a demand for something letting users perform ad hoc, exploratory queries on Hadoop data similar to how they might do them using a massively parallel relational database.

Presto is 10 times faster than Hive for most queries, according to Facebook software engineer Martin Traverso in a blog post detailing today’s news.

I think my headline is the more effective one. 😉

You won’t know anything until you download Presto, read the documentation, etc.

Presto homepage.

The first job is to get your attention, then you have to get the information necessary to be informed.

From Derrick’s post, which points to other SQL-on-Hadoop options, interesting times are ahead!

Delivering on Stinger:…

Thursday, October 31st, 2013

Delivering on Stinger: a Phase 3 Progress Update by Arun Murthy.

From the post:

With the attention of the Hadoop community on Strata/Hadoop World in New York this week, it seems an appropriate time to give everyone an early update on continued community development of Apache Hive. This progress well and truly cements Hive as the standard open-source SQL solution for the Apache Hadoop ecosystem for not just extremely large-scale, batch queries but also for low-latency, human-interactive queries.

Many of you have heard of Project Stinger already, but for those who have not, Stinger is a community-facing roadmap laid out to improve Hive’s performance 100x and bring true interactive query to Hadoop. You can read more at

We’ve gotten really excited lately as we’ve started to piece together the performance gains brought on by the past 9 months of hard work, including more than 700 closed Hive JIRAs and the launch of Apache Tez, which moves Hadoop beyond batch into a truly interactive big data platform.

I won’t replicate the performance graphics but I can hint that 200x improvements are worth your attention.

That’s right. 200x improvement in query performance.

Don’t take my word for it, read Arun’s post.

Hadoop Weekly – October 28, 2013

Tuesday, October 29th, 2013

Hadoop Weekly – October 28, 2013 by Joe Crobak.

A weekly blog post that tracks all things in the Hadoop ecosystem.

I will keep posting on Hadoop things of particular interest for topic maps but will also be pointing to this blog for those who want/need more Hadoop coverage.

Apache Hive 0.12: Stinger Phase Two… DELIVERED [Unlike Obamacare]

Monday, October 21st, 2013

Apache Hive 0.12: Stinger Phase Two… DELIVERED by Thejas Nair.

From the post:

Stinger is not a product. Stinger is a broad community based initiative to bring interactive query at petabyte scale to Hadoop. And today, as representatives of this open, community led effort we are very proud to announce delivery of Apache Hive 0.12, which represents the critical second phase of this project!

Only five months in the making, Apache Hive 0.12 comprises over 420 closed JIRA tickets contributed by ten companies, with nearly 150 thousand lines of code! This work is perfectly representative of our approach… it is a substantial release with major contributions from a wide group of talented engineers from Microsoft, Facebook, Yahoo and others.

Delivery of SQL-IN-Hadoop Marches

The Stinger Initiative was announced in February and as promised, we have seen consistent regular delivery of new features and improvements as outlined in the Stinger plan. There are three roadmap vectors for Stinger: Speed, Scale and SQL. Each phase of the initiative advances on all three goals and this release provides a significant increase in SQL semantics, adding the VARCHAR and DATE datatypes and improving performance of ORDER BY and GROUP BY. Several features to optimize queries have also been added.

We also contributed numerous “under the hood” improvements, i.e., refactoring code and making it easier to build on top of Hive – getting rid of some of the technical debt. This helps us deliver further optimizations in the long term, especially for the upcoming Apache Tez integration.

A complete list of the notable improvements included in the release is listed here, and you can expect an updated performance benchmark soon!

It is so nice to see a successful software project!

And an open source one at that!

Unlike the no bid IT mega-failure that is Obamacare.

Maybe there is something to having a good infrastructure for code development as opposed to contractors billing by the phone call, lunch meeting and hour.

BTW, all the protests about the volume of users trying to register with Obamacare? More managerial incompetence.

When you are rolling out a system for potentially 300 million+ users, don’t you anticipate load as part of the requirements?

If you didn’t, there is the start of the trail of managerial incompetence in Obamacare.

Hadoop Tutorials – Hortonworks

Wednesday, October 16th, 2013

With the GA release of Hadoop 2, it seems appropriate to list a set of tutorials for the Hortonworks Sandbox.

Tutorial 1: Hello World – An Overview of Hadoop with HCatalog, Hive and Pig

Tutorial 2: How To Process Data with Apache Pig

Tutorial 3: How to Process Data with Apache Hive

Tutorial 4: How to Use HCatalog, Pig & Hive Commands

Tutorial 5: How to Use Basic Pig Commands

Tutorial 6: How to Load Data for Hadoop into the Hortonworks Sandbox

Tutorial 7: How to Install and Configure the Hortonworks ODBC driver on Windows 7

Tutorial 8: How to Use Excel 2013 to Access Hadoop Data

Tutorial 9: How to Use Excel 2013 to Analyze Hadoop Data

Tutorial 10: How to Visualize Website Clickstream Data

Tutorial 11: How to Install and Configure the Hortonworks ODBC driver on Mac OS X

Tutorial 12: How to Refine and Visualize Server Log Data

Tutorial 13: How To Refine and Visualize Sentiment Data

Tutorial 14: How To Analyze Machine and Sensor Data

By the time you finish these, I am sure there will be more tutorials or even proposed additions to the Hadoop stack!

(Updated December 3, 2013 to add #13 and #14.)

…Hive Functions in Hadoop

Sunday, September 22nd, 2013

Cheat Sheet: How To Work with Hive Functions in Hadoop by Marc Holmes.

From the post:

Just a couple of weeks ago we published our simple SQL to Hive Cheat Sheet. That has proven immensely popular with a lot of folk to understand the basics of querying with Hive. Our friends at Qubole were kind enough to work with us to extend and enhance the original cheat sheet with more advanced features of Hive: User Defined Functions (UDF). In this post, Gil Allouche of Qubole takes us from the basics of Hive through to getting started with more advanced uses, which we’ve compiled into another cheat sheet you can download here.

The cheat sheet will be useful but so is this observation in the conclusion of the post:

One of the key benefits of Hive is using existing SQL knowledge, which is a common skill found across business analysts, data analysts, software engineers, data scientists and others. Hive has nearly no barriers for new users to start exploring and analyzing data.

I’m sure use of existing SQL knowledge isn’t the only reason for Hive’s success, but the Hive PoweredBy page shows it didn’t hurt!

Something to think about in creating a topic map query language. Yes, the queries executed by an engine will be traversing a topic map graph, but presenting it to the user as a graph query isn’t required.

Scaling Apache Giraph to a trillion edges

Friday, September 13th, 2013

Scaling Apache Giraph to a trillion edges by Avery Ching.

From the post:

Graph structures are ubiquitous: they provide a basic model of entities with connections between them that can represent almost anything. Flight routes connect airports, computers communicate to one another via the Internet, webpages have hypertext links to navigate to other webpages, and so on. Facebook manages a social graph that is composed of people, their friendships, subscriptions, and other connections. Open graph allows application developers to connect objects in their applications with real-world actions (such as user X is listening to song Y).

Analyzing these real world graphs at the scale of hundreds of billions or even a trillion (10^12) edges with available software was impossible last year. We needed a programming framework to express a wide range of graph algorithms in a simple way and scale them to massive datasets. After the improvements described in this article, Apache Giraph provided the solution to our requirements.

In the summer of 2012, we began exploring a diverse set of graph algorithms across many different Facebook products as well as academic literature. We selected a few representative use cases that cut across the problem space with different system bottlenecks and programming complexity. Our diverse use cases and the desired features of the programming framework drove the requirements for our system infrastructure. We required an iterative computing model, graph-based API, and fast access to Facebook data. Based on these requirements, we selected a few promising graph-processing platforms including Apache Hive, GraphLab, and Apache Giraph for evaluation.

For your convenience:

Apache Giraph

Apache Hive


Your appropriate scale is probably less than a trillion edges but everybody likes a great scaling story.

This is a great scaling story.
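
To make the "iterative computing model" in the excerpt a little more concrete, here is a small single-machine Python sketch of vertex-centric, superstep-style computation: each vertex keeps a label, receives messages from its neighbors, and passes on any smaller label it learns about until no messages remain. This is not Giraph's API and says nothing about how Facebook scaled it; the function name and the sample edges are made up for illustration.

```python
def connected_components(edges):
    """Label every vertex with the smallest vertex id in its component.

    edges is an iterable of (u, v) pairs describing an undirected graph.
    The loop mimics supersteps: messages sent in one round are only
    read in the next round.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    labels = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own id to all neighbors.
    inbox = {v: [] for v in neighbors}
    for v, nbrs in neighbors.items():
        for n in nbrs:
            inbox[n].append(labels[v])

    # Later supersteps: adopt a smaller label and forward it if one arrives.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if not messages:
                continue
            smallest = min(messages)
            if smallest < labels[v]:
                labels[v] = smallest
                for n in neighbors[v]:
                    outbox[n].append(smallest)
        inbox = outbox

    return labels


# Two components: {1, 2, 3} and {4, 5}.
print(connected_components([(1, 2), (2, 3), (4, 5)]))
```

In a distributed system the vertices would be partitioned across workers and the message exchange would cross the network between supersteps; the sketch keeps everything in one process to show only the control flow.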

Stinger Phase 2:…

Thursday, September 5th, 2013

Stinger Phase 2: The Journey to 100x Faster Hive on Hadoop by Carter Shanklin.

From the post:

The Stinger Initiative is Hortonworks’ community-facing roadmap laying out the investments Hortonworks is making to improve Hive performance 100x and evolve Hive to SQL compliance to simplify migrating SQL workloads to Hive.

We launched the Stinger Initiative along with Apache Tez to evolve Hadoop beyond its MapReduce roots into a data processing platform that satisfies the need for both interactive query AND petabyte scale processing. We believe it’s more feasible to evolve Hadoop to cover interactive needs rather than move traditional architectures into the era of big data.

If you don’t think SQL is all that weird, ;-), this is a status update for you!

Serious progress is being made by a broad coalition of more than 60 developers.

Take the challenge and download HDP 2.0 Beta.

You can help build the future of SQL-IN-Hadoop.

But only if you participate.

Simple Hive ‘Cheat Sheet’ for SQL Users

Wednesday, August 21st, 2013

Simple Hive ‘Cheat Sheet’ for SQL Users by Marc Holmes.

From the post:

If you’re already familiar with SQL then you may well be thinking about how to add Hadoop skills to your toolbelt as an option for data processing.

From a querying perspective, using Apache Hive provides a familiar interface to data held in a Hadoop cluster and is a great way to get started. Apache Hive is data warehouse infrastructure built on top of Apache Hadoop for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).

Naturally, there are a bunch of differences between SQL and HiveQL, but on the other hand there are a lot of similarities too, and recent releases of Hive bring that SQL-92 compatibility closer still.

To highlight that – and as a bit of fun to get started – below is a simple ‘cheat sheet’ (based on a simple MySQL reference such as this one) for getting started with basic querying for Hive. Here, we’ve done a direct comparison to MySQL, but given the simplicity of these particular functions, it should be the same in essentially any SQL dialect.

Of course, if you really want to get to grips with Hive, then take a look at the full language manual.

Definitely going to print this cheat sheet out and put it in plastic.

A top of the desk sort of reference.