Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

April 4, 2012

March 2012 Bay Area HBase User Group meetup summary

Filed under: HBase,Hive — Patrick Durusau @ 2:31 pm

March 2012 Bay Area HBase User Group meetup summary by David S. Wang.

Let’s see:

  • …early access program – HBase In Action
  • …recent HBase releases
  • …Moving HBase RPC to protobufs
  • …Comparing the native HBase client and asynchbase
  • …Using Apache Hive with HBase: Recent improvements
  • …backups and snapshots in HBase
  • …Apache HBase PMC meeting

Do you need any additional reasons to live in the Bay Area? 😉

Seriously, if you do, take advantage of the opportunity meetings like this one offer.

If you don’t, it might be cheaper than airfare to create your own HBase/Hive ecosystem.

March 27, 2012

Hive, Pig, Scalding, Scoobi, Scrunch and Spark

Filed under: Hive,Pig,Scalding,Scoobi,Scrunch,Spark — Patrick Durusau @ 7:18 pm

Hive, Pig, Scalding, Scoobi, Scrunch and Spark by Sami Badawi.

From the post:

Comparison of Hadoop Frameworks

I had to do simple processing of log files in a Hadoop cluster. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data. There are several high level Hadoop frameworks that make Hadoop programming easier. Here is the list of Hadoop frameworks I tried:

  • Pig
  • Scalding
  • Scoobi
  • Hive
  • Spark
  • Scrunch
  • Cascalog

The task was to read log files, join them with other data, and do some statistics on arrays of doubles. Programming this without Hadoop is simple, but it caused me some grief with Hadoop.

This blog post is not a full review, but my first impression of these Hadoop frameworks.
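To make the “assembly code” contrast concrete, here is a rough sketch (mine, not Sami’s) of the log task as a single HiveQL query submitted over Hive’s JDBC interface. The table and column names (logs, metrics, value) are invented for illustration:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class LogStats {
      public static void main(String[] args) throws Exception {
          // Hive's pre-HiveServer2 JDBC driver and URL scheme
          Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
          Connection con = DriverManager.getConnection(
              "jdbc:hive://localhost:10000/default", "", "");
          Statement stmt = con.createStatement();
          // Join raw logs with a lookup table, then aggregate a double-valued metric
          ResultSet rs = stmt.executeQuery(
              "SELECT l.host, AVG(m.value), STDDEV_POP(m.value) " +
              "FROM logs l JOIN metrics m ON (l.id = m.id) " +
              "GROUP BY l.host");
          while (rs.next()) {
              System.out.println(rs.getString(1) + "\t" + rs.getDouble(2)
                  + "\t" + rs.getDouble(3));
          }
          con.close();
      }
  }

Compare that to the separate mapper, reducer, and driver classes the same join-and-aggregate takes in raw MapReduce and the “assembly code” remark makes sense.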

Everyone has a favorite use case.

How does your use case fare with different frameworks for Hadoop? (We won’t ever know if you don’t say.)

March 26, 2012

Measuring User Retention with Hadoop and Hive

Filed under: Hadoop,Hive,Marketing — Patrick Durusau @ 6:35 pm

Measuring User Retention with Hadoop and Hive by Daniel Russo.

From the post:

The Hadoop ecosystem is comprised of numerous technologies that can work together to provide a powerful and scalable mechanism for analyzing and deriving insight from large quantities of data.

In an effort to showcase the flexibility and raw power of queries that can be performed over large datasets stored in Hadoop, this post is written to demonstrate an example use case. The specific goal is to produce data related to user retention, an important metric for all product companies to analyze and understand.

Motivation: Why User Retention?

Broadly speaking, when equipped with the appropriate tools and data, we can enable our team and our customers to better understand the factors that drive user engagement and to ultimately make decisions that deliver better products to market.

User retention measures speak to the core of product quality by answering a crucial question about how the product resonates with users. In the case of apps (mobile or otherwise), that question is: “how many days does it take for users to stop using (or uninstall) the app?”.

Pinch Media (now Flurry) delivered a formative presentation early in the AppStore’s history. Among numerous insights collected from their dataset was the following slide, which detailed patterns in user retention across all apps implementing their tracking SDK:

I mention this example because:

  • User retention is the measure of an app’s success or failure.*
  • Hadoop and Hive skill sets are good ones to pick up.

* I have a pronounced fondness for requirements and the documenting of the same. Others prefer unit/user/interface/final tests. Still others prefer formal proofs of “correctness.” All pale beside the test of “user retention.” If users keep using an application, what other measure would be meaningful?
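For the curious, the heart of such a retention measurement can be a single Hive query. This is my own hypothetical sketch in the spirit of Daniel’s post, not his code; it assumes an events table with user_id and event_date columns, neither of which comes from the original article:

  public class RetentionQuery {
      // Days retained per user, bucketed into a histogram; run this string
      // through the Hive JDBC setup sketched a post or two above.
      static final String RETENTION_QUERY =
          "SELECT datediff(t.last_seen, t.first_seen) AS days_retained, " +
          "       COUNT(*) AS users " +
          "FROM (SELECT user_id, " +
          "             MIN(event_date) AS first_seen, " +
          "             MAX(event_date) AS last_seen " +
          "      FROM events GROUP BY user_id) t " +
          "GROUP BY datediff(t.last_seen, t.first_seen)";
  }

MIN/MAX work here because ‘yyyy-MM-dd’ strings sort chronologically; datediff() is a standard Hive UDF.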

February 13, 2012

Big Data analytics with Hive and iReport

Filed under: Hadoop,Hive,iReport — Patrick Durusau @ 8:19 pm

Big Data analytics with Hive and iReport

From the post:

Each episode of J.J. Abrams’ TV series Person of Interest starts with the following narration from Mr. Finch, one of the leading characters: “You are being watched. The government has a secret system–a machine that spies on you every hour of every day. I know because…I built it.” Of course, we technical people know better. It would take a huge team of electrical and software engineers many years to build such a high performing machine, and the budget would be unimaginable… or would it? Wait a second, we have Hadoop! Now every one of us can be Mr. Finch on a modest budget, thanks to Hadoop.

In the JCG article “Hadoop Modes Explained – Standalone, Pseudo Distributed, Distributed” JCG partner Rahul Patodi explained how to set up Hadoop. The Hadoop project has produced a lot of tools for analyzing semi-structured data, but Hive is perhaps the most intuitive among them, as it allows anyone with an SQL background to submit MapReduce jobs described as SQL queries. Hive can be executed from a command line interface, as well as run in a server mode with a Thrift client acting as a JDBC/ODBC interface, giving access to data analysis and reporting applications.

In this article we will set up a Hive Server, create a table, load it with data from a text file, and then create a Jasper Report using iReport. The Jasper Report executes an SQL query on the Hive Server that is then translated to a MapReduce job executed by Hadoop.
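For readers who want the flavor of it before reading the full article, here is a minimal sketch of that create/load/query cycle against a Hive Server, using the old org.apache.hadoop.hive.jdbc.HiveDriver. The file path and schema are placeholders of mine, not the article’s:

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  public class HiveLoad {
      public static void main(String[] args) throws Exception {
          Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
          Connection con = DriverManager.getConnection(
              "jdbc:hive://localhost:10000/default", "", "");
          Statement stmt = con.createStatement();
          // The era's Hive JDBC examples pushed DDL through executeQuery()
          stmt.executeQuery("CREATE TABLE page_hits (host STRING, hits INT) " +
              "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
          stmt.executeQuery("LOAD DATA LOCAL INPATH '/tmp/hits.tsv' " +
              "OVERWRITE INTO TABLE page_hits");
          // iReport then issues its own SELECT over the same JDBC interface,
          // which Hive translates into a MapReduce job
          con.close();
      }
  }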

Just in case you have ever wanted to play the role of “Big Brother.” 😉

On the other hand, the old adage that a good offense is the best defense may well be true.

Competing with other governments, organizations, companies, agencies or even inside them.

February 3, 2012

Karmasphere Studio Community Edition

Filed under: Hadoop,Hive,Karmasphere — Patrick Durusau @ 4:52 pm

Karmasphere Studio Community Edition

From the webpage:

Karmasphere Studio Community Edition is the free edition of our graphical development environment that facilitates learning Hadoop MapReduce jobs. It supports the prototyping, developing, and testing phases of the Hadoop development lifecycle.

The parallel and parameterized query features in their Analyst product attracted me to the site:

From the webpage:

According to Karmasphere, the updated version of Analyst offers a parallel query capability that they say will make it faster for data analysts to iteratively query their data and create visualizations. The company claims that the new update allows data analysts to submit queries, view results, submit a new set and then compare those results across the previous outputs. In essence, this means users can run an unlimited number of queries concurrently on Hadoop so that one or more data sets can be viewed while the others are being generated.

Karmasphere also says that the introduction of parameterized queries allows users to submit their queries as they go, while offering them output in easy-to-read graphical representations of the findings, in Excel spreadsheets, or across a number of other outside reporting tools.

Hey, it says “…in Excel spreadsheets,” do you think they are reading my blog? (Spreadsheet -> Topic Maps: Wrong Direction? 😉 I didn’t really think so either.) I do take that as validation of the idea that offering users a familiar interface is more likely to be successful than an unfamiliar one.
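As for parameterized queries, the underlying idea is plain JDBC. A generic sketch of the concept (this is not Karmasphere’s API, and whether a particular Hive driver version supports PreparedStatement is an assumption on my part; the table name is invented):

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;

  public class ParamQuery {
      // Re-run the same query with a different parameter, then compare outputs
      static ResultSet hitsSince(Connection con, String day) throws Exception {
          PreparedStatement ps = con.prepareStatement(
              "SELECT host, COUNT(*) FROM page_hits WHERE dt >= ? GROUP BY host");
          ps.setString(1, day);
          return ps.executeQuery();
      }
  }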

January 26, 2012

Measuring User Retention with Hadoop and Hive

Filed under: Hadoop,Hive — Patrick Durusau @ 6:53 pm

Measuring User Retention with Hadoop and Hive by Daniel Russo.

From the post:

The Hadoop ecosystem is comprised of numerous technologies that can work together to provide a powerful and scalable mechanism for analyzing and deriving insight from large quantities of data.

In an effort to showcase the flexibility and raw power of queries that can be performed over large datasets stored in Hadoop, this post is written to demonstrate an example use case. The specific goal is to produce data related to user retention, an important metric for all product companies to analyze and understand.

Compelling demonstration of the power of Hadoop and Hive to measure raw user retention in an “app” situation.

Question:

User retention isn’t a new issue. Does anyone know what strategies were used to measure it before Hadoop and Hive?

The reason I ask is that prior analysis of user retention may point the way towards data or relationships it wasn’t possible to capture before.

For example, when an app falls into non-use or is uninstalled, what impact (if any) does that have on known “friends” and their use of the app?

Are there any patterns to non-use/uninstalls over short or long periods of time in identifiable groups? (A social behavior type question.)

January 1, 2012

Gora Graduates!

Filed under: Cassandra,Hadoop,HBase,Hive,Lucene,MapReduce,Pig,Solr — Patrick Durusau @ 5:54 pm

Gora Graduates! (Incubator location)

Via Twitter I just saw a post announcing that Gora has graduated from the Apache Incubator!

Congratulations to all involved.

Oh, the project:

What is Gora?

Gora is an ORM framework for column stores such as Apache HBase and Apache Cassandra with a specific focus on Hadoop.

Why Gora?

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from that of their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use ORM framework with data store specific mappings and built-in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

  • Data Persistence : Persisting objects to column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS.
  • Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location.
  • Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
  • Analysis : Accessing the data and analyzing it through adapters for Apache Pig, Apache Hive and Cascading.
  • MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.
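In the style of the Gora tutorial, the ORM surface looks roughly like the following. Pageview is a persistent class generated by Gora’s compiler from an Avro schema; the key type and the record itself are assumptions for illustration:

  import org.apache.gora.store.DataStore;
  import org.apache.gora.store.DataStoreFactory;
  import org.apache.hadoop.conf.Configuration;

  public class GoraSketch {
      public static void main(String[] args) throws Exception {
          // The same code runs against HBase, Cassandra, etc.; the backing
          // store is chosen in gora.properties, not in the code
          DataStore<Long, Pageview> store = DataStoreFactory.getDataStore(
              Long.class, Pageview.class, new Configuration());
          Pageview view = new Pageview();   // generated Avro bean
          store.put(42L, view);             // persist by key
          store.flush();
          Pageview back = store.get(42L);   // fetch it back
          store.close();
      }
  }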

November 23, 2011

Coming Attractions: Apache Hive 0.8.0

Filed under: Hadoop,Hive — Patrick Durusau @ 7:52 pm

Coming Attractions: Apache Hive 0.8.0 by Carl Steinbach.

Apache Hive 0.8.0 won’t arrive for several weeks yet, but Carl’s preview covers:

  • Bitmap Indexes
  • TIMESTAMP datatype
  • Plugin Developer Kit
  • JDBC Driver Improvements

Are you interested now? Wondering what else will be included? Could always visit the Apache Hive project to find out. 😉
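For a quick taste of two of those features, here is what the TIMESTAMP datatype and a bitmap index look like in HiveQL, as I read the preview and the Hive docs (table and index names are mine):

  public class HiveZeroEight {
      static final String[] DDL = {
          // New TIMESTAMP datatype in a column definition
          "CREATE TABLE events (user_id BIGINT, ts TIMESTAMP)",
          // Bitmap index, built lazily via the REBUILD statement
          "CREATE INDEX events_user_idx ON TABLE events (user_id) " +
              "AS 'BITMAP' WITH DEFERRED REBUILD",
          "ALTER INDEX events_user_idx ON events REBUILD"
      };
  }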

November 21, 2011

Comparing High Level MapReduce Query Languages

Filed under: Hadoop,Hive,JAQL,MapReduce,Pig — Patrick Durusau @ 7:27 pm

Comparing High Level MapReduce Query Languages by R.J. Stewart, P.W. Trinder, and H-W. Loidl.

Abstract:

The MapReduce parallel computational model is of increasing importance. A number of High Level Query Languages (HLQLs) have been constructed on top of the Hadoop MapReduce realization, primarily Pig, Hive, and JAQL. This paper makes a systematic performance comparison of these three HLQLs, focusing on scale up, scale out and runtime metrics. We further make a language comparison of the HLQLs focusing on conciseness and computational power. The HLQL development communities are engaged in the study, which revealed technical bottlenecks and limitations described in this document, and it is impacting their development.

A starting place for watching these three HLQLs as they develop, which no doubt they will continue to do. One expects them to be joined by other candidates, so familiarity with this paper may help speed their evaluation as well.

November 20, 2011

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Filed under: Crunch,Dremel,Dryad,Flume,Giraph,HBase,HDFS,Hive,JDBC,MapReduce,ODBC,Oozie,Pregel — Patrick Durusau @ 4:21 pm

Jeff Hammerbacher on Experiences Evolving a New Analytical Platform

Slides from Jeff’s presentation and numerous references, including a live blogging summary by Jeff Dalton.

In terms of the new analytical platform, I would strongly suggest that you take a close look at Cloudera’s substrate:

Cloudera starts with a substrate architecture of Open Compute commodity Linux servers configured using Puppet and Chef and coordinated using ZooKeeper. Naturally this entire stack is open-source. They use HDFS and Ceph to provide distributed, schema-less storage. They offer append-only table storage and metadata using Avro, RCFile, and HCatalog; and mutable table storage and metadata using HBase. For computation, they offer YARN (inter-job scheduling, like Grid Engine, for data intensive computing) and Mesos for cluster resource management; MapReduce, Hamster (MPI), Spark, Dryad / DryadLINQ, Pregel (Giraph), and Dremel as processing frameworks; and Crunch (like Google’s FlumeJava), PigLatin, HiveQL, and Oozie as high-level interfaces. Finally, Cloudera offers tool access through FUSE, JDBC, and ODBC; and data ingest through Sqoop and Flume.

Rather than asking the usual questions (how to make this faster, how to get more storage, etc.), all of which are important, ask the more difficult questions:

  1. In or between which of these elements, would human analysis/judgment have the greatest impact?
  2. Would human analysis/judgment be best made by experts or crowds?
  3. What sort of interface would elicit the best human analysis/judgment? (visual/aural; contest/game/virtual)
  4. Performance with feedback or homeostasis mechanisms?

That is a very crude and uninformed starter set of questions.

Putting higher speed access to more data with better tools at our fingertips expands the questions we can ask of interfaces and our interaction with the data. (Before we ever ask questions of the data.)

November 8, 2011

Toad Virtual Expo – 11.11.11 – 24-hour Toad Event

Filed under: Conferences,Hadoop,HBase,Hive,MySQL,Oracle,Toad — Patrick Durusau @ 7:46 pm

Toad Virtual Expo – 11.11.11 – 24-hour Toad Event

From the website:

24 hours of Toad is here! Join us on 11.11.11, and take an around the world journey with Toad and database experts who will share database development and administration best practices. This is your chance to see new products and new features in action, virtually collaborate with other users and Quest’s own experts, and get a first-hand look at what’s coming in the world of Toad.

If you are not going to see the Immortals on 11.11.11, or are looking for something to do after the movie, drop in on the Toad Virtual Expo! 😉 (It doesn’t look like a “chick” movie anyway.)

Times:

Register today for Quest Software’s 24-hour Toad Virtual Expo and learn why the best just got better.

  1. Tokyo Friday, November 11, 2011 6:00 a.m. JST – Saturday, November 12, 2011 6:00 a.m. JST
  2. Sydney Friday, November 11, 2011 8:00 a.m. EDT – Saturday, November 12, 2011 8:00 a.m. EDT
  3. Tel Aviv Thursday, November 10, 2011 11:00 p.m. IST – Friday, November 11, 2011 11:00 p.m. IST
  4. Central Europe Thursday, November 10, 2011 10:00 p.m. CET – Friday, November 11, 2011 10:00 p.m. CET
  5. London Thursday, November 10, 2011 9:00 p.m. GMT – Friday, November 11, 2011 9:00 p.m. GMT
  6. New York Thursday, November 10, 2011 4:00 p.m. EST – Friday, November 11, 2011 4:00 p.m. EST
  7. Los Angeles Thursday, November 10, 2011 1:00 p.m. PST – Friday, November 11, 2011 1:00 p.m. PST

The site wasn’t long on specifics but this could be fun!

Toad for Cloud Databases (Quest Software)

Filed under: BigData,Cloud Computing,Hadoop,HBase,Hive,MySQL,Oracle,SQL Server — Patrick Durusau @ 7:45 pm

Toad for Cloud Databases (Quest Software)

From the news release:

The data management industry is experiencing more disruption than at any other time in more than 20 years. Technologies around cloud, Hadoop and NoSQL are changing the way people manage and analyze data, but the general lack of skill sets required to manage these new technologies continues to be a significant barrier to mainstream adoption. IT departments are left without a clear understanding of whether development and DBA teams, whose expertise lies with traditional technology platforms, can effectively support these new systems. Toad® for Cloud Databases addresses the skill-set shortage head-on, empowering database professionals to directly apply their existing skills to emerging Big Data systems through an easy-to-use and familiar SQL-based interface for managing non-relational data. 

News Facts:

  • Toad for Cloud Databases is now available as a fully functional, commercial-grade product, for free, at www.quest.com/toad-for-cloud-databases.  Toad for Cloud Databases enables users to generate queries, migrate, browse, and edit data, as well as create reports and tables in a familiar SQL view. By simplifying these tasks, Toad for Cloud Databases opens the door to a wider audience of developers, allowing more IT teams to experience the productivity gains and cost benefits of NoSQL and Big Data.
  • Quest first released Toad for Cloud Databases into beta in June 2010, making the company one of the first to provide a SQL-based database management tool to support emerging, non-relational platforms. Over the past 18 months, Quest has continued to drive innovation for the product, growing its list of supported platforms and integrating a UI for its bi-directional data connector between Oracle and Hadoop.
  • Quest’s connector between Oracle and Hadoop, available within Toad for Cloud Databases, delivers a fast and scalable method for data transfer between Oracle and Hadoop in both directions. The bidirectional characteristic of the utility enables organizations to take advantage of Hadoop’s lower cost of storage and analytical capabilities. Quest also contributed the connector to the Apache Hadoop project as an extension to the existing Sqoop framework; the connector is also available as part of Cloudera’s Distribution Including Apache Hadoop.
  • Toad for Cloud Databases today supports:
    • Apache Hive
    • Apache HBase
    • Apache Cassandra
    • MongoDB
    • Amazon SimpleDB
    • Microsoft Azure Table Services
    • Microsoft SQL Azure, and
    • All Open Database Connectivity (ODBC)-enabled relational databases (Oracle, SQL Server, MySQL, DB2, etc)


Anything that eases the transition to cloud computing is going to be welcome. Toad being free will increase the ranks of DBAs who will at least experiment on their own.

October 22, 2011

Cloudera Training Videos

Filed under: Hadoop,HBase,Hive,MapReduce,Pig — Patrick Durusau @ 3:17 pm

Cloudera Training Videos

Cloudera has added several training videos on Hadoop and parts of the Hadoop ecosystem.

You will find:

  • Introduction to HBase – Todd Lipcon
  • Thinking at Scale
  • Introduction to Apache Pig
  • Introduction to Apache MapReduce and HDFS
  • Introduction to Apache Hive
  • Apache Hadoop Ecosystem
  • Hadoop Training Virtual Machine
  • Hadoop Training: Programming with Hadoop
  • Hadoop Training: MapReduce Algorithms

No direct links to the videos: new resources/videos will appear at the Cloudera site more quickly than I can update this list.

Now you have something to watch this weekend (Oct. 22-23, 2011) other than reports on and of the World Series! Enjoy!

October 21, 2011

CDH3 update 2 is released (Apache Hadoop)

Filed under: Hadoop,Hive,Mahout,MapReduce,Pig — Patrick Durusau @ 7:27 pm

CDH3 update 2 is released (Apache Hadoop)

From the post:

There are a number of improvements coming to CDH3 with update 2. Among them are:

  1. New features – Support for Apache Mahout (0.5). Apache Mahout is a popular machine learning library that makes it easier for users to perform analyses like collaborative filtering and k-means clustering on Hadoop. Also added in update 2 is expanded support for Apache Avro’s data file format. Users can:
  • load data into Avro data files in Hadoop via Sqoop or Flume
  • run MapReduce, Pig or Hive workloads on Avro data files
  • view the contents of Avro files from the Hue web client

This gives users the ability to use all the major features of the Hadoop stack without having to switch file formats. The Avro file format provides added benefits over text because it is faster and more compact.

  2. Improvements (stability and performance) – HBase in particular has received a number of changes that improve stability and recoverability. All HBase users are encouraged to use update 2.
  3. Bug fixes – 50+ bug fixes. The enumerated fixes and their corresponding Apache project jiras are provided in the release notes.

Update 2 is available in all the usual formats (RHEL, SLES, Ubuntu, Debian packages, tarballs, and SCM Express). Check out the installation docs for instructions. If you’re running components from the Cloudera Management Suite, they will not be impacted by moving to update 2. The next update (update 3) for CDH3 is planned for January 2012.

Thank you for supporting Apache Hadoop and thank you for supporting Cloudera.
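If you want to see what is inside one of those Avro data files without Hue, Avro’s standard Java reader will do; the file name below is a placeholder (the output of a Sqoop import, for example):

  import java.io.File;
  import org.apache.avro.file.DataFileReader;
  import org.apache.avro.generic.GenericDatumReader;
  import org.apache.avro.generic.GenericRecord;

  public class AvroPeek {
      public static void main(String[] args) throws Exception {
          File f = new File("part-m-00000.avro");
          DataFileReader<GenericRecord> reader = new DataFileReader<GenericRecord>(
              f, new GenericDatumReader<GenericRecord>());
          System.out.println(reader.getSchema());   // the schema travels with the file
          for (GenericRecord rec : reader) {        // iterate the records
              System.out.println(rec);
          }
          reader.close();
      }
  }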

Another aspect of Cloudera’s support for the Hadoop ecosystem is its Cloudera University.

April 24, 2011

Pig and Hive at Yahoo!

Filed under: Hive,Pig — Patrick Durusau @ 5:32 pm

Pig and Hive at Yahoo!

Observations on how and why to use Pig and/or Hive.

April 4, 2011

Apache Hive 0.7.0 Released!

Filed under: Hadoop,Hive — Patrick Durusau @ 6:31 pm

Apache Hive 0.7.0 Released!

I count thirty-four (34) new features, so I am not going to list them here. There are improvements as well.
