Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 27, 2012

Hive, Pig, Scalding, Scoobi, Scrunch and Spark

Filed under: Hive,Pig,Scalding,Scoobi,Scrunch,Spark — Patrick Durusau @ 7:18 pm

Hive, Pig, Scalding, Scoobi, Scrunch and Spark by Sami Badawi.

From the post:

Comparison of Hadoop Frameworks

I had to do simple processing of log files in a Hadoop cluster. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data. There are several high level Hadoop frameworks that make Hadoop programming easier. Here is the list of Hadoop frameworks I tried:

  • Pig
  • Scalding
  • Scoobi
  • Hive
  • Spark
  • Scrunch
  • Cascalog

The task was to read log files, join them with other data, and do some statistics on arrays of doubles. Programming this without Hadoop is simple, but it caused me some grief with Hadoop.
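For concreteness, here is a minimal plain-Python sketch of the kind of task Badawi describes: parse log lines, join them against side data, and compute statistics on arrays of doubles. The field names and data are invented for illustration; the point is how simple this is outside Hadoop.

```python
# Hypothetical log lines: a user id followed by an array of doubles.
from statistics import mean

log_lines = [
    "u1 0.2 0.4 0.9",
    "u2 1.5 2.5",
    "u1 0.1 0.3",
]
users = {"u1": "premium", "u2": "free"}  # side data to join against

by_tier = {}
for line in log_lines:
    user, *values = line.split()
    tier = users[user]                      # the "join" step
    by_tier.setdefault(tier, []).extend(float(v) for v in values)

# the "statistics on arrays of doubles" step
stats = {tier: mean(vals) for tier, vals in by_tier.items()}
print(stats)
```

Each of the frameworks in the list above expresses some version of this load/join/aggregate pipeline; the comparison in the post is about how painful each makes it.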

This blog post is not a full review, but my first impression of these Hadoop frameworks.

Everyone has a favorite use case.

How does your use case fare with different frameworks for Hadoop? (We won’t ever know if you don’t say.)

January 12, 2012

Introducing DataFu: an open source collection of useful Apache Pig UDFs

Filed under: DataFu,Hadoop,MapReduce,Pig — Patrick Durusau @ 7:34 pm

Introducing DataFu: an open source collection of useful Apache Pig UDFs

From the post:

At LinkedIn, we make extensive use of Apache Pig for performing data analysis on Hadoop. Pig is a simple, high-level programming language that consists of just a few dozen operators and makes it easy to write MapReduce jobs. For more advanced tasks, Pig also supports User Defined Functions (UDFs), which let you integrate custom code in Java, Python, and JavaScript into your Pig scripts.
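As a hedged sketch of what a Python UDF for Pig looks like: Pig registers a Python file (via Jython) and calls its functions per record. The function name, schema, and registration lines below are illustrative, not taken from the post; the `outputSchema` stand-in lets the file also run as plain Python.

```python
# In a Pig script this file would be registered with something like:
#   REGISTER 'udfs.py' USING jython AS myfuncs;
#   hosts = FOREACH logs GENERATE myfuncs.extract_host(url);
# (script names and fields here are hypothetical)

def outputSchema(schema):          # stand-in for Pig's decorator so this
    def wrap(f):                   # file also runs outside Pig
        return f
    return wrap

@outputSchema("host:chararray")
def extract_host(url):
    """Pull the hostname out of a URL string."""
    if url is None:
        return None
    return url.split("//")[-1].split("/")[0]

print(extract_host("http://example.com/path"))  # example.com
```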

Over time, as we worked on data intensive products such as People You May Know and Skills, we developed a large number of UDFs at LinkedIn. Today, I’m happy to announce that we have consolidated these UDFs into a single, general-purpose library called DataFu and we are open sourcing it under the Apache 2.0 license:

Check out DataFu on GitHub!

DataFu includes UDFs for common statistics tasks, PageRank, set operations, bag operations, and a comprehensive suite of tests. Read on to learn more.
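The post mentions UDFs for common statistics tasks. As a plain-Python sketch of the kind of per-group computation such a UDF performs, here is a simple nearest-rank quantile over a bag of numbers (the method and data are illustrative, not DataFu's actual implementation):

```python
def quantile(values, q):
    """Nearest-rank quantile over a bag of numbers, 0.0 <= q <= 1.0."""
    vals = sorted(values)
    idx = min(int(q * len(vals)), len(vals) - 1)
    return vals[idx]

data = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
print(quantile(data, 0.5))  # the median
```

In Pig, a UDF like this would run inside a FOREACH over grouped data, computing one quantile per group.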

This is way cool!

Read the rest of Matthew’s post (link above) or get thee to GitHub!

January 1, 2012

Gora Graduates!

Filed under: Cassandra,Hadoop,HBase,Hive,Lucene,MapReduce,Pig,Solr — Patrick Durusau @ 5:54 pm

Gora Graduates! (Incubator location)

Over Twitter I just saw a post announcing that Gora has graduated from the Apache Incubator!

Congratulations to all involved.

Oh, the project:

What is Gora?

Gora is an ORM framework for column stores such as Apache HBase and Apache Cassandra with a specific focus on Hadoop.

Why Gora?

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from their relational cousins. Moreover, data-model-agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use ORM framework with data-store-specific mappings and built-in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

  • Data Persistence : Persisting objects to column stores such as HBase, Cassandra and Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS.
  • Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location.
  • Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
  • Analysis : Accessing the data and performing analysis through adapters for Apache Pig, Apache Hive and Cascading
  • MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.

November 21, 2011

Comparing High Level MapReduce Query Languages

Filed under: Hadoop,Hive,JAQL,MapReduce,Pig — Patrick Durusau @ 7:27 pm

Comparing High Level MapReduce Query Languages by R.J. Stewart, P.W. Trinder, and H-W. Loidl.

Abstract:

The MapReduce parallel computational model is of increasing importance. A number of High Level Query Languages (HLQLs) have been constructed on top of the Hadoop MapReduce realization, primarily Pig, Hive, and JAQL. This paper makes a systematic performance comparison of these three HLQLs, focusing on scale up, scale out and runtime metrics. We further make a language comparison of the HLQLs focusing on conciseness and computational power. The HLQL development communities are engaged in the study, which revealed technical bottlenecks and limitations described in this document, and it is impacting their development.

A starting place for watching these three HLQLs as they develop, which no doubt they will continue to do. One expects them to be joined by other candidates, so familiarity with this paper may speed their evaluation as well.

November 11, 2011

Postgres Plus Connector for Hadoop

Filed under: Hadoop,MapReduce,Pig,PostgreSQL,SQL — Patrick Durusau @ 7:39 pm

Postgres Plus Connector for Hadoop

From the webpage:

The Postgres Plus Connector for Hadoop provides developers easy access to massive amounts of SQL data for integration with or analysis in Hadoop processing clusters. Now large amounts of data managed by PostgreSQL or Postgres Plus Advanced Server can be accessed by Hadoop for analysis and manipulation using Map-Reduce constructs.

EnterpriseDB recognized early on that Hadoop, a framework allowing distributed processing of large data sets across computer clusters using a simple programming model, was a valuable and complementary data processing model to traditional SQL systems. Map-Reduce processing serves important needs for basic processing of extremely large amounts of data, and SQL-based systems will continue to fulfill their mission-critical needs for complex processing of data well into the future. What was missing was an easy way for developers to access and move data between the two environments.

EnterpriseDB has created the Postgres Plus Connector for Hadoop by extending the Pig platform (an engine for executing data flows in parallel on Hadoop) and using an EnterpriseDB JDBC driver to allow users the ability to load the results of a SQL query into Hadoop where developers can operate on that data using familiar Map-Reduce programming. In addition, data from Hadoop can also be moved back into PostgreSQL or Postgres Plus Advanced Server tables.

A private beta is in progress, see the webpage for details and to register.

Plus, there is a webinar, Tuesday, November 29, 2011 11:00 am Eastern Standard Time (New York, GMT-05:00), Extending SQL Analysis with the Postgres Plus Connector for Hadoop. Registration at the webpage as well.

A step towards seamless data environments. Much like word processing today, which still uses the old “.” formatting commands but keeps them unseen, data is going in that direction. You will specify the desired results and the environment will take care of access, processors, operations and the like. Tables will appear as tables because you have chosen to view them as tables, etc.

October 26, 2011

SQL -> Pig Translation

Filed under: Pig,SQL — Patrick Durusau @ 6:57 pm

hadoop pig documentation

From the post:

It is sometimes difficult for SQL users to learn Pig because their minds are used to thinking in SQL. In this tutorial, examples of various SQL statements are shown and then translated into Pig statements. For more detailed documentation, please see the official Pig manual.
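One illustrative pairing of the kind the tutorial walks through (the table and field names here are made up, not taken from the tutorial), with the same relational operation sketched in plain Python for concreteness:

```python
# SQL:  SELECT user, COUNT(*) FROM visits GROUP BY user;
#
# Pig:  visits  = LOAD 'visits' AS (user:chararray, url:chararray);
#       grouped = GROUP visits BY user;
#       counts  = FOREACH grouped GENERATE group, COUNT(visits);
#
# The equivalent GROUP BY + COUNT in plain Python:
from collections import Counter

visits = [("alice", "/a"), ("bob", "/b"), ("alice", "/c")]
counts = Counter(user for user, _ in visits)
print(dict(counts))
```

The Pig version reads as a sequence of named relations rather than one nested query, which is exactly the mental shift SQL users have to make.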

This could be an effective technique for teaching Pig to SQL programmers. What do you think?

October 25, 2011

SquareCog’s SquareBlog

Filed under: Pig — Patrick Durusau @ 7:35 pm

SquareCog’s SquareBlog by Dmitriy Ryaboy.

Blog devoted mostly to Pig and related technologies.

October 22, 2011

Cloudera Training Videos

Filed under: Hadoop,HBase,Hive,MapReduce,Pig — Patrick Durusau @ 3:17 pm

Cloudera Training Videos

Cloudera has added several training videos on Hadoop and parts of the Hadoop ecosystem.

You will find:

  • Introduction to HBase – Todd Lipcon
  • Thinking at Scale
  • Introduction to Apache Pig
  • Introduction to Apache MapReduce and HDFS
  • Introduction to Apache Hive
  • Apache Hadoop Ecosystem
  • Hadoop Training Virtual Machine
  • Hadoop Training: Programming with Hadoop
  • Hadoop Training: MapReduce Algorithms

No direct links to the videos because new resources/videos will appear more quickly at the Cloudera site than I will be updating this list.

Now you have something to watch this weekend (Oct. 22-23, 2011) other than reports on and of the World Series! Enjoy!

Mining Wikipedia with Hadoop and Pig for Natural Language Processing

Filed under: Hadoop,Natural Language Processing,Pig — Patrick Durusau @ 3:16 pm

Mining Wikipedia with Hadoop and Pig for Natural Language Processing

One problem with after-the-fact assignment of semantics to text is that the volume of text involved (usually) is too great for manual annotation.

This post walks you through the alternative of using automated annotation based upon Wikipedia content.

From the post:

Instead of manually annotating text, one should try to benefit from an existing annotated and publicly available text corpus that deals with a wide range of topics, namely Wikipedia.

Our approach is rather simple: the text body of Wikipedia articles is rich in internal links pointing to other Wikipedia articles. Some of those articles refer to the entity classes we are interested in (e.g. persons, countries, cities, …). Hence we just need to find a way to convert those links into entity class annotations on text sentences (without the Wikimarkup formatting syntax).
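A minimal sketch of that conversion, assuming a lookup from link targets to entity classes (the entity map and regex here are simplified illustrations, not the post's actual pipeline):

```python
import re

# mapping from Wikipedia link targets to entity classes (hypothetical)
ENTITY_CLASS = {"Paris": "CITY", "France": "COUNTRY"}

def annotate(sentence):
    """Turn [[Target]] and [[Target|display]] wikilinks into annotations."""
    def repl(m):
        target = m.group(1)
        text = m.group(2) or target
        cls = ENTITY_CLASS.get(target)
        return f"<{cls}>{text}</{cls}>" if cls else text
    return re.sub(r"\[\[([^|\]]+)(?:\|([^\]]+))?\]\]", repl, sentence)

print(annotate("[[Paris]] is the capital of [[France]]."))
```

Run over a full Wikipedia dump, that per-sentence step is exactly the kind of embarrassingly parallel work Hadoop and Pig are suited to.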

This is also an opportunity to try out cloud based computing if you are so inclined.

October 21, 2011

CDH3 update 2 is released (Apache Hadoop)

Filed under: Hadoop,Hive,Mahout,MapReduce,Pig — Patrick Durusau @ 7:27 pm

CDH3 update 2 is released (Apache Hadoop)

From the post:

There are a number of improvements coming to CDH3 with update 2. Among them are:

  1. New features – Support for Apache Mahout (0.5). Apache Mahout is a popular machine learning library that makes it easier for users to perform analyses like collaborative filtering and k-means clustering on Hadoop. Also added in update 2 is expanded support for Apache Avro’s data file format. Users can:
  • load data into Avro data files in Hadoop via Sqoop or Flume
  • run MapReduce, Pig or Hive workloads on Avro data files
  • view the contents of Avro files from the Hue web client

This gives users the ability to use all the major features of the Hadoop stack without having to switch file formats. The Avro file format provides added benefits over text because it is faster and more compact.

  2. Improvements (stability and performance) – HBase in particular has received a number of changes that improve stability and recoverability. All HBase users are encouraged to use update 2.
  3. Bug fixes – 50+ bug fixes. The enumerated fixes and their corresponding Apache project jiras are provided in the release notes.

Update 2 is available in all the usual formats (RHEL, SLES, Ubuntu, Debian packages, tarballs, and SCM Express). Check out the installation docs for instructions. If you’re running components from the Cloudera Management Suite, they will not be impacted by moving to update 2. The next update (update 3) for CDH3 is planned for January, 2012.

Thank you for supporting Apache Hadoop and thank you for supporting Cloudera.

Another aspect of Cloudera’s support for the Hadoop ecosystem is its Cloudera University.

October 16, 2011

Hadoop User Group UK: Data Integration

Filed under: Data Integration,Flume,Hadoop,MapReduce,Pig,Talend — Patrick Durusau @ 4:12 pm

Hadoop User Group UK: Data Integration

Three presentations captured as podcasts from the Hadoop User Group UK:

LEVERAGING UNSTRUCTURED DATA STORED IN HADOOP

FLUME FOR DATA LOADING INTO HDFS / HIVE (SONGKICK)

LEVERAGING MAPREDUCE WITH TALEND: HADOOP, HIVE, PIG, AND TALEND FILESCALE

Fresh as of 13 October 2011.

Thanks to Skills Matter for making the podcasts available!

September 14, 2011

Yahoo! Hadoop Tutorial

Filed under: Hadoop,MapReduce,Pig — Patrick Durusau @ 7:03 pm

Yahoo! Hadoop Tutorial

From the webpage:

Welcome to the Yahoo! Hadoop Tutorial. This tutorial includes the following materials designed to teach you how to use the Hadoop distributed data processing environment:

  • Hadoop 0.18.0 distribution (includes full source code)
  • A virtual machine image running Ubuntu Linux and preconfigured with Hadoop
  • VMware Player software to run the virtual machine image
  • A tutorial which will guide you through many aspects of Hadoop’s installation and operation.

The tutorial is divided into seven modules, designed to be worked through in order. They can be accessed from the links below.

  1. Tutorial Introduction
  2. The Hadoop Distributed File System
  3. Getting Started With Hadoop
  4. MapReduce
  5. Advanced MapReduce Features
  6. Related Topics
  7. Managing a Hadoop Cluster
  8. Pig Tutorial

You can also download this tutorial as a single .zip file and burn a CD for offline use and easy distribution.

August 25, 2011

PageRank Implementation in Pig

Filed under: Pig,Software — Patrick Durusau @ 6:59 pm

PageRank Implementation in Pig

Simple implementation of PageRank using Pig. Think of it as an easy intro to Pig.
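If you want to see the algorithm itself before reading the Pig version, here is a plain-Python sketch of the iterative computation, on a tiny made-up link graph. Each iteration corresponds to the JOIN/GROUP pass the Pig implementation runs over the graph:

```python
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # page -> outgoing links
d, n = 0.85, len(links)                            # damping factor, page count
rank = {p: 1.0 / n for p in links}

for _ in range(30):  # one iteration = one MapReduce pass in the Pig version
    contrib = {p: 0.0 for p in links}
    for page, outs in links.items():
        share = rank[page] / len(outs)   # split rank across outgoing links
        for out in outs:
            contrib[out] += share
    rank = {p: (1 - d) / n + d * c for p, c in contrib.items()}

print(sum(rank.values()))  # no dangling nodes here, so ranks sum to ~1.0
```

The Pig version stores `links` and `rank` as relations and expresses the inner loops as a JOIN on page id followed by a GROUP and SUM.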

If you don’t know Pig, see: Pig. 😉 Sorry.

Saw this in NoSQL Weekly (Issue 39 – Aug 25, 2011). I can’t point you to the issue, the NoSQL Weekly site reports “beta” and asks if you want a sample copy.

August 1, 2011

Pig with Cassandra: Adventures in Analytics

Filed under: Cassandra,Pig,Pygmalion — Patrick Durusau @ 3:54 pm

Pig with Cassandra: Adventures in Analytics

Suggestions for slide 6 that reads in part:

Pygmalion

Figure in Greek Mythology, sounds like Pig

True enough but in terms of a control language, the play Pygmalion by Shaw would have been the better reference.

I presume the reader/listener would get the sound similarity without prompting.

Sorry, read the slide deck and see the source code at: https://github.com/jeromatron/pygmalion/.

July 23, 2011

Introduction to Oozie

Filed under: Hadoop,MapReduce,Oozie,Pig — Patrick Durusau @ 3:10 pm

Introduction to Oozie

From the post:

Tasks performed in Hadoop sometimes require multiple Map/Reduce jobs to be chained together to complete its goal. [1] Within the Hadoop ecosystem, there is a relatively new component Oozie [2], which allows one to combine multiple Map/Reduce jobs into a logical unit of work, accomplishing the larger task. In this article we will introduce Oozie and some of the ways it can be used.

What is Oozie ?

Oozie is a Java web application that runs in a Java servlet container (Tomcat) and uses a database to store:

  • Workflow definitions
  • Currently running workflow instances, including instance states and variables

An Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph), specifying a sequence in which the actions execute. This graph is specified in hPDL (an XML Process Definition Language).
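The control-dependency idea is easy to see in miniature: each action runs only after the actions it depends on. A small Python sketch, with invented action names (Oozie itself expresses this graph in hPDL, not Python):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# action -> actions it depends on (a DAG, as in an Oozie workflow)
workflow = {
    "ingest":    [],
    "clean":     ["ingest"],
    "pig_stats": ["clean"],
    "mr_join":   ["clean"],
    "report":    ["pig_stats", "mr_join"],
}

order = list(TopologicalSorter(workflow).static_order())
print(order)  # every action appears after its dependencies
```

Oozie adds what this sketch leaves out: submitting each action to the Hadoop cluster, tracking its state in the database, and handling failure and retry.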

Workflow management for Hadoop!

July 21, 2011

Wonderdog

Filed under: ElasticSearch,Hadoop,Pig — Patrick Durusau @ 6:30 pm

Wonderdog

From the webpage:

Wonderdog is a Hadoop interface to Elastic Search. While it is specifically intended for use with Apache Pig, it does include all the necessary Hadoop input and output formats for Elastic Search. That is, it’s possible to skip Pig entirely and write custom Hadoop jobs if you prefer.

I may just be paying more attention but the search scene seems to be really active.

That’s good for topic maps because the more data that is searched, the greater the likelihood of heterogeneous data. Text messages between teens are probably heterogeneous but who cares?

Medical researchers using different terminology results in heterogeneous data, not just today, but data from yesteryear. Now that could be important.

June 3, 2011

IBM InfoSphere BigInsights

Filed under: Avro,BigInsights,Hadoop,HBase,Lucene,Pig,Zookeeper — Patrick Durusau @ 2:32 pm

IBM InfoSphere BigInsights

Two items stand out in the usual laundry list of “easy administration” and “IBM supports open source” list of claims:

The Jaql query language. Jaql, a Query Language for JavaScript Object Notation (JSON), provides the capability to process both structured and non-traditional data. Its SQL-like interface is well suited for quick ramp-up by developers familiar with the SQL language and makes it easier to integrate with relational databases.

….

Integrated installation. BigInsights includes IBM value-added technologies, as well as open source components such as Hadoop, Lucene, Hive, Pig, ZooKeeper, HBase, and Avro, to name a few.

I guess it must include a “few” things, since the 64-bit Linux download is 398 MB.

Just pointing out its availability. More commentary to follow.

April 24, 2011

Hadoop2010: Hadoop and Pig at Twitter

Filed under: Hadoop,Pig — Patrick Durusau @ 5:33 pm

Hadoop2010: Hadoop and Pig at Twitter video of presentation by Kevin Weil.

From the description:

Apache Pig is a high-level framework built on top of Hadoop that offers a powerful yet vastly simplified way to analyze data in Hadoop. It allows businesses to leverage the power of Hadoop in a simple language readily learnable by anyone who understands SQL. In this presentation, Twitter’s Kevin Weil introduces Pig and shows how it has been used at Twitter to solve numerous analytics challenges that had become intractable with a former MySQL-based architecture.

I started to simply do the listing for the Hadoop Summit 2010 but the longer I watched Kevin’s presentation, the more I thought it needed to be singled out.

If you don’t already know Pig, you will be motivated to learn Pig after this presentation.

Pig and Hive at Yahoo!

Filed under: Hive,Pig — Patrick Durusau @ 5:32 pm

Pig and Hive at Yahoo!

Observations on how and why to use Pig and/or Hive.

April 22, 2011

Geo Analytics Tutorial – Where 2.0 2011

Filed under: Geo Analytics,Hadoop,Mechanical Turk,Pig — Patrick Durusau @ 1:04 pm

Geo Analytics Tutorial – Where 2.0 2011

Very cool set of slides on geo analytics from Pete Skomoroch.

Includes use of Hadoop, Pig, Mechanical Turk.
