Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 27, 2012

Hive, Pig, Scalding, Scoobi, Scrunch and Spark

Filed under: Hive,Pig,Scalding,Scoobi,Scrunch,Spark — Patrick Durusau @ 7:18 pm

Hive, Pig, Scalding, Scoobi, Scrunch and Spark by Sami Badawi.

From the post:

Comparison of Hadoop Frameworks

I had to do simple processing of log files in a Hadoop cluster. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data. There are several high level Hadoop frameworks that make Hadoop programming easier. Here is the list of Hadoop frameworks I tried:

  • Pig
  • Scalding
  • Scoobi
  • Hive
  • Spark
  • Scrunch
  • Cascalog

The task was to read log files, join them with other data, and do some statistics on arrays of doubles. Programming this without Hadoop is simple, but it caused me some grief with Hadoop.

This blog post is not a full review, but my first impression of these Hadoop frameworks.

Everyone has a favorite use case.

How does your use case fare with different frameworks for Hadoop? (We won’t ever know if you don’t say.)
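
For a concrete sense of the task Badawi describes, here is a minimal sketch of that kind of job in one of the frameworks on his list, Spark's Scala API: read the log files, join with reference data, and compute statistics over doubles. The paths, delimiters and field layout are my assumptions for illustration, not his actual data.

```scala
import org.apache.spark.SparkContext

// Hypothetical log-statistics job: paths and field positions are made up.
object LogStats {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "LogStats")

    // Tab-separated log lines parsed into (userId, latencyMs) pairs
    val logs = sc.textFile("hdfs:///logs/*.log")
      .map(_.split("\t"))
      .filter(_.length >= 2)
      .map(f => (f(0), f(1).toDouble))

    // Reference data to join with: (userId, region)
    val users = sc.textFile("hdfs:///ref/users.tsv")
      .map(_.split("\t"))
      .map(f => (f(0), f(1)))

    // Join the two datasets and compute a mean latency per region.
    val meanByRegion = logs.join(users)
      .map { case (_, (latency, region)) => (region, (latency, 1L)) }
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, n) => sum / n }

    meanByRegion.collect().foreach(println)
    sc.stop()
  }
}
```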

September 6, 2011

Hadoop Fatigue — Alternatives to Hadoop

Filed under: GraphLab,Hadoop,HPCC,MapReduce,Spark,Storm — Patrick Durusau @ 7:15 pm

Hadoop Fatigue — Alternatives to Hadoop

Can you name six (6) alternatives to Hadoop? Or formulate why you choose Hadoop over those alternatives?

From the post:

After working extensively with (Vanilla) Hadoop professionally for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the thought of writing a Hadoop job makes me take a deep breath. Before I continue, I will say that I still love Hadoop and the community.

  • Writing Hadoop jobs in Java is very time consuming because everything must be a class, and many times these classes extend several other classes or implement multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.
  • Documentation for the bloated Java API is sufficient, but not the most helpful.
  • HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.
  • Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!
  • Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are even lucky enough to have an error recorded! I’ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.
  • Large clusters require a dedicated team to keep them running properly, but that is not surprising.
  • Writing a Hadoop job becomes a software engineering task rather than a data analysis task.

Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I’ve often said to myself “there must be a better way.” For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not necessarily imperative:

Out of the six alternatives, I haven’t seen BashReduce or Disco, so I need to look those up.

Ah, the other alternatives: GraphLab, HPCC, Spark, and Preview of Storm: The Hadoop of Realtime Processing.

It is a pet peeve of mine that some authors force me to search for links they could just as well have entered. The New York Times, of all places, refers to websites but does not include the URLs. And that is for paid subscribers.
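
The first quoted bullet (class-heavy jobs where even a counter is a chore) is easier to appreciate with a sketch. Here is roughly what a single trivial map step looks like against Hadoop's Java mapreduce API, written in Scala; the record layout and counter names are invented for illustration.

```scala
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Even a trivial "tokenize and count" step needs its own class, Writable
// wrapper types, and the Mapper's generic signature spelled out in full.
class LogLineMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    val fields = value.toString.split("\t")
    if (fields.length < 2) {
      // The "simple counter" mentioned above: one line to increment,
      // but it only exists through this class's Context machinery.
      context.getCounter("parse", "malformed").increment(1)
    } else {
      context.write(new Text(fields(0)), one)
    }
  }
}
```

A driver class, job configuration and usually a reducer are still needed before this runs, which is the "software engineering task" the last bullet complains about.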

June 27, 2011

Spark – Lightning-Fast Cluster Computing

Filed under: Clustering (servers),Data Analysis,Scala,Spark — Patrick Durusau @ 6:39 pm

Spark – Lightning-Fast Cluster Computing

From the webpage:

What is Spark?

Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce.

To make programming faster, Spark integrates into the Scala language, letting you manipulate distributed datasets like local collections. You can also use Spark interactively to query big data from the Scala interpreter.
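
Manipulating a distributed dataset "like a local collection" looks like this in practice. A small sketch for the interactive shell, where a SparkContext named sc is provided; the file path and log format are my assumptions:

```scala
// In the Spark shell, `sc` (a SparkContext) is already available.
val lines  = sc.textFile("hdfs:///data/events.log")   // distributed dataset of strings
val errors = lines.filter(_.contains("ERROR"))        // defined lazily, like a view
val counts = errors.map(_.split(" ")(0))              // first token of each line, e.g. a date
                   .map(day => (day, 1))
                   .reduceByKey(_ + _)                 // same idiom as local collections
counts.take(10).foreach(println)                       // materialize a small sample
```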

What can it do?

Spark was initially developed for two applications where keeping data in memory helps: iterative algorithms, which are common in machine learning, and interactive data mining. In both cases, Spark can outperform Hadoop by 30x. However, you can use Spark’s convenient API for general data processing too. Check out our example jobs.

Spark runs on the Mesos cluster manager, so it can coexist with Hadoop and other systems. It can read any data source supported by Hadoop.
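
The in-memory reuse behind those numbers looks roughly like this: load once from any Hadoop-supported source (an HDFS path here, my assumption), cache the parsed records, then make repeated passes without re-reading from disk.

```scala
// Parse the input once and keep it in cluster memory for repeated passes.
val points = sc.textFile("hdfs:///data/points.csv")
  .map(_.split(",").map(_.toDouble))
  .cache()

var threshold = 100.0
for (i <- 1 to 10) {
  // Each pass scans the cached records instead of re-reading HDFS.
  val kept = points.filter(_.sum < threshold).count()
  println(s"pass $i: $kept points below $threshold")
  threshold *= 0.9
}
```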

Who uses it?

Spark was developed in the UC Berkeley AMP Lab. It’s used by several groups of researchers at Berkeley to run large-scale applications such as spam filtering, natural language processing and road traffic prediction. It’s also used to accelerate data analytics at Conviva. Spark is open source under a BSD license, so download it to check it out!

Hadoop must be doing something right to be treated as the solution to beat.

Still, depending on your requirements, Spark definitely merits your consideration.
