Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 30, 2011

Storm: Distributed and Fault-tolerant Real-time Computation

Filed under: Storm,Topic Maps — Patrick Durusau @ 7:05 pm

Storm: Distributed and Fault-tolerant Real-time Computation by Nathan Marz.

Summary:

Nathan Marz explains Storm, a distributed, fault-tolerant, real-time computation system currently used by Twitter to keep statistics on user clicks for every URL and domain.

A presentation that makes you want a cluster for Christmas!

Curious whether processing by subject identifiers would be one way to harness Storm for processing a topic map? Not a general solution, but then I am not sure general solutions are all that interesting.

A topic map for a library, for instance, could be configured to merge topics that match on subject identifiers for items held in the catalog and to reject other input. No doubt the other input may be of interest to others, but there is no topic map requirement that any topic map application accept all input.
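To make the idea concrete, here is a minimal sketch of that library-catalog policy in plain Python. All names (`Topic`, `Catalog`, the example IRIs) are hypothetical, not drawn from any topic map library: the point is only that merging is gated on a shared subject identifier and everything else is rejected.

```python
class Topic:
    def __init__(self, identifiers, properties=None):
        self.identifiers = set(identifiers)      # subject identifiers (IRIs)
        self.properties = dict(properties or {})

class Catalog:
    """A topic map application that only accepts topics matching
    subject identifiers already held in the catalog."""

    def __init__(self, topics):
        self.topics = list(topics)

    def accept(self, incoming):
        for topic in self.topics:
            if topic.identifiers & incoming.identifiers:
                # Match on a subject identifier: merge the topics.
                topic.identifiers |= incoming.identifiers
                topic.properties.update(incoming.properties)
                return True
        return False  # no match: this application rejects the input

catalog = Catalog([Topic({"http://example.org/id/moby-dick"})])
merged = catalog.accept(Topic({"http://example.org/id/moby-dick"},
                              {"author": "Melville"}))
rejected = catalog.accept(Topic({"http://example.org/id/unrelated"}))
```

The rejected topic may well be valuable to some other topic map application; this one simply declines it.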

Some quick Storm links:

Storm

Storm Deploy – One click to deploy on AWS.

Storm Starter – sample code

ScalaStorm

Storm Wiki

September 26, 2011

Twitter Storm: Open Source Real-time Hadoop

Filed under: Hadoop,NoSQL,Storm — Patrick Durusau @ 6:55 pm

Twitter Storm: Open Source Real-time Hadoop by Bienvenido David III.

From the post:

Twitter has open-sourced Storm, its distributed, fault-tolerant, real-time computation system, at GitHub under the Eclipse Public License 1.0. Storm is the real-time processing system developed by BackType, which is now under the Twitter umbrella. The latest package available from GitHub is Storm 0.5.2, which is mostly written in Clojure.

Storm provides a set of general primitives for doing distributed real-time computation. It can be used for “stream processing”, processing messages and updating databases in real-time. This is an alternative to managing your own cluster of queues and workers. Storm can be used for “continuous computation”, doing a continuous query on data streams and streaming out the results to users as they are computed. It can also be used for “distributed RPC”, running an expensive computation in parallel on the fly.
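The "stream processing" case above can be sketched in a few lines of plain Python (no Storm dependency): consume a stream of click events and keep per-URL and per-domain counts current, the way the post describes Twitter's statistics. In Storm this logic would be split across spouts and bolts on a cluster; the single-process version is only for illustration, and the event URLs are made up.

```python
from collections import Counter
from urllib.parse import urlparse

url_counts = Counter()
domain_counts = Counter()

def process_click(url):
    """Bolt-like step: update rolling statistics for one event."""
    url_counts[url] += 1
    domain_counts[urlparse(url).netloc] += 1

# A stand-in for the message stream Storm would deliver.
stream = [
    "http://example.com/a",
    "http://example.com/b",
    "http://other.org/x",
    "http://example.com/a",
]
for event in stream:
    process_click(event)
```

What Storm adds over a loop like this is exactly what the post names: the queue-and-worker plumbing, fault tolerance, and horizontal scaling.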

See the post for links, details, quotes, etc.

My bet is that topologies are going to be data set specific. You?

BTW, I don’t think the local coffee shop offers free access to its cluster. Will have to check with them next week.

September 6, 2011

Hadoop Fatigue — Alternatives to Hadoop

Filed under: GraphLab,Hadoop,HPCC,MapReduce,Spark,Storm — Patrick Durusau @ 7:15 pm

Hadoop Fatigue — Alternatives to Hadoop

Can you name six (6) alternatives to Hadoop? Or explain why you would choose Hadoop over those alternatives?

From the post:

After working extensively with (Vanilla) Hadoop professionally for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the thought of writing a Hadoop job makes me take a deep breath. Before I continue, I will say that I still love Hadoop and the community.

  • Writing Hadoop jobs in Java is very time consuming because everything must be a class, and many times these classes extend several other classes or implement multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.
  • Documentation for the bloated Java API is sufficient, but not the most helpful.
  • HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.
  • Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!
  • Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are even lucky enough to have an error recorded! I’ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.
  • Large clusters require a dedicated team to keep them running properly, but that is not surprising.
  • Writing a Hadoop job becomes a software engineering task rather than a data analysis task.

Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I’ve often said to myself “there must be a better way.” For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not necessarily imperative:

Out of the six alternatives, I haven’t seen BashReduce or Disco, so I need to look those up.

Ah, the other alternatives: GraphLab, HPCC, Spark, and Preview of Storm: The Hadoop of Realtime Processing.

It is a pet peeve of mine that some authors force me to search for links they could just as easily have included. The New York Times, of all places, refers to websites and does not include the URLs. And that is for paying subscribers.

August 5, 2011

A Storm is coming: more details and plans for release

Filed under: NoSQL,Storm — Patrick Durusau @ 7:07 pm

A Storm is coming: more details and plans for release

Storm is going to be released at Strange Loop on September 19!

From the post:

Here’s a recap of the three broad use cases for Storm:

  1. Stream processing: Storm can be used to process a stream of new data and update databases in realtime. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
  2. Continuous computation: Storm can do a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers. The browsers will have a realtime view on what the trending topics are as they happen.
  3. Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.
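The "distributed RPC" idea can also be sketched in plain Python: a query arrives, is partitioned across workers, each worker computes a partial result, and the merged answer goes back to the caller. Here a thread pool stands in for the Storm topology, and set intersection (one of the post's own examples) is the "intense query"; the function names are my own, not Storm API.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partial_intersection(shard_pair):
    """Worker: intersect one pair of co-partitioned shards."""
    a, b = shard_pair
    return a & b

def distributed_intersection(set_a, set_b, workers=4):
    """Hash-partition both sets so matching elements land in the
    same shard, intersect shards in parallel, then merge."""
    shards = [(set(), set()) for _ in range(workers)]
    for x in set_a:
        shards[hash(x) % workers][0].add(x)
    for x in set_b:
        shards[hash(x) % workers][1].add(x)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(partial_intersection, shards)
    return reduce(set.union, parts, set())

result = distributed_intersection(set(range(100)), set(range(50, 150)))
```

In Storm the topology would sit waiting for invocation messages and stream results back, but the partition/compute/merge shape is the same.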

The beauty of Storm is that it’s able to solve such a wide variety of use cases with just a simple set of primitives.

The really exciting part about all the current frenzy of development is imagining where it is going to be five (5) years from now.

