Hadoop Fatigue — Alternatives to Hadoop
Can you name six (6) alternatives to Hadoop? Or explain why you would choose Hadoop over those alternatives?
From the post:
After working extensively with (Vanilla) Hadoop professionally for the past 6 months, and at home for research, I have found several nagging issues with Hadoop that have convinced me to look elsewhere for everyday use and certain applications. For these applications, the thought of writing a Hadoop job makes me take a deep breath. Before I continue, I will say that I still love Hadoop and the community.
- Writing Hadoop jobs in Java is very time-consuming because everything must be a class, and many times these classes extend several other classes or implement multiple interfaces; the Java API is very bloated. Adding a simple counter to a Hadoop job becomes a chore of its own.
- Documentation for the bloated Java API is sufficient, but not the most helpful.
- HDFS is complicated and has plenty of issues of its own. I recently heard a story about data loss in HDFS just because the IP address block used by the cluster changed.
- Debugging a failure is a nightmare; is it the code itself? Is it a configuration parameter? Is it the cluster or one/several machines on the cluster? Is it the filesystem or disk itself? Who knows?!
- Logging is verbose to the point that finding errors is like finding a needle in a haystack. That is, if you are lucky enough to have an error recorded at all! I’ve had plenty of instances where jobs fail and there is absolutely nothing in the stdout or stderr logs.
- Large clusters require a dedicated team to keep them running properly, but that is not surprising.
- Writing a Hadoop job becomes a software engineering task rather than a data analysis task.
Hadoop will be around for a long time, and for good reason. MapReduce cannot solve every problem (fact), and Hadoop can solve even fewer problems (opinion?). After dealing with some of the innards of Hadoop, I’ve often said to myself “there must be a better way.” For large corporations that routinely crunch large amounts of data using MapReduce, Hadoop is still a great choice. For research, experimentation, and everyday data munging, one of these other frameworks may be better if the advantages of HDFS are not necessarily imperative:
Out of the six alternatives, I haven’t seen BashReduce or Disco, so I need to look those up.
The other four alternatives are GraphLab, HPCC, Spark, and Storm (listed as "Preview of Storm: The Hadoop of Realtime Processing").
It is a pet peeve of mine that some authors force me to search for links they could just as easily have included. The New York Times, of all places, refers to websites without including the URLs, even for paid subscribers.