Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 11, 2013

Hadoop – 100x Faster… [With NO ETL!]

Filed under: ETL,Hadoop,HDFS,MapReduce,Topic Maps — Patrick Durusau @ 8:32 pm

Hadoop – 100x Faster. How we did it… by Nikita Ivanov.

From the post:

Almost two years ago, Dmitriy and I stood in front of a white board at GridGain’s office thinking: “How can we deliver the real-time performance of GridGain’s in-memory technology to Hadoop customers without asking them [to] rip and replace their systems and without asking them to move their datasets off Hadoop?”.

Given Hadoop’s architecture – the task seemed daunting; and it proved to be one of the more challenging engineering puzzles we have had to solve.

After two years of development, tens of thousands of lines of Java, Scala and C++ code, multiple design iterations, several releases and dozens of benchmarks later, we finally built a product that can deliver real-time performance to Hadoop customers with seamless integration and no tedious ETL. Actual customer deployments can now prove our performance claims and validate our product’s architecture.

Here’s how we did it.

The Idea – In-Memory Hadoop Accelerator

Hadoop is based on two primary technologies: HDFS for storing data, and MapReduce for processing these data in parallel. Everything else in Hadoop and the Hadoop ecosystem sits atop these foundation blocks.

Originally, neither HDFS nor MapReduce were designed with real-time performance in mind. In order to deliver real-time processing without moving data out of Hadoop onto another platform, we had to improve the performance of both of these subsystems. (emphasis added)

The highlighted phrase is the key, isn’t it?

In order to deliver real-time processing without moving data out of Hadoop onto another platform

ETL means downtime, expense, and the risk of data corruption.

Given a choice between making your current data platform (of whatever type) more robust or risking a migration to a new data platform, which one would you choose?

Bear in mind those 2.5 million spreadsheets that Felienne mentions in her presentation.

Are you really sure you want to run ETL on all your data?

As opposed to making your most critical data more robust and enhancing it with other data, all while it stays where it lives right now?

Are you ready to get off the ETL merry-go-round?
