Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 12, 2013

Sinking Data to Neo4j from Hadoop with Cascading

Filed under: Cascading,Hadoop,Neo4j — Patrick Durusau @ 10:18 am

Sinking Data to Neo4j from Hadoop with Cascading by Paul Ingles.

From the post:

Recently, I worked with a colleague (Paul Lam, aka @Quantisan) on building a connector library to let Cascading interoperate with Neo4j: cascading.neo4j. Paul had been experimenting with Neo4j and Cypher to explore our data through graphs and we wanted an easy way to flow our existing data on Hadoop into Neo4j.

The data processing pipeline we’ve been growing at uSwitch.com is built around Cascalog, Hive, Hadoop and Kafka.

Once the data has been aggregated and stored, a lot of our ETL is performed with Cascalog and, by extension, Cascading. Querying/analysis is a mix of Cascalog and Hive. This layer is built upon our long-term data storage system: Hadoop; this, all combined, lets us store high-resolution data immutably at a much lower cost than uSwitch’s previous platform.

As Paul notes later in his post, this isn’t a fast solution: about 20,000 nodes a second.

But if that fits your requirements, it could be a good place to start.
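
To make the idea concrete, here is a rough sketch (mine, not Paul’s) of the shape of a Cascading flow that reads tab-delimited user records from HDFS and sinks every tuple into Neo4j. The Neo4jNodeTap class is a hypothetical stand-in for whatever tap/scheme cascading.neo4j actually provides; the rest is stock Cascading 2.x API.

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class UsersToNeo4j {
      public static void main(String[] args) {
        // Source: tab-delimited user records already sitting in HDFS.
        Fields userFields = new Fields("id", "email", "location");
        Tap source = new Hfs(new TextDelimited(userFields, "\t"), "hdfs:///data/users");

        // Sink: hypothetical stand-in for the cascading.neo4j tap, which writes
        // each tuple as a node to a Neo4j server.
        Tap sink = new Neo4jNodeTap("http://localhost:7474/db/data", userFields);

        // A pass-through pipe: every tuple read from the source lands in the sink.
        Pipe users = new Pipe("users");

        FlowDef flowDef = FlowDef.flowDef()
            .addSource(users, source)
            .addTailSink(users, sink);

        new HadoopFlowConnector().connect(flowDef).complete();
      }
    }

Paul’s library handles the node/relationship mapping; the point here is just that once data is expressed as Cascading tuples, swapping the sink from HDFS to Neo4j is a one-line change.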

March 9, 2013

The history of Hadoop:…

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:50 pm

The history of Hadoop: From 4 nodes to the future of data by Derrick Harris.

From the post:

Depending on how one defines its birth, Hadoop is now 10 years old. In that decade, Hadoop has gone from being the hopeful answer to Yahoo’s search-engine woes to a general-purpose computing platform that’s poised to be the foundation for the next generation of data-based applications.

Alone, Hadoop is a software market that IDC predicts will be worth $813 million in 2016 (although that number is likely very low), but it’s also driving a big data market the research firm predicts will hit more than $23 billion by 2016. Since Cloudera launched in 2008, Hadoop has spawned dozens of startups and spurred hundreds of millions in venture capital investment.

In this four-part series, we’ll explain everything anyone concerned with information technology needs to know about Hadoop. Part I is the history of Hadoop from the people who willed it into existence and took it mainstream. Part II is more graphic: a map of the now-large and complex ecosystem of companies selling Hadoop products. Part III is a look into the future of Hadoop that should serve as an opening salvo for much of the discussion at our Structure: Data conference March 20-21 in New York. Finally, Part IV will highlight some of the best Hadoop applications and seminal moments in Hadoop history, as reported by GigaOM over the years.

Whether you hope for insight into what makes a software paradigm successful or just want to enrich your knowledge of Hadoop’s history, this is a great start on a history of Hadoop!

Enjoy!

March 8, 2013

hadoop illuminated (book)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 2:06 pm

hadoop illuminated by Mark Kerzner and Sujee Maniyam.

Largely a subjective judgment, but I think the explanations of Hadoop are getting better.

Oh, the deep/hard stuff is still there, but the on-ramp for getting to that point has become easier.

This book is a case in point.

I first saw this in a tweet by Computer Science.

March 7, 2013

Million Song Dataset in Minutes!

Filed under: Hadoop,MapReduce,Mortar,Pig,Python — Patrick Durusau @ 3:50 pm

Million Song Dataset in Minutes! (Video)

Actually 5:35 as per the video.

The summary of the video reads:

Created Web Project [zero install]

Loaded data from S3

Developed in Pig and Python [watch for the drop-down menus of Pig fragments]

ILLUSTRATE’d our work [perhaps the most impressive feature; it tests code against a sample of data]

Ran on Hadoop [drop downs to create a cluster]

Downloaded results [50 “densest songs”, see the video]

It’s not all “hands free” or without intellectual effort on your part.

But it’s a major step towards a generally accessible interface for Hadoop/MapReduce data processing.

MortarData2013

Filed under: Hadoop,MapReduce,Mortar,Pig — Patrick Durusau @ 3:36 pm

MortarData2013

Mortar has its own YouTube channel!

Unlike the History Channel, the MortarData2013 channel is educational and entertaining.

I leave it to you to guess whether those two adjectives apply to the History Channel. (Hint: Thirty (30) minutes of any Vikings episode should help you answer.)

There is not a lot of content at the moment, but I am going to cover one of the videos that is there in a separate post.

March 6, 2013

Hadoop MapReduce: to Sort or Not to Sort

Filed under: Hadoop,MapReduce,Sorting — Patrick Durusau @ 7:22 pm

Hadoop MapReduce: to Sort or Not to Sort by Tendu Yogurtcu.

From the post:

What is the big deal about Sort? Sort is fundamental to the MapReduce framework; the data is sorted between the Map and Reduce phases (see the figure below). Syncsort’s contribution allows the native Hadoop sort to be replaced by an alternative sort implementation, for both the Map and Reduce sides, i.e. it makes the Sort phase pluggable.

[Figure: the MapReduce data flow, with the Sort phase between Map and Reduce]

Opening up the Sort phase to alternative implementations will facilitate new use cases and data flows in the MapReduce framework. Let’s look at some of these use cases:

The use cases include:

  • Optimized sort implementations.
  • Hash-based aggregations.
  • Ability to run a job with a subset of data.
  • Optimized full joins.

See Tendu’s post for the details.

I first saw this at Use Cases for Hadoop’s New Pluggable Sort by Alex Popescu.
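
For the curious, the pluggable hooks surface in Hadoop 2 as job configuration properties. A minimal sketch follows; the property names are as I recall them from the Hadoop 2 line, so verify them against your version, and the collector and shuffle classes named here are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class PluggableSortJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Map side: swap the buffer that collects and sorts map output.
        // The class named here is hypothetical; the default is Hadoop's own MapOutputBuffer.
        conf.set("mapreduce.job.map.output.collector.class",
                 "com.example.sort.AlternativeMapOutputCollector");

        // Reduce side: swap the shuffle/merge plugin that feeds the reducers.
        conf.set("mapreduce.job.reduce.shuffle.consumer.plugin.class",
                 "com.example.sort.AlternativeShuffleConsumerPlugin");

        Job job = Job.getInstance(conf, "pluggable-sort-demo");
        // ... set mapper, reducer, input and output paths as usual, then submit ...
      }
    }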

PolyBase

Filed under: Hadoop,HDFS,MapReduce,PolyBase,SQL,SQL Server — Patrick Durusau @ 11:20 am

PolyBase

From the webpage:

PolyBase is a fundamental breakthrough in data processing used in SQL Server 2012 Parallel Data Warehouse to enable truly integrated query across Hadoop and relational data.

Complementing Microsoft’s overall Big Data strategy, PolyBase is a breakthrough new technology on the data processing engine in SQL Server 2012 Parallel Data Warehouse designed as the simplest way to combine non-relational data and traditional relational data in your analysis. While customers would normally burden IT to pre-populate the warehouse with Hadoop data or undergo an extensive training on MapReduce in order to query non-relational data, PolyBase does this all seamlessly giving you the benefits of “Big Data” without the complexities.

I must admit I had my hopes up for the videos labeled: “Watch informative videos to understand PolyBase.”

But the first one was only 2:52 in length and the second was about the Jim Gray Systems Lab (2:13).

So, fair to say it was short on details. 😉

The closest thing I found to a clue was in the PolyBase datasheet (under PolyBase Use Cases, if you are reading along), which says:

PolyBase introduces the concept of external tables to represent data residing in HDFS. An external table defines a schema (that is, columns and their types) for data residing in HDFS. The table’s metadata lives in the context of a SQL Server database and the actual table data resides in HDFS.

I assume that means the same data in HDFS could have multiple external tables defined over it, depending upon the query?

I’m also curious whether the external tables and/or data types will have MapReduce capabilities built in, to take advantage of parallel processing of the data.

BTW, for topic map types, subject identities for the keys and data types would be the same as with more traditional “internal” tables, in case you want to merge data.

Just out of curiosity, any thoughts on possible IP on external schemas being applied to data?

I first saw this at Alex Popescu’s Microsoft PolyBase: Unifying Relational and Non-Relational Data.

March 4, 2013

GraphBuilder – A Scalable Graph Construction Library for Apache™ Hadoop™

Filed under: GraphBuilder,Graphs,Hadoop,MapReduce,Networks — Patrick Durusau @ 2:56 pm

GraphBuilder – A Scalable Graph Construction Library for Apache™ Hadoop™ by Theodore L. Willke, Nilesh Jain and Haijie Gu. (whitepaper)

Abstract:

The exponential growth in the pursuit of knowledge gleaned from data relationships that are expressed naturally as large and complex graphs is fueling new parallel machine learning algorithms. The nature of these computations is iterative and data-dependent. Recently, frameworks have emerged to perform these computations in a distributed manner at commercial scale. But feeding data to these frameworks is a huge challenge in itself. Since graph construction is a data-parallel problem, Hadoop is well-suited for this task but lacks some elements that would make things easier for data scientists that do not have domain expertise in distributed systems engineering. We developed GraphBuilder, a scalable graph construction software library for Apache Hadoop, to address this gap. GraphBuilder offloads many of the complexities of graph construction, including graph formation, tabulation, compression, transformation, partitioning, output formatting, and serialization. It is written in Java for ease of programming and scales using the MapReduce parallel programming model. We describe the motivation for GraphBuilder, its architecture, and present two case studies that provide a preliminary evaluation.

The “whitepaper” introduction to GraphBuilder.

March 3, 2013

Spring for Hadoop …

Filed under: Hadoop,Spring Hadoop — Patrick Durusau @ 3:41 pm

Spring for Hadoop simplifies application development

From the post:

After almost exactly a year of development, SpringSource has released Spring for Hadoop 1.0 with the goal of making the development of Hadoop applications easier for users of the distributed application framework. VMware engineer Costin Leau said in the release announcement that the company has often seen developers use the out-of-the-box tools that come with Hadoop in ways that lead to a “poorly structured collection of command line utilities, scripts and pieces of code stitched together.” Spring for Hadoop aims to change this by applying the Template API design pattern from Spring to Hadoop.

The framework provides helper classes such as HBaseTemplate, HiveTemplate and PigTemplate, which interface with the different parts of the Hadoop ecosystem; Java-centric APIs such as Cascading can also be used with or without additional configuration. The software enables Spring functionality such as thread-safe access to lower level resources and lightweight object mapping in Hadoop applications. Leau also says that Spring for Hadoop is designed to allow projects to grow organically. To do this, users can mix and match various runner classes for scripts and, as the complexity of the application increases, developers can migrate to Spring Batch and manage these processes through a REST-based API.

Spring for Hadoop 1.0 is available from the SpringSource web site under the Apache 2.0 License. The developers say they are testing the software daily against various Hadoop 1.x distributions such as Apache Hadoop and Greenplum HD, as well as Cloudera CDH3 and CDH4. Greenplum HD already includes Spring for Hadoop in its distribution. Support for Hadoop 2.x is expected “in the near future”.

I’m going to leave characterization of present methods of working with Hadoop for others. 😉
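
To show what the template classes in the quote buy you, here is a minimal sketch of reading an HBase table through HbaseTemplate. Method names follow the Spring for Hadoop 1.0 API as I recall it, so treat the details as assumptions; the “users” table and “info” column family are made up for the example.

    import java.util.List;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.springframework.data.hadoop.hbase.HbaseTemplate;
    import org.springframework.data.hadoop.hbase.RowMapper;

    public class UserEmailLookup {
      private final HbaseTemplate hbaseTemplate; // wired up with the cluster Configuration

      public UserEmailLookup(HbaseTemplate hbaseTemplate) {
        this.hbaseTemplate = hbaseTemplate;
      }

      // Scan the "info" column family of the "users" table and map each row to its email,
      // letting the template handle connection and resource management.
      public List<String> allEmails() {
        return hbaseTemplate.find("users", "info", new RowMapper<String>() {
          @Override
          public String mapRow(Result result, int rowNum) throws Exception {
            return Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email")));
          }
        });
      }
    }

The win is the same as with Spring’s other templates: the boilerplate of opening, using and releasing HBase resources moves out of your code and into the framework.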

Project Panthera…

Filed under: Hadoop,SQL — Patrick Durusau @ 1:38 pm

Project Panthera: Better Analytics with SQL and Hadoop

Another Hintel project focused on Hadoop.

From the project page:

We have worked closely with many enterprise users over the past few years to enhance their new data analytics platforms using the Hadoop stack. Increasingly, these platforms have evolved from a batch-style, custom-built system for unstructured data, to become an integral component of the enterprise application framework. While the Hadoop stack provides a solid foundation for these platforms, gaps remain; in particular, enterprises are looking for full SQL support to seamlessly integrate these new platforms into their existing enterprise data analytics infrastructure. Project Panthera is our open source efforts to provide efficient support of standard SQL features on Hadoop, so as to enable many important, advanced use cases not supported by Hadoop today, including:

  • Exploring data with complex and sophisticated SQL queries (such as nested subqueries with aggregation functions) – for instance, about half of the queries in TPC-H (a standard decision support benchmark) use subqueries
  • Efficient storage engine for high-update rate SQL query workloads – while HBase is often used to support such workloads, query processing (e.g., Hive) on HBase can incur significant overheads as the storage engine completely ignores the SQL relational model
  • Utilizations of new hardware platform technologies (e.g., new flash technologies and large RAM capacities available in modern servers) for efficient SQL query processing

The objective of Project Panthera is to collaborate with the larger Hadoop community in enhancing the SQL support of the platform for a broader set of use cases. We are building these new capabilities on top of the Hadoop stack, and contributing necessary improvements of the underlying stack back to the existing Apache Hadoop projects.

SQL is still alive! Who knew? 😉

A good example of new technologies not replacing old ones, but being grafted onto them.

With that grafting, semantic impedance between the systems remains.

You can remap over that impedance on an ad hoc and varying basis.

Or, you can create a mapping today that can be re-used tomorrow.

Which sounds like a better option to you?

Project Rhino

Filed under: Cybersecurity,Hadoop,MapReduce,Project Rhino,Security — Patrick Durusau @ 1:21 pm

Project Rhino

Is Wintel becoming Hintel? 😉

If history is a guide, that might not be a bad thing.

From the project page:

As Hadoop extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with all Hadoop projects and HBase must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.

The core of the Apache Hadoop ecosystem as it is commonly understood is:

  • Core: A set of shared libraries
  • HDFS: The Hadoop filesystem
  • MapReduce: Parallel computation framework
  • ZooKeeper: Configuration management and coordination
  • HBase: Column-oriented database on HDFS
  • Hive: Data warehouse on HDFS with SQL-like access
  • Pig: Higher-level programming language for Hadoop computations
  • Oozie: Orchestration and workflow management
  • Mahout: A library of machine learning and data mining algorithms
  • Flume: Collection and import of log and event data
  • Sqoop: Imports data from relational databases

These components are all separate projects, and therefore cross-cutting concerns like authN, authZ, a consistent security policy framework, a consistent authorization model and audit coverage are loosely coordinated. Some security features expected by our customers, such as encryption, are simply missing. Our aim is to take a full stack view and work with the individual projects toward consistent concepts and capabilities, filling gaps as we go.

Like I said, might not be a bad thing!

This is different from recent government rantings: it is focused on a particular stack, with the intent to analyze that stack, not the world at large, and to make specific improvements (read: measurable results).

March 2, 2013

Hadoop++ and HAIL [and LIAH]

Filed under: Hadoop,HAIL,MapReduce — Patrick Durusau @ 3:33 pm

Hadoop++ and HAIL

From the webpage:

Hadoop++

Hadoop++: Nowadays, working over very large data sets (Petabytes of information) is a common reality for several enterprises. In this context, query processing is a big challenge and becomes crucial. The Apache Hadoop project has been adopted by many famous companies to query their Petabytes of information. Some examples of such enterprises are Yahoo! and Facebook. Recently, some researchers from the database community indicated that Hadoop may suffer from performance issues when running analytical queries. We believe this is not an inherent problem of the MapReduce paradigm but rather some implementation choices done in Hadoop. Therefore, the overall goal of Hadoop++ project is to improve Hadoop’s performance for analytical queries. Already, our preliminary results show an improvement of Hadoop++ over Hadoop by up to a factor 20. In addition, we are currently investigating the impact of a number of other optimizations techniques.

[Image: the HAIL elephant logo]

HAIL (Hadoop Aggressive Indexing Library) is an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop and even outperforms Hadoop++.

Isn’t that a cool aggressive elephant?

But before you get too excited, consider:

Towards Zero-Overhead Adaptive Indexing in Hadoop by Stefan Richter, Jorge-Arnulfo Quiané-Ruiz, Stefan Schuh, Jens Dittrich.

Abstract:

Several research works have focused on supporting index access in MapReduce systems. These works have allowed users to significantly speed up selective MapReduce jobs by orders of magnitude. However, all these proposals require users to create indexes upfront, which might be a difficult task in certain applications (such as in scientific and social applications) where workloads are evolving or hard to predict. To overcome this problem, we propose LIAH (Lazy Indexing and Adaptivity in Hadoop), a parallel, adaptive approach for indexing at minimal costs for MapReduce systems. The main idea of LIAH is to automatically and incrementally adapt to users’ workloads by creating clustered indexes on HDFS data blocks as a byproduct of executing MapReduce jobs. Besides distributing indexing efforts over multiple computing nodes, LIAH also parallelises indexing with both map tasks computation and disk I/O. All this without any additional data copy in main memory and with minimal synchronisation. The beauty of LIAH is that it piggybacks index creation on map tasks, which read relevant data from disk to main memory anyways. Hence, LIAH does not introduce any additional read I/O-costs and exploit free CPU cycles. As a result and in contrast to existing adaptive indexing works, LIAH has a very low (or invisible) indexing overhead, usually for the very first job. Still, LIAH can quickly converge to a complete index, i.e. all HDFS data blocks are indexed. Especially, LIAH can trade early job runtime improvements with fast complete index convergence. We compare LIAH with HAIL, a state-of-the-art indexing technique, as well as with standard Hadoop with respect to indexing overhead and workload performance. In terms of indexing overhead, LIAH can completely index a dataset as a byproduct of only four MapReduce jobs while incurring a low overhead of 11% over HAIL for the very first MapReduce job only. In terms of workload performance, our results show that LIAH outperforms Hadoop by up to a factor of 52 and HAIL by up to a factor of 24.

The Information Systems Group at Saarland University, led by Prof. Dr. Jens Dittrich, is a place to watch.

An Overview of Scalding

Filed under: Hadoop,Scalding — Patrick Durusau @ 3:06 pm

An Overview of Scalding by Dean Wampler.

From the description:

Dean Wampler, Ph.D., is Principal Consultant at Think Big Analytics. In this video he covers Scalding’s benefits over the Java API, which include a dramatic reduction in the source code required (reflecting several Scala improvements over Java), full access to “functional programming” constructs that are ideal for data problems, and a Matrix library addition to support machine learning and other algorithms. He also demonstrates the benefits of Scalding using examples and explains just enough Scala syntax so you can follow along. Dean’s philosophy is that there is no better way to write general-purpose Hadoop MapReduce programs when specialized tools like Hive and Pig aren’t quite what you need. This presentation was given on February 12th at the Nokia offices in Chicago, IL.

Slides: slideshare.net/ChicagoHUG/scalding-for-hadoop

During this period of rapid innovation around “big data,” what interests me is the development of tools to fit problems.

As opposed to fitting problems to fixed data models and tools.

Both require a great deal of skill, but they are different skill sets.

Yes?

I first saw this at Alex Popescu’s myNoSQL.

March 1, 2013

Pig Eye for the SQL Guy

Filed under: Hadoop,MapReduce,Pig,SQL — Patrick Durusau @ 5:33 pm

Pig Eye for the SQL Guy by Cat Miller.

From the post:

For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.

As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, often the amounts of data are so large that working with them on a traditional SQL database is so non-performant as to be impractical.

Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)

This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.

Do you speak SQL?

Want to learn to speak Pig?

This is the right post for you!

WANdisco: Free Hadoop Training Webinars

Filed under: Hadoop,HBase,MapReduce — Patrick Durusau @ 5:31 pm

WANdisco: Free Hadoop Training Webinars

WANdisco has four Hadoop webinars to put on your calendar:

A Hadoop Overview

This webinar will include a review of major components including HDFS, MapReduce, and HBase – the NoSQL database management system used with Hadoop for real-time applications. An overview of Hadoop’s ecosystem will also be provided. Other topics covered will include a review of public and private cloud deployment options, and common business use cases.

Register now Weds, March 13, 10:00 a.m. PT/1:00 p.m. ET

A Hadoop Deep Dive

This webinar will cover Hadoop misconceptions (not all clusters are thousands of machines), information about real world Hadoop deployments, a detailed review of Hadoop’s ecosystem (Sqoop, Flume, Nutch, Oozie, etc.), an in-depth look at HDFS, and an explanation of MapReduce in relation to latency and dependence on other Hadoop activities.

This webinar will introduce attendees to concepts they will need as a prerequisite for subsequent training webinars covering MapReduce, HBase and other major components at a deeper technical level.

Register now Weds, March 27, 10:00 a.m. PT/1:00 p.m. ET

Hadoop: A MapReduce Tutorial

This webinar will cover MapReduce at a deep technical level.

This session will cover the history of MapReduce, how a MapReduce job works, its logical flow, the rules and types of MapReduce jobs, de-bugging and testing MapReduce jobs, writing foolproof MapReduce jobs, various workflow tools that are available, and more.

Register now Weds, April 10, 10:00 a.m. PT/1:00 p.m. ET

Hadoop: HBase In-Depth

This webinar will provide a deep technical review of HBase, and cover flexibility, scalability, components (cells, rows, columns, qualifiers), schema samples, hardware requirements and more.

Register now Weds, April 24, 10:00 a.m. PT/1:00 p.m. ET

I first saw this at: WANdisco Announces Free Hadoop Training Webinars.

A post with no link to WANdisco or to registration for any of the webinars.

If you would prefer that I put in fewer hyperlinks to resources, please let me know.

February 27, 2013

R and Hadoop Data Analysis – RHadoop

Filed under: Hadoop,R,RHadoop — Patrick Durusau @ 5:34 pm

R and Hadoop Data Analysis – RHadoop by Istvan Szegedi.

From the post:

R is a programming language and a software suite used for data analysis, statistical computing and data visualization. It is highly extensible and has object oriented features and strong graphical capabilities. At its heart R is an interpreted language and comes with a command line interpreter – available for Linux, Windows and Mac machines – but there are IDEs as well to support development like RStudio or JGR.

R and Hadoop can complement each other very well, they are a natural match in big data analytics and visualization. One of the most well-known R packages to support Hadoop functionalities is RHadoop that was developed by RevolutionAnalytics.

Nice introduction that walks you through installation and illustrates the use of RHadoop for analysis.

The ability to analyze “big data” is becoming commonplace.

The more that becomes a reality, the greater the burden on the user to critically evaluate the analysis that produced the “answers.”

Yes, repeatable analysis yielded answer X, but that just means applying the same assumptions to the same data gave the same result.

The same could be said about division by zero, although no one would write home about it.

Big Data Central

Filed under: BigData,Hadoop,MapReduce — Patrick Durusau @ 5:34 pm

Big Data Central by LucidWorks™

From LucidWorks™ Launches Big Data Central:

The new website, Big Data Central, is meant to become the primary source of educational materials, case studies, trends, and insights that help companies navigate the changing data management landscape. At Big Data Central, visitors can find, and contribute to, a wide variety of information including:

  • Use cases and best practices that highlight lessons learned from peers
  • Industry and analyst reports that track trends and hot topics
  • Q&As that answer some of the most common questions plaguing firms today about Big Data implementations

Definitely one for the news feed!

Did EMC Just Say Fork You To The Hadoop Community? [No,…]

Filed under: Hadoop,Open Source — Patrick Durusau @ 5:34 pm

Did EMC Just Say Fork You To The Hadoop Community? by Shaun Connolly.

I need to quote Shaun for context before I explain why my answer is no.

All in on Hadoop?

Glancing at the Pivotal HD diagram in the GigaOM article, they’ve made it easy to distinguish the EMC proprietary components in Blue from the Apache Hadoop-related components in Green. And based on what Scott Yara says “We literally have over 300 engineers working on our Hadoop platform”.

Wow, that’s a lot of engineers focusing on Hadoop! Since Scott Yara admitted that “We’re all in on Hadoop, period.”, a large number of those engineers must be working on the open source Apache Hadoop-related projects labeled in Green in the diagram, right?

So a simple question is worth asking: How many of those 300 engineers are actually committers* to the open source projects Apache Hadoop, Apache Hive, Apache Pig, and Apache HBase?

John Furrier actually asked this question on Twitter and got a reply from Donald Miner from the Greenplum team. The thread is as follows:

[Image: Twitter thread between John Furrier and Donald Miner]

Since I agree with John Furrier that understanding the number of committers is kinda related to the context of Scott Yara’s claim, I did a quick scan through the committers pages for Hadoop, Hive, Pig and HBase to seek out the large number of EMC engineers spending their time improving these open source projects. Hmmm….my quick scan yielded a curious absence of EMC engineers directly contributing to these Apache projects. Oh well, I guess the vast majority of those 300 engineers are working on the EMC proprietary technology in the blue boxes.

Why Do Committers Matter?

Simply put: Just because you can read Moby-Dick doesn’t make you talented enough to have authored it.

Committers matter because they are the talented authors who devote their time and energy on working within the Apache Software Foundation community adding features, fixing bugs, and reviewing and approving changes submitted by the other committers. At Hortonworks, we have over 50 committers, across the various Hadoop-related projects, authoring code and working with the community to make their projects better.

This is simply how the community-driven open source model works. And believe it or not, you actually have to be in the community before you can claim you are leading the community and authoring the code!

So when EMC says they are “all-in on Hadoop” but have nary a committer in sight, then that must mean they are “all-in for harvesting the work done by others in the Hadoop community”. Kind of a neat marketing trick, don’t you think?

Scott Yara effectively says that it would take about $50 to $100 million dollars and 300 engineers to do what they’ve done. Sounds expensive, hard, and untouchable doesn’t it? Well, let’s take a close look at the Apache Hadoop community in comparison. Over the lifetime of just the Apache Hadoop project, there have been over 1200 people across more than 80 different companies or entities who have contributed code to Hadoop. Mr. Yara, I’ll see your 300 and raise you a community!

I say no because I remember another Apache project, the Apache webserver.

At last count, the Apache webserver has 63% of the market. The nearest competitor is Microsoft-IIS with 16.6%. Microsoft is in the Hadoop fold thanks to Hortonworks. Assuming Nginx to be the equivalent of Cloudera, there is another 15% of the market. (From Usage of web servers for websites)

If my math is right, that’s approximately 95% of the market.*

The longer EMC remains in self-imposed exile, the more its “Hadoop improvements” will drift from the mainstream releases.

So, my answer is: No, EMC has announced they are forking themselves.

That will carry reward enough without the Hadoop community fretting overmuch about it.


* Yes, the market share is speculation on my part but has more basis in reality than Mandiant’s claims about Chinese hackers.

Apache Pig: It goes to 0.11

Filed under: Hadoop,MapReduce,Pig — Patrick Durusau @ 5:33 pm

Apache Pig: It goes to 0.11

From the post:

After months of work, we are happy to announce the 0.11 release of Apache Pig. In this blog post, we highlight some of the major new features and performance improvements that were contributed to this release. A large chunk of the new features was created by Google Summer of Code (GSoC) students with supervision from the Apache Pig PMC, while the core Pig team focused on performance improvements, usability issues, and bug fixes. We encourage CS students to consider applying for GSOC in 2013 — it’s a great way to contribute to open source software.

This blog post hits some of the highlights of the release. Pig users may also find a presentation by Daniel Dai, which includes code and output samples for the new operators, helpful.

And from Hortonworks’ post on the release:

  • A DateTime datatype, documentation here.
  • A RANK function, documentation here.
  • A CUBE operator, documentation here.
  • Groovy UDFs, documentation here.

If you remember Robert Barta’s Cartesian expansion of tuples, you will find it in the CUBE operator.

Microsoft and Hadoop, Sitting in a Tree…*

Filed under: Hadoop,Hortonworks,MapReduce,Microsoft — Patrick Durusau @ 2:55 pm

Putting the Elephant in the Window by John Kreisa.

From the post:

For several years now Apache Hadoop has been fueling the fast growing big data market and has become the defacto platform for Big Data deployments and the technology foundation for an explosion of new analytic applications. Many organizations turn to Hadoop to help tame the vast amounts of new data they are collecting but in order to do so with Hadoop they have had to use servers running the Linux operating system. That left a large number of organizations who standardize on Windows (According to IDC, Windows Server owned 73 percent of the market in 2012 – IDC, Worldwide and Regional Server 2012–2016 Forecast, Doc # 234339, May 2012) without the ability to run Hadoop natively, until today.

We are very pleased to announce the availability of Hortonworks Data Platform for Windows providing organizations with an enterprise-grade, production-tested platform for big data deployments on Windows. HDP is the first and only Hadoop-based platform available on both Windows and Linux and provides interoperability across Windows, Linux and Windows Azure. With this release we are enabling a massive expansion of the Hadoop ecosystem, inviting new participants in the community of developers, data scientists, data management professionals and Hadoop fans to build and run applications for Apache Hadoop natively on Windows. This is great news for Windows-focused enterprises, service providers, software vendors and developers; in particular, they can get going today with Hadoop simply by visiting our download page.

This release would not be possible without a strong partnership and close collaboration with Microsoft. Through the process of creating this release, we have remained true to our approach of community-driven enterprise Apache Hadoop by collecting enterprise requirements, developing them in open source and applying enterprise rigor to produce a 100-percent open source enterprise-grade Hadoop platform.

Now there is a very smart marketing move!

A smaller share of a larger market is always better than a large share of a small market.

(You need to be writing down these quips.) 😉

Seriously, take note of how Hortonworks used the open source model.

They did not build Hadoop in their image and try to sell it to the world.

Hortonworks gathered requirements from others and built Hadoop to meet their needs.

Open source model in both cases, very different outcomes.

* I didn’t remember the rhyme beyond the opening line. Consulting the oracle (Wikipedia), I discovered Playground song. 😉

February 22, 2013

Hadoop Adds Red Hat [More Hadoop Silos Coming]

Filed under: Hadoop,MapReduce,Red Hat,Semantic Diversity,Semantic Inconsistency — Patrick Durusau @ 1:27 pm

Red Hat Unveils Big Data and Open Hybrid Cloud Direction

From the post:

Red Hat, Inc. (NYSE: RHT), the world’s leading provider of open source solutions, today announced its big data direction and solutions to satisfy enterprise requirements for highly reliable, scalable, and manageable solutions to effectively run their big data analytics workloads. In addition, Red Hat announced that the company will contribute its Red Hat Storage Hadoop plug-in to the Apache™ Hadoop® open community to transform Red Hat Storage into a fully-supported, Hadoop-compatible file system for big data environments, and that Red Hat is building a robust network of ecosystem and enterprise integration partners to deliver comprehensive big data solutions to enterprise customers. This is another example of Red Hat’s strategic commitment to big data customers and its continuing efforts to provide them with enterprise solutions through community-driven innovation.

The more Hadoop grows, the more Hadoop silos will as well.

You will need Hadoop and semantic skills to wire Hadoop silos together.

Re-wire with topic maps to avoid re-wiring the same Hadoop silos over and over again.

I first saw this at Red Hat reveal big data plans, open sources HDFS replacement by Elliot Bentley.

February 21, 2013

Hadoop silos need integration…

Filed under: Data Integration,Hadoop,Semantic Diversity,Semantic Inconsistency — Patrick Durusau @ 7:50 pm

Hadoop silos need integration, manage all data as asset, say experts by Brian McKenna.

From the post:

Big data hype has caused infantile disorders in corporate organisations over the past year. Hadoop silos, an excess of experimentation, and an exaggeration of the importance of data scientists are among the teething problems of big data, according to experts, who suggest organisations should manage all data as an asset.

Steve Shelton, head of data services at consultancy Detica, part of BAE Systems, said Hadoop silos have become part of the enterprise IT landscape, both in the private and public sectors. “People focused on this new thing called big data and tried to isolate it [in 2011 and 2012],” he said.

The focus has been too concentrated on non-traditional data types, and that has been driven by the suppliers. The business value of data is more effectively understood when you look at it all together, big or otherwise, he said.

Have big data technologies been a distraction? “I think it has been an evolutionary learning step, but businesses are stepping back now. When it comes to information governance, you have to look at data across the patch,” said Shelton.

He said Detica had seen complaints about Hadoop silos, and these were created by people going through a proof-of-concept phase, setting up a Hadoop cluster quickly and building a team. But a Hadoop platform involves extra costs on top, in terms of managing it and integrating it into your existing business processes.

“It’s not been a waste of time and money, it is just a stage. And it is not an insurmountable challenge. The next step is to integrate those silos, but the thinking is immature relative to the technology itself,” said Shelton.

I take this as encouraging news for topic maps.

Semantically diverse data has been stored in semantically diverse datastores. Data which, if integrated, could provide business value.

Again.

There will always be a market for topic maps because people can’t stop creating semantically diverse data and data stores.

How’s that for long term market security?

No matter what data or data storage technology arises, semantic inconsistency will be with us always.

February 20, 2013

Cascading into Hadoop with SQL

Filed under: Cascading,Hadoop,Lingual,SQL — Patrick Durusau @ 9:24 pm

Cascading into Hadoop with SQL by Nicole Hemsoth.

From the post:

Today Concurrent, the company behind the Cascading Hadoop abstraction framework, announced a new trick to help developers tame the elephant.

The company, which is focused on simplifying Hadoop, has introduced a SQL parser that sits on top of Cascading with a JDBC interface. Concurrent says that they’ll be pushing it out over the next couple of weeks with hopes that developers will take it under their wing and support the project.

According to the company’s CTO and founder, Chris Wensel, the goal is to get the community to rally around a new way to let non-programmers make use of data that’s locked in Hadoop clusters and let them more easily move applications onto Hadoop clusters.

The newly-announced approach to extending the abstraction is called Lingual, which is aimed at putting Hadoop within closer sights for those familiar with SQL, JDBC and traditional BI tools. It provides what the company calls “true SQL for Cascading and Hadoop” to enable easier creation and running of applications on Hadoop and again, to tap into that growing pool of Hadoop-seekers who lack the expertise to back mission-critical apps on the platform.

Wensel says that Lingual’s goal is to provide an ANSI-standard SQL interface that is designed to play well with all of the big name distros running on site or in cloud environments. This will allow a “cut and paste” capability for existing ANSI SQL code from traditional data warehouses so users can access data that’s locked away on a Hadoop cluster. It’s also possible to query and export data from Hadoop right into a wide range of BI tools.

Another example of meeting a large community of users where they are, not where you would like for them to be.

Targeting a market that already exists is easier than building a new one from the ground up.
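
Since Lingual exposes a JDBC interface, client code should look like any other JDBC client. Here is a minimal sketch; the driver class name, the connection URL and the “employees” table are all assumptions on my part.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LingualQuery {
      public static void main(String[] args) throws Exception {
        // Driver class and URL scheme are assumptions; "local" would run against the
        // local platform, while other platform names would target a Hadoop cluster.
        Class.forName("cascading.lingual.jdbc.Driver");
        try (Connection con = DriverManager.getConnection("jdbc:lingual:local");
             Statement stmt = con.createStatement();
             // "employees" is a hypothetical table registered in the Lingual catalog.
             ResultSet rs = stmt.executeQuery("SELECT name, location FROM employees")) {
          while (rs.next()) {
            System.out.println(rs.getString("name") + "\t" + rs.getString("location"));
          }
        }
      }
    }

That familiarity is the whole point: existing BI tools and warehouse habits carry over, with Cascading translating the SQL into flows on the cluster.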

Securing Hadoop with Knox Gateway

Filed under: Hadoop,Knox Gateway,Security — Patrick Durusau @ 9:23 pm

Securing Hadoop with Knox Gateway by Kevin Minder

From the post:

Back in the day, in order to secure a Hadoop cluster all you needed was a firewall that restricted network access to only authorized users. This eventually evolved into a more robust security layer in Hadoop… a layer that could augment firewall access with strong authentication. Enter Kerberos. Around 2008, Owen O’Malley and a team of committers led this first foray into security and today, Kerberos is still the primary way to secure a Hadoop cluster.

Fast-forward to today… Widespread adoption of Hadoop is upon us. The enterprise has placed requirements on the platform to not only provide perimeter security, but to also integrate with all types of authentication mechanisms. Oh yeah, and all the while, be easy to manage and to integrate with the rest of the secured corporate infrastructure. Kerberos can still be a great provider of the core security technology but with all the touch-points that a user will have with Hadoop, something more is needed.

The time has come for Knox.

Timely news of an effort at security that doesn’t depend upon obscurity (or inner circle secrecy).

Hadoop installations, whether in topic map workflows or not, need to pay attention to this project.
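
What that looks like from the client side: instead of talking to the NameNode and friends directly, a client makes authenticated HTTPS calls to the gateway, which proxies them into the cluster. A rough sketch, where the gateway host, topology name (“sample”) and credentials are all hypothetical, and TLS truststore setup is omitted:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;

    public class KnoxWebHdfsClient {
      public static void main(String[] args) throws Exception {
        // Hypothetical gateway address and topology; the client authenticates to Knox,
        // never directly to the Hadoop cluster behind it.
        URL url = new URL("https://knox.example.com:8443/gateway/sample/webhdfs/v1/tmp?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        String credentials = Base64.getEncoder()
            .encodeToString("guest:guest-password".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + credentials);

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line); // JSON directory listing returned through the gateway
          }
        }
      }
    }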

February 18, 2013

So, what’s brewing with HCatalog

Filed under: Hadoop,HCatalog — Patrick Durusau @ 11:36 am

So, what’s brewing with HCatalog

From the post:

Apache HCatalog announced release of version 0.5.0 in the past week. Along with that, it has initiated steps to graduate from an incubator project to be an Apache Top Level project or sub-project. Let’s look at the current state of HCatalog, its increasing relevance and where it is heading.

HCatalog, for a small introduction, is a “table management and storage management layer for Apache Hadoop” which:

  • enables Pig, MapReduce, and Hive users to easily share data on the grid.
  • provides a table abstraction for a relational view of data in HDFS
  • ensures format indifference (viz RCFile format, text files, sequence files)
  • provides a notification service when new data becomes available

Nice summary of the current state of HCatalog, pointing to a presentation by Alan Gates from Big Data Spain 2012.
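
As a sketch of what that table abstraction buys a plain MapReduce job: the job names a database and table, and HCatalog resolves the schema, storage format and location, so the code never hard-codes HDFS paths. Class and method names follow the HCatalog 0.5-era API as I recall it, so treat them as assumptions; the “web_logs” table is made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hcatalog.mapreduce.HCatInputFormat;
    import org.apache.hcatalog.mapreduce.InputJobInfo;

    public class ReadHCatTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "read-hcatalog-table");

        // Read the "web_logs" table from the "default" database; HCatalog supplies the
        // schema and storage details, and mappers receive HCatRecord values.
        HCatInputFormat.setInput(job, InputJobInfo.create("default", "web_logs", null));
        job.setInputFormatClass(HCatInputFormat.class);

        // ... set mapper, reducer and output format as usual, then submit ...
      }
    }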

Real World Hadoop – Implementing a Left Outer Join in Map Reduce

Filed under: Hadoop,MapReduce — Patrick Durusau @ 6:25 am

Real World Hadoop – Implementing a Left Outer Join in Map Reduce by Matthew Rathbone.

From the post:

This article is part of my guide to map reduce frameworks, in which I implement a solution to a real-world problem in each of the most popular hadoop frameworks.

If you’re impatient, you can find the code for the map-reduce implementation on my github, otherwise, read on!

The Problem
Let me quickly restate the problem from my original article.

I have two datasets:

  1. User information (id, email, language, location)
  2. Transaction information (transaction-id, product-id, user-id, purchase-amount, item-description)

Given these datasets, I want to find the number of unique locations in which each product has been sold.

Not as easy a problem as it appears. But I suspect a common one in practice.
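
This isn’t Matthew’s code, but a sketch of the usual reduce-side left outer join shape for the first stage: tag records from each input with their origin, key both by user-id, and resolve the join in the reducer. Input is assumed to be tab-delimited; a second pass (or a quick Pig/Hive query over the output) would then count distinct locations per product.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class UserTransactionJoin {

      // Users: id, email, language, location -> key by user id, tag the value "U".
      public static class UserMapper extends Mapper<Object, Text, Text, Text> {
        protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] f = value.toString().split("\t");
          ctx.write(new Text(f[0]), new Text("U\t" + f[3])); // user-id -> location
        }
      }

      // Transactions: transaction-id, product-id, user-id, amount, description -> tag "T".
      public static class TxnMapper extends Mapper<Object, Text, Text, Text> {
        protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] f = value.toString().split("\t");
          ctx.write(new Text(f[2]), new Text("T\t" + f[1])); // user-id -> product-id
        }
      }

      // Left outer join: every transaction is emitted, with the location if a user record exists.
      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text userId, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          String location = null;
          List<String> products = new ArrayList<String>();
          for (Text v : values) {
            String[] f = v.toString().split("\t");
            if ("U".equals(f[0])) location = f[1];
            else products.add(f[1]);
          }
          for (String product : products) {
            ctx.write(new Text(product), new Text(location == null ? "UNKNOWN" : location));
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(UserTransactionJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, UserMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, TxnMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note the buffering of products in the reducer: if one user had an enormous number of transactions that list could get large, which is one reason the frameworks in Matthew’s guide are worth a look.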

Clydesdale: Structured Data Processing on MapReduce

Filed under: Clydesdale,Hadoop,MapReduce — Patrick Durusau @ 6:16 am

Clydesdale: Structured Data Processing on MapReduce by Tim Kaldewey, Eugene J. Shekita, Sandeep Tata.

Abstract:

MapReduce has emerged as a promising architecture for large scale data analytics on commodity clusters. The rapid adoption of Hive, a SQL-like data processing language on Hadoop (an open source implementation of MapReduce), shows the increasing importance of processing structured data on MapReduce platforms. MapReduce offers several attractive properties such as the use of low-cost hardware, fault-tolerance, scalability, and elasticity. However, these advantages have required a substantial performance sacrifice.

In this paper we introduce Clydesdale, a novel system for structured data processing on Hadoop – a popular implementation of MapReduce. We show that Clydesdale provides more than an order of magnitude in performance improvements compared to existing approaches without requiring any changes to the underlying platform. Clydesdale is aimed at workloads where the data fits a star schema. It draws on column oriented storage, tailored join-plans, and multicore execution strategies and carefully fits them into the constraints of a typical MapReduce platform. Using the star schema benchmark, we show that Clydesdale is on average 38x faster than Hive. This demonstrates that MapReduce in general, and Hadoop in particular, is a far more compelling platform for structured data processing than previous results suggest. (emphasis in original)

The authors make clear that Clydesdale is a research prototype and lacks many features needed for full production use.

But an order of magnitude, and sometimes two, of improvement should pique your interest in helping with such improvements.

I find the “re-use” of existing Hadoop infrastructure particularly exciting.

Order-of-magnitude or greater gains within current approaches are a signal someone is thinking about the issues and not simply throwing horsepower at a problem.

I first saw this in NoSQL Weekly, Issue 116.

February 15, 2013

The Family of MapReduce and Large Scale Data Processing Systems

Filed under: Hadoop,MapReduce — Patrick Durusau @ 2:03 pm

The Family of MapReduce and Large Scale Data Processing Systems by Sherif Sakr, Anna Liu, Ayman G. Fayoumi.

Abstract:

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.

At twenty-seven pages and one hundred and thirty-five references, this is one for the weekend and perhaps beyond!

Definitely a paper to master if you are interested in seeing the next generation of MapReduce techniques before your competition.

I first saw this at Alex Popescu’s The Family of MapReduce and Large Scale Data Processing Systems.

February 13, 2013

Imperative and Declarative Hadoop: TPC-H in Pig and Hive

Filed under: Hadoop,Hive,MapReduce,Pig,TPC-H — Patrick Durusau @ 11:41 am

Imperative and Declarative Hadoop: TPC-H in Pig and Hive by Russell Jurney.

From the post:

According to the Transaction Processing Council, TPC-H is:

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

TPC-H was implemented for Hive in HIVE-600 and for Pig in PIG-2397 by Hortonworks intern Jie Li. In going over this work, I was struck by how it outlined differences between Pig and SQL.

There seems to be a tendency for simple SQL to provide greater clarity than Pig. At some point, as the TPC-H queries become more demanding, complex SQL seems to have less clarity than the comparable Pig. Let’s take a look.
(emphasis in original)

A refresher on the lesson that which solution you need, in this case Hive or Pig, depends upon your requirements.

Use either one blindly and you risk poor performance or failing to meet other requirements.

February 9, 2013

Production-Ready Hadoop 2 Distribution

Filed under: Hadoop,MapReduce,Marketing — Patrick Durusau @ 8:21 pm

WANdisco Launches Production-Ready Hadoop 2 Distribution

From the post:

WANdisco today announced it has made its WANdisco Distro (WDD) available for free download.

WDD is a production-ready version powered by Apache Hadoop 2 based on the most recent release, including the latest fixes. These certified Apache Hadoop binaries undergo the same quality assurance process as WANdisco’s enterprise software solutions.

The WDD team is led by Dr. Konstantin Boudnik, who is one of the original Hadoop developers, has been an Apache Hadoop committer since 2009 and served as a Hadoop architect with Yahoo! This team of Hadoop development, QA and support professionals is focused on software quality. WANdisco’s Apache Hadoop developers have been involved in the open source project since its inception and have the authority within the Apache Hadoop community to make changes to the code base, for fast fixes and enhancements.

By adding its active-active replication technology to WDD, WANdisco is able to eliminate the single points of failure (SPOFs) and performance bottlenecks inherent in Hadoop. With this technology, the same data is simultaneously readable and writable on every server, and every server is actively supporting user requests. There are no passive or standby servers with complex administration procedures required for failover and recovery.

WANdisco (Somehow the quoted post failed to include the link.)

Download WANdisco Distro (WDD)

Two versions for download:

64-bit WDD v3.1.0 for RHEL 6.1 and above

64-bit WDD v3.1.0 for CentOS 6.1 and above

You do have to register and are emailed a download link.

I know marketing people have a formula that if you pester 100 people you will make N sales.

I suppose, but if your product is compelling enough, people are going to be calling you.

When was the last time you heard of a drug dealer making cold calls to sell dope?
