Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

March 17, 2012

Advanced Analytics with R and SAP HANA

Filed under: Hadoop,Oracle,R,SAP — Patrick Durusau @ 8:20 pm

Advanced Analytics with R and SAP HANA. Slides by Jitender Aswani and Jens Doerpmund.

Ajay Ohri reports that SAP is following Oracle in using R. And we have all heard about Hadoop and R.

Question: What language for analytics are you going to start learning for Oracle, SAP and Hadoop? (To say nothing of mining/analysis for topic maps.)

March 14, 2012

R and Hadoop: Step-by-step tutorials

Filed under: Hadoop,R — Patrick Durusau @ 7:36 pm

R and Hadoop: Step-by-step tutorials by David Smith.

From the post:

At the recent Big Data Workshop held by the Boston Predictive Analytics group, airline analyst and R user Jeffrey Breen gave a step-by-step guide to setting up an R and Hadoop infrastructure. Firstly, as a local virtual instance of Hadoop with R, using VMWare and Cloudera's Hadoop Demo VM. (This is a great way to get familiar with Hadoop.) Then, as single-machine cloud-based instance with lots of RAM and CPU, using Amazon EC2. (Good for more Hadoop experimentation, now with more realistic data sizes.) And finally, as a true distributed Hadoop cluster in the cloud, using Apache whirr to spin up multiple nodes running Hadoop and R.

More pointers and resources await you at David’s post.

HBase + Hadoop + Xceivers

Filed under: Hadoop,HBase — Patrick Durusau @ 7:35 pm

HBase + Hadoop + Xceivers by Lars George.

From the post:

Introduction

Some of the configuration properties found in Hadoop have a direct effect on clients, such as HBase. One of those properties is called “dfs.datanode.max.xcievers”, and belongs to the HDFS subproject. It defines the number of server side threads and – to some extent – sockets used for data connections. Setting this number too low can cause problems as you grow or increase utilization of your cluster. This post will help you to understand what happens between the client and server, and how to determine a reasonable number for this property.

The Problem

Since HBase is storing everything it needs inside HDFS, the hard upper boundary imposed by the “dfs.datanode.max.xcievers” configuration property can result in too few resources being available to HBase, manifesting itself as IOExceptions on either side of the connection.

This is a true sysadmin type post.

Error messages say “DataXceiver,” but the property you set is “dfs.datanode.max.xcievers.” The post notes that “xcievers” is misspelled.

Detailed coverage of the nature of the problem, complete with sample log entries. Along with suggested solutions.

And word of ongoing work to improve the situation.

If you are using HBase and Hadoop, put a copy of this with your sysadmin stuff.
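For the curious, here is a minimal sketch (mine, not from Lars’s post) of checking the current limit through the stock Hadoop Configuration API. It assumes hdfs-site.xml is on the classpath, and the 256 default and 4096 suggestion are the commonly cited numbers of the CDH3/HBase era, so treat them as illustrative rather than authoritative.

```java
import org.apache.hadoop.conf.Configuration;

/** Reports the configured DataNode xceiver limit and flags a low value. */
public class XceiverCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml"); // assumes the cluster config is on the classpath
        // Note the historical misspelling: the property really is "xcievers".
        int limit = conf.getInt("dfs.datanode.max.xcievers", 256); // 256 was the old default
        System.out.println("dfs.datanode.max.xcievers = " + limit);
        if (limit < 4096) { // 4096 is the value commonly recommended for HBase workloads
            System.out.println("Consider raising the limit before the DataNodes run dry.");
        }
    }
}
```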

March 12, 2012

Introducing Spring Hadoop

Filed under: Hadoop,Spring Hadoop — Patrick Durusau @ 8:04 pm

Introducing Spring Hadoop by Costin Leau.

From the post:

I am happy to announce that the first milestone release (1.0.0.M1) for Spring Hadoop project is available and talk about some of the work we have been doing over the last few months. Part of the Spring Data umbrella, Spring Hadoop provides support for developing applications based on Hadoop technologies by leveraging the capabilities of the Spring ecosystem. Whether one is writing stand-alone, vanilla MapReduce applications, interacting with data from multiple data stores across the enterprise, or coordinating a complex workflow of HDFS, Pig, or Hive jobs, or anything in between, Spring Hadoop stays true to the Spring philosophy offering a simplified programming model and addresses "accidental complexity" caused by the infrastructure. Spring Hadoop, provides a powerful tool in the developer arsenal for dealing with big data volumes.

I rather like that, “accidental complexity.” 😉

Still, if you are learning Hadoop, Spring Hadoop may ease the learning curve. Not to mention making application development easier. Your mileage may vary but it is worth a long look.

Graph Degree Distributions using R over Hadoop

Filed under: Graphs,Hadoop,R — Patrick Durusau @ 8:04 pm

Graph Degree Distributions using R over Hadoop

From the post:

The purpose of this post is to demonstrate how to express the computation of two fundamental graph statistics — each as a graph traversal and as a MapReduce algorithm. The graph engines explored for this purpose are Neo4j and Hadoop. However, with respects to Hadoop, instead of focusing on a particular vertex-centric BSP-based graph-processing package such as Hama or Giraph, the results presented are via native Hadoop (HDFS + MapReduce). Moreover, instead of developing the MapReduce algorithms in Java, the R programming language is used. RHadoop is a small, open-source package developed by Revolution Analytics that binds R to Hadoop and allows for the representation of MapReduce algorithms using native R.

The two graph algorithms presented compute degree statistics: vertex in-degree and graph in-degree distribution. Both are related, and in fact, the results of the first can be used as the input to the second. That is, graph in-degree distribution is a function of vertex in-degree. Together, these two fundamental statistics serve as a foundation for more quantifying statistics developed in the domains of graph theory and network science.

Observes that graphs of up to 10 billion elements (nodes + edges) can be handled by a single server. In the 100 billion element range, multiple servers are required.

Despite the emphasis on “big data,” 10 billion elements would be sufficient for many purposes.

Interesting use of R with Hadoop.
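The post does everything in R via RHadoop; purely as a point of comparison, here is a rough plain-Java MapReduce sketch of the vertex in-degree step. The tab-separated “source, target” edge-list format is my assumption for illustration, not the post’s data layout.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Emits (target vertex, 1) for every edge; the reducer sums to get in-degree. */
public class InDegree {

    public static class EdgeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text target = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] edge = line.toString().split("\t"); // "source <TAB> target"
            if (edge.length == 2) {
                target.set(edge[1]);
                ctx.write(target, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text vertex, Iterable<IntWritable> ones, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable one : ones) sum += one.get();
            ctx.write(vertex, new IntWritable(sum)); // (vertex, in-degree)
        }
    }
}
```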

March 11, 2012

Talend Open Studio for Big Data w/ Hadoop

Filed under: Hadoop,MapReduce,Talend,Tuple MapReduce,Tuple-Join MapReduce — Patrick Durusau @ 8:09 pm

Talend Empowers Apache Hadoop Community with Talend Open Studio for Big Data

From the post:

Talend, a global leader in open source integration software, today announced the availability of Talend Open Studio for Big Data, to be released under the Apache Software License. Talend Open Studio for Big Data is based on the world’s most popular open source integration product, Talend Open Studio, augmented with native support for Apache Hadoop. In addition, Talend Open Studio for Big Data will be bundled in Hortonworks’ leading Apache Hadoop distribution, Hortonworks Data Platform, constituting a key integration component of Hortonworks Data Platform, a massively scalable, 100 percent open source platform for storing, processing and analyzing large volumes of data.

Talend Open Studio for Big Data is a powerful and versatile open source solution for data integration that dramatically improves the efficiency of integration job design through an easy-to-use graphical development environment. Talend Open Studio for Big Data provides native support for Hadoop Distributed File System (HDFS), Pig, HBase, Sqoop and Hive. By leveraging Hadoop’s MapReduce architecture for highly-distributed data processing, Talend generates native Hadoop code and runs data transformations directly inside Hadoop for maximum scalability. This feature enables organizations to easily combine Hadoop-based processing, with traditional data integration processes, either ETL or ELT-based, for superior overall performance.

“By making Talend Open Studio for Big Data a key integration component of the Hortonworks Data Platform, we are providing Hadoop users with the ability to move data in and out of Hadoop without having to write complex code,” said Eric Baldeschwieler, CTO & co-founder of Hortonworks. “Talend provides the most powerful open source integration solution for enterprise data, and we are thrilled to be working with Talend to provide to the Apache Hadoop community such advanced integration capabilities.”
…..

Availability

Talend Open Studio for Big Data will be available in May 2012. A preview version of the product is available immediately at http://www.talend.com/download-tosbd.

Good news, but we also know that the Hadoop paradigm is evolving: Tuple MapReduce: beyond the classic MapReduce.

Will early adopters of Hadoop be just as willing to migrate as the paradigm develops?

March 7, 2012

JavaScript Console and Excel Coming to Hadoop

Filed under: Excel,Hadoop,Javascript — Patrick Durusau @ 5:42 pm

JavaScript Console and Excel Coming to Hadoop

Alex Popescu (myNoSQL) has pointers to news of Hadoop on Windows Azure, which opens Hadoop up to JavaScript developers and Excel/PowerPivot users.

Alex captures the winning strategy for new technologies when he says:

Think of integration with familiar tools and frameworks as a huge adoption accelerator.

What would it look like to add configurable merging on PowerPivot? (I may have to get a copy of MS Office 2010.)

March 6, 2012

Cloudera Manager | Activity Monitoring & Operational Reports Demo Video

Filed under: Cloud Computing,Cloudera,Hadoop — Patrick Durusau @ 8:10 pm

Cloudera Manager | Activity Monitoring & Operational Reports Demo Video by Jon Zuanich.

From the post:

In this demo video, Philip Zeyliger, a software engineer at Cloudera, discusses the Activity Monitoring and Operational Reports in Cloudera Manager.

Activity Monitoring

The Activity Monitoring feature in Cloudera Manager consolidates all Hadoop cluster activities into a single, real-time view. This capability lets you see who is running what activities on the Hadoop cluster, both at the current time and through historical activity views. Activities are either individual MapReduce jobs or those that are part of larger workflows (via Oozie, Hive or Pig).

Operational Reports

Operational Reports provide a visualization of current and historical disk utilization by user, user groups and directory. In addition, it tracks MapReduce activity on the Hadoop cluster by job, user, group or job ID. These reports are aggregated over selected time periods (hourly, daily, weekly, etc.) and can be exported as XLS or CSV files.

It is a sign of Hadoop’s maturity that professional management interfaces have started to appear.

Hadoop has always been manageable. The question was how to find someone to marry your cluster? And what happened in the case of a divorce?

Professional management tools enable a less intimate relationship between your cluster and its managers. Not to mention the availability of a larger pool of managers for your cluster.

One request: please avoid the default security options on Vimeo videos. They should be embeddable and downloadable in all cases.

February 23, 2012

Extending Hadoop beyond MapReduce

Filed under: Hadoop,MapReduce — Patrick Durusau @ 4:53 pm

Extending Hadoop beyond MapReduce

From the webpage:

Wednesday, March 7, 2012 10:00 am
Pacific Standard Time (San Francisco, GMT-08:00)

Description:

Hortonworks has been developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource management fabric to support MapReduce and other application paradigms such as Graph Processing, MPI etc. High-availability is built-in from the beginning; as are security and multi-tenancy to support multiple users and organizations on large, shared clusters. The new architecture will also increase innovation, agility and hardware utilization. NextGen MapReduce is already available in Hadoop 0.23. Join us for this webcast as we discuss the main architectural highlights of MapReduce and its utility to users and administrators.

I registered to learn more about the recent changes to Hadoop.

But I am also curious if the discussion is going to be “beyond MapReduce” as in the title or “the main architectural highlights of MapReduce” as in the last sentence. Hard to tell from the description.

February 18, 2012

Hadoop and Machine Learning: Present and Future

Filed under: Hadoop,Machine Learning — Patrick Durusau @ 5:26 pm

Hadoop and Machine Learning: Present and Future by Josh Wills.

Presentation at LA Machine Learning.

Josh Wills is Cloudera’s Director of Data Science, working with customers and engineers to develop Hadoop-based solutions across a wide-range of industries. Prior to joining Cloudera, Josh worked at Google, where he worked on the ad auction system and then led the development of the analytics infrastructure used in Google+. Prior to Google, Josh worked at a variety of startups- sometimes as a Software Engineer, and sometimes as a Statistician. He earned his Bachelor’s degree in Mathematics from Duke University and his Master’s in Operations Research from The University of Texas – Austin.

Very practice-oriented view of Hadoop and machine learning. If you aren’t excited about Hadoop and machine learning already, you will be after this presentation!

February 16, 2012

Cascading

Filed under: Cascading,Hadoop,MapReduce — Patrick Durusau @ 7:02 pm

Cascading

Since Cascading got called out today in the graph partitioning posts, I thought it would not hurt to point it out.

From the webpage:

Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. All without having to ‘think’ in MapReduce.

Cascading is a thin Java library and API that sits on top of Hadoop’s MapReduce layer and is executed from the command line like any other Hadoop application.

As a library and API that can be driven from any JVM based language (Jython, JRuby, Groovy, Clojure, etc.), developers can create applications and frameworks that are “operationalized”. That is, a single deployable Jar can be used to encapsulate a series of complex and dynamic processes all driven from the command line or a shell. Instead of using external schedulers to glue many individual applications together with XML against each individual command line interface.

The Cascading API approach dramatically simplifies development, regression and integration testing, and deployment of business critical applications on both Amazon Web Services (like Elastic MapReduce) or on dedicated hardware.
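To give a flavor of “not thinking in MapReduce,” here is the canonical word count written as a Cascading assembly. This is a sketch assuming Cascading 1.x-era class names (the package layout changed in later releases), so check it against the version you actually run.

```java
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

/** Word count as a pipe assembly: no hand-written Mapper or Reducer classes. */
public class CascadingWordCount {
    public static void main(String[] args) {
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink   = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

        Pipe assembly = new Pipe("wordcount");
        // Split each line into words, then group identical words and count them.
        assembly = new Each(assembly, new Fields("line"),
                            new RegexGenerator(new Fields("word"), "\\S+"));
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count(new Fields("count")));

        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, CascadingWordCount.class);
        Flow flow = new FlowConnector(properties).connect(source, sink, assembly);
        flow.complete(); // plans and runs the underlying MapReduce job(s)
    }
}
```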

February 14, 2012

Cloudera Manager | Service and Configuration Management Demo Videos

Filed under: Cloudera,Hadoop,HBase,HDFS,MapReduce — Patrick Durusau @ 5:11 pm

Cloudera Manager | Service and Configuration Management Demo Videos by Jon Zuanich.

From the post:

Service and Configuration Management (Part I & II)

We’ve recently recorded a series of demo videos intended to highlight the extensive set of features and functions included with Cloudera Manager, the industry’s first end-to-end management application for Apache Hadoop. These demo videos showcase the newly enhanced Cloudera Manager interface and reveal how to use this powerful application to simplify the administration of Hadoop clusters, optimize performance and enhance the quality of service.

In the first two videos of this series, Philip Langdale, a software engineer at Cloudera, walks through Cloudera Manager’s Service and Configuration Management module. He demonstrates how simple it is to set up and configure the full range of Hadoop services in CDH (including HDFS, MR and HBase); enable security; perform configuration rollbacks; and add, delete and decommission nodes.

Interesting that Vimeo detects the “embedding” of these videos in my RSS reader and displays a blocked message. At the Cloudera site, all is well.

Management may not be as romantic as the latest graph algorithms but it is a pre-condition to widespread enterprise adoption.

Introducing CDH4

Filed under: Cloudera,Hadoop,HBase,HDFS,MapReduce — Patrick Durusau @ 5:10 pm

Introducing CDH4 by Charles Zedlewski.

From the post:

I’m pleased to inform our users and customers that Cloudera has released its 4th version of Cloudera’s Distribution Including Apache Hadoop (CDH) into beta today. This release combines the input from our enterprise customers, partners and users with the hard work of Cloudera engineering and the larger Apache open source community to create what we believe is a compelling advance for this widely adopted platform.

There are a great many improvements and new capabilities in CDH4 compared to CDH3. Here is a high level list of what’s available for you to test in this first beta release:

  • Availability – a high availability namenode, better job isolation, hard drive failure handling, and multi-version support
  • Utilization – multiple namespaces, co-processors and a slot-less resource management model
  • Performance – improvements in HBase, HDFS, MapReduce and compression performance
  • Usability – broader BI support, expanded API access, unified file formats & compression codecs
  • Security – scheduler ACL’s

Some items of note about this beta:

This is the first beta for CDH4. We plan to do a second beta some weeks after the first beta. The second beta will roll in updates to Apache Flume, Apache Sqoop, Hue, Apache Oozie and Apache Whirr that did not make the first beta. It will also broaden the platform support back out to our normal release matrix of Red Hat, Centos, Suse, Ubuntu and Debian. Our plan is for this second beta to have the last significant component changes before CDH goes GA.

Some CDH components are getting substantial revamps and we have transition plans for these. There is a significantly redesigned MapReduce (aka MR2) with a similar API to the old MapReduce but with new daemons, user interface and more. MR2 is part of CDH4, but we also decided it makes sense to ship with the MapReduce from CDH3 which is widely used, thoroughly debugged and stable. We will support both generations of MapReduce for the life of CDH4, which will allow customers and users to take advantage of all of the new CDH4 features while making the transition to the new MapReduce in a timeframe that makes sense for them.

The only better time to be in data mining, information retrieval, data analysis is next week. 😉

February 13, 2012

Big Data analytics with Hive and iReport

Filed under: Hadoop,Hive,iReport — Patrick Durusau @ 8:19 pm

Big Data analytics with Hive and iReport

From the post:

Each J.J. Abrams’ TV series Person of Interest episode starts with the following narration from Mr. Finch one of the leading characters: “You are being watched. The government has a secret system–a machine that spies on you every hour of every day. I know because…I built it.” Of course us technical people know better. It would take a huge team of electrical and software engineers many years to build such a high performing machine and the budget would be unimaginable… or wouldn’t be? Wait a second we have Hadoop! Now everyone of us can be Mr. Finch for a modest budget thanks to Hadoop.

In JCG article “Hadoop Modes Explained – Standalone, Pseudo Distributed, Distributed” JCG partner Rahul Patodi explained how to setup Hadoop. The Hadoop project has produced a lot of tools for analyzing semi-structured data but Hive is perhaps the most intuitive among them as it allows anyone with an SQL background to submit MapReduce jobs described as SQL queries. Hive can be executed from a command line interface, as well as run in a server mode with a Thrift client acting as a JDBC/ODBC interface giving access to data analysis and reporting applications.

In this article we will set up a Hive Server, create a table, load it with data from a text file and then create a Jasper Resport using iReport. The Jasper Report executes an SQL query on the Hive Server that is then translated to a MapReduce job executed by Hadoop.

Just in case you have ever wanted to play the role of “Big Brother.” 😉

On the other hand, the old adage about a good defense being a good offense may well be true.

Competing with other governments, organizations, companies, agencies or even inside them.
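If you want to poke at the same JDBC path the article wires up for iReport, a few lines of Java will do it. A hedged sketch against a 2012-era Hive server started with “hive --service hiveserver”; the table and query are made up for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/** Runs one HiveQL query over JDBC; Hive turns it into MapReduce jobs behind the scenes. */
public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver"); // HiveServer1-era driver
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        // Hypothetical table of web server hits loaded from a text file.
        ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM access_log GROUP BY page");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        conn.close();
    }
}
```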

February 8, 2012

Calculating In-Degree using R MapReduce over Hadoop

Filed under: Hadoop,MapReduce,R — Patrick Durusau @ 5:12 pm

Calculating In-Degree using R MapReduce over Hadoop

Marko Rodriguez, the source of so many neat graph resources, demonstrates:

How to use an R package for MapReduce over Hadoop to calculate vertex in-degree, concluding with a question: “Can you tell me how to calculate a graph’s degree distribution? — HINT: its this MapReduce job composed with another.”

Never one to allow a question to sit for very long, ;-), Marko supplies the answer and plots the results using R. (By the posting times, which may be horribly incorrect, Marko waited less than an hour before posting the answer. The moral here is that if Marko asks a question, answer early and often.)
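For anyone who wants the answer spelled out in Java rather than R: the second job keys on the in-degree values produced by the first and counts how many vertices share each value. A sketch assuming the first job wrote “vertex, in-degree” as tab-separated lines; the R version in Marko’s post remains the authoritative one.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Second pass: turn (vertex, in-degree) pairs into (in-degree, vertex count). */
public class InDegreeDistribution {

    public static class DegreeMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final IntWritable degree = new IntWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t"); // "vertex <TAB> in-degree"
            if (parts.length == 2) {
                degree.set(Integer.parseInt(parts[1]));
                ctx.write(degree, ONE); // key by the degree value itself
            }
        }
    }

    public static class CountReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        @Override
        protected void reduce(IntWritable degree, Iterable<IntWritable> ones, Context ctx)
                throws IOException, InterruptedException {
            int vertices = 0;
            for (IntWritable one : ones) vertices += one.get();
            ctx.write(degree, new IntWritable(vertices)); // how many vertices have this in-degree
        }
    }
}
```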

February 4, 2012

MapReduce Patterns, Algorithms and Use Cases

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:32 pm

MapReduce Patterns, Algorithms and Use Cases

Ilya Katsov writes:

In this article I digested a number of MapReduce patterns and algorithms to give a systematic view of the different techniques that can be found in the web or scientific articles. Several practical case studies are also provided. All descriptions and code snippets use the standard Hadoop’s MapReduce model with Mappers, Reduces, Combiners, Partitioners, and sorting. This framework is depicted in the figure below.

An extensive list of MapReduce patterns and algorithms, complete with references at the end!

Forward to anyone interested in MapReduce.

February 3, 2012

Karmasphere Studio Community Edition

Filed under: Hadoop,Hive,Karmasphere — Patrick Durusau @ 4:52 pm

Karmasphere Studio Community Edition

From the webpage:

Karmasphere Studio Community Edition is the free edition of our graphical development environment that facilitates learning Hadoop MapReduce jobs. It supports the prototyping, developing, and testing phases of the Hadoop development lifecycle.

The parallel and parameterized queries features in their Analyst product attracted me to the site:

From the webpage:

According to Karmasphere, the updated version of Analyst offers a parallel query capability that they say will make it faster for data analysts to iteratively query their data and create visualizations. The company claims that the new update allows data analysts to submit queries, view results, submit a new set and then compare those results across the previous outputs. In essence, this means users can run an unlimited number of queries concurrently on Hadoop so that one or more data sets can be viewed while the others are being generated.

Karmasphere also says that the introduction of parameterized queries allows users to submit their queries as they go, while offering them output in easy-to-read graphical representations of the findings, in Excel spreadsheets, or across a number of other outside reporting tools.

Hey, it says “…in Excel spreadsheets,” do you think they are reading my blog? (Spreadsheet -> Topic Maps: Wrong Direction? 😉 I didn’t really think so either.) I do take that as validation of the idea that offering users a familiar interface is more likely to be successful than an unfamiliar one.

January 30, 2012

Big Data is More Than Hadoop

Filed under: BigData,Hadoop,Marketing,Topic Maps — Patrick Durusau @ 8:01 pm

Big Data is More Than Hadoop by David Menninger.

From the post:

We recently published the results of our benchmark research on Big Data to complement the previously published benchmark research on Hadoop and Information Management. Ventana Research undertook this research to acquire real-world information about levels of maturity, trends and best practices in organizations’ use of large-scale data management systems now commonly called Big Data. The results are illuminating.

Volume, velocity and variety of data (the so-called three V’s) are often cited as characteristics of big data. Our research offers insight into each of these three categories. Regarding volume, over half the participating organizations process more than 10 terabytes of data, and 10% process more than 1 petabyte of data. In terms of velocity, 30% are producing more than 100 gigabytes of data per day. In terms of the variety of data, the most common types of big data are structured, containing information about customers and transactions.

However, one-third (31%) of participants are working with large amounts of unstructured data. Of the three V’s, nine out of 10 participants rate scalability and performance as the most important evaluation criteria, suggesting that volume and velocity of big data are more important concerns than variety.

This research shows that big data is not a single thing with one uniform set of requirements. Hadoop, a well-publicized technology for dealing with big data, gets a lot of attention (including from me), but there are other technologies being used to store and analyze big data.

Interesting work but especially for what the enterprises surveyed are missing about Big Data.

When I read “Volume, velocity and variety of data (the so-called three V’s) are often cited as characteristics of big data.” I was thinking that “variety” meant the varying semantics of the data. As is natural when collecting data from a variety of sources.

Nope. Completely off-base. “Variety” in the three V’s, at least for Ventana Research, means:

The data being analyzed consists of a variety of data types. Rapidly increasing unstructured data and social media receive much of the attention in the big-data market, and the research shows these types of data are common among Hadoop users.

While the Ventana work is useful, at least for the variety leg of the Big Data stool, you will be better off with Edd Dumbill’s What is Big Data? where he points out for variety:

A common use of big data processing is to take unstructured data and extract ordered meaning, for consumption either by humans or as a structured input to an application. One such example is entity resolution, the process of determining exactly what a name refers to. Is this city London, England, or London, Texas? By the time your business logic gets to it, you don’t want to be guessing.

While data type variety is an issue, it isn’t one that is difficult to solve. Semantic variety, on the other hand, is an issue that keeps on giving.

I think the promotional question for topic maps with regard to Big Data is: Do you still like the answer you got yesterday?

Topic maps can not only keep the question you asked yesterday and its answer, but also the new question you want to ask today (and its answer). (Try that with fixed schemas.)

January 29, 2012

HadoopDB: Efficient Processing of Data Warehousing Queries in a Split Execution Environment

Filed under: Hadapt,Hadoop — Patrick Durusau @ 9:14 pm

HadoopDB: Efficient Processing of Data Warehousing Queries in a Split Execution Environment

From the post:

The buzz about Hadapt and HadoopDB has been around for a while now as it is one of the first systems to combine ideas from two different approaches, namely parallel databases based on a shared-nothing architecture and map-reduce, to address the problem of large scale data storage and analysis.

This early paper that introduced HadoopDB crisply summarizes some reasons why parallel database solutions haven’t scaled to hundreds of machines. The reasons include:

  1. As the number of nodes in a system increases, failures become more common.
  2. Parallel databases usually assume a homogeneous array of machines, which becomes impractical as the number of machines rises.
  3. They have not been tested at larger scales, as applications haven’t demanded more than tens of nodes for performance until recently.

Interesting material to follow on the HPCC vs. Hadoop post.

Not to take sides, just the beginning of the type of analysis that will be required.

HPCC vs Hadoop

Filed under: Hadoop,HPCC — Patrick Durusau @ 9:12 pm

HPCC vs Hadoop

Four factors are said to distinguish HPCC from Hadoop:

  • Enterprise Control Language (ECL)
  • Beyond MapReduce
  • Roxie Delivery Engine
  • Enterprise Ready

After viewing these summaries you may feel like you lack information on which to base a choice between these two.

So you follow: Detailed Comparison of HPCC vs. Hadoop.

I’m afraid you are going to be disappointed there as well.

Not enough information to make an investment choice in an enterprise context in favor of either HPCC or Hadoop.

Do you have pointers to meaningful comparisons of these two platforms?

Or perhaps suggestions for what would make a meaningful comparison?

Are there features of HPCC that Hadoop should emulate?

January 28, 2012

Microsoft’s plan for Hadoop and big data

Filed under: BigData,Hadoop,Microsoft — Patrick Durusau @ 10:54 pm

Microsoft’s plan for Hadoop and big data by Edd Dumbill.

From the post:

Microsoft has placed Apache Hadoop at the core of its big data strategy. It’s a move that might seem surprising to the casual observer, being a somewhat enthusiastic adoption of a significant open source product.

The reason for this move is that Hadoop, by its sheer popularity, has become the de facto standard for distributed data crunching. By embracing Hadoop, Microsoft allows its customers to access the rapidly-growing Hadoop ecosystem and take advantage of a growing talent pool of Hadoop-savvy developers.

Microsoft’s goals go beyond integrating Hadoop into Windows. It intends to contribute the adaptions it makes back to the Apache Hadoop project, so that anybody can run a purely open source Hadoop on Windows.

If MS is taking the data integration road, isn’t that something your company needs to be thinking about?

There is all that data diversity that Hadoop processing is going to uncover, but I have some suggestions about that issue. 😉

Nothing but good can come of MS using Hadoop as an integration data appliance. MS customers will benefit and parts of MS won’t have to worry about stepping on each other. A natural outcome of hard coding into formats. But that is an issue for another day.

About the Performance of Map Reduce Jobs

Filed under: Amazon Web Services AWS,Hadoop,MapReduce — Patrick Durusau @ 10:53 pm

About the Performance of Map Reduce Jobs by Michael Kopp.

From the post:

One of the big topics in the BigData community is Map/Reduce. There are a lot of good blogs that explain what Map/Reduce does and how it works logically, so I won’t repeat it (look here, here and here for a few). Very few of them however explain the technical flow of things, which I at least need, to understand the performance implications. You can always throw more hardware at a map reduce job to improve the overall time. I don’t like that as a general solution and many Map/Reduce programs can be optimized quite easily, if you know what too look for. And optimizing a large map/reduce jobs can be instantly translated into ROI!

The Word Count Example

I went over some blogs and tutorials about performance of Map/Reduce. Here is one that I liked. While there are a lot of good tips out there, none, except the one mentioned, talk about the Map/Reduce program itself. Most dive right into the various hadoop options to improve distribution and utilization. While this is important, I think we should start the actual problem we try to solve, that means the Map/Reduce Job.

To make things simple I am using Amazons Elastic Map Reduce. In my setup I started a new Job Flow with multiple steps for every execution. The Job Flow consisted of one master node and two task nodes. All of them were using the Small Standard instance.

While AWS Elastic Map/Reduce has its drawbacks in terms of startup and file latency (Amazon S3 has a high volatility), it is a very easy and consistent way to execute Map/Reduce jobs without needing to setup your own hadoop cluster. And you only pay for what you need! I started out with the word count example that you see in every map reduce documentation, tutorial or Blog.

Yet another reason (other than avoiding outright failure) for testing your Map/Reduce jobs locally before running them in a pay-for-use environment. The better you understand the job and its requirements, the more likely you are to create an effective and cost-efficient solution.
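As a concrete example of the job-level tuning Michael is after (my sketch, not code from his article): in the word count case, registering a combiner and flipping to the local runner for a smoke test are both one-liners in the driver. WordCountMapper and WordCountReducer are hypothetical class names standing in for your own implementations.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Driver sketch: cut shuffle volume with a combiner; smoke-test locally before EMR. */
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Uncomment for a quick local run (Hadoop 0.20-era property names):
        // conf.set("mapred.job.tracker", "local");
        // conf.set("fs.default.name", "file:///");

        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // hypothetical mapper
        job.setCombinerClass(WordCountReducer.class); // pre-aggregate on the map side
        job.setReducerClass(WordCountReducer.class);  // hypothetical reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```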

Cloudera’s Hadoop Demo VM

Filed under: Hadoop — Patrick Durusau @ 7:31 pm

Cloudera’s Hadoop Demo VM

Cloudera has made Hadoop demo packages for VMware, KVM and VirtualBox.

I tried two of them this weekend and would not recommend either one of them.

1. VMware – After uncompressing the image, it loads and runs easily enough (remember to up the RAM to 2 GB). The only problem comes when you try to run the Hadoop Tutorial as suggested at the image page. The path is wrong in the tutorial for the current release. Rather than 0.20.2-cdh3u1, it should read (in full) /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u2-core.jar.

There are other path/directory issues, such as /usr/joe. Nowhere to be seen in this demo release.

BTW, the xterm defaults to a nearly unreadable blue color for directories. If you try to reset it, you will think there is no change. Try “xterm” to spawn a new xterm window. Your changes will appear there. Think about it for a minute and it will make sense.

2. VirtualBox – Crashes every time I run it. I have three other VMs that work just fine so I suspect it isn’t my setup.

Not encouraging and rather disappointing.

I normally just install Cloudera releases on my Ubuntu box and have always been pleased with the results. Hence the expectation of a good experience with the Demo VM’s.

Demo VM’s are supposed to entice users to experience the full power of Hadoop, not drive them away.

I would either fix the Demo VM’s to work properly and have pre-installed directories and resources to illustrate the use of Hadoop or stop distributing them.

Just a little QA goes a long way.

January 27, 2012

Seismic Data Science: Reflection Seismology and Hadoop

Filed under: Hadoop,Science — Patrick Durusau @ 4:32 pm

Seismic Data Science: Reflection Seismology and Hadoop by Josh Wills.

From the post:

When most people first hear about data science, it’s usually in the context of how prominent web companies work with very large data sets in order to predict clickthrough rates, make personalized recommendations, or analyze UI experiments. The solutions to these problems require expertise with statistics and machine learning, and so there is a general perception that data science is intimately tied to these fields. However, in my conversations at academic conferences and with Cloudera customers, I have found that many kinds of scientists– such as astronomers, geneticists, and geophysicists– are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn.

A nice overview of areas of science that were using “big data” decades before the current flurry of activity. The use of Hadoop in reflection seismology is just one fully worked example.

The takeaway I have from this post is that Hadoop skills are going to be in demand across business, science and, one would hope, the humanities.

Building a Scalable Web Crawler with Hadoop

Filed under: Hadoop,Webcrawler — Patrick Durusau @ 4:31 pm

Building a Scalable Web Crawler with Hadoop

Ahad Rana of Common Crawl presents an architectural view of a web crawler based on Hadoop.

You can access the data from Common Crawl.

But the architecture notes may be useful if you decide to crawl a sub-part of the web and/or you need to crawl “deep web” data in your organization.

January 26, 2012

Measuring User Retention with Hadoop and Hive

Filed under: Hadoop,Hive — Patrick Durusau @ 6:53 pm

Measuring User Retention with Hadoop and Hive by Daniel Russo.

From the post:

The Hadoop ecosystem is comprised of numerous technologies that can work together to provide a powerful and scalable mechanism for analyzing and deriving insight from large quantities of data.

In an effort to showcase the flexibility and raw power of queries that can be performed over large datasets stored in Hadoop, this post is written to demonstrate an example use case. The specific goal is to produce data related to user retention, an important metric for all product companies to analyze and understand.

Compelling demonstration of the power of Hadoop and Hive to measure raw user retention, in an “app” situation.

Question:

User retention isn’t a new issue; does anyone know what strategies were used to measure it before Hadoop and Hive?

The reason I ask is that prior analysis of user retention may point the way towards data or relationships it wasn’t possible to capture before.

For example, when an app falls into non-use or is uninstalled, what impact (if any) does that have on known “friends” and their use of the app?

Are there any patterns to non-use/uninstalls over short or long periods of time in identifiable groups? (A social behavior type question.)

January 25, 2012

Berlin Buzzwords 2012

Filed under: BigData,Conferences,ElasticSearch,Hadoop,HBase,Lucene,MongoDB,Solr — Patrick Durusau @ 3:24 pm

Berlin Buzzwords 2012

Important Dates (all dates in GMT +2)

Submission deadline: March 11th 2012, 23:59 MEZ
Notification of accepted speakers: April 6th, 2012, MEZ
Publication of final schedule: April 13th, 2012
Conference: June 4/5. 2012

The call:

Call for Submission Berlin Buzzwords 2012 – Search, Store, Scale — June 4 / 5. 2012

The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

  • IR / Search – Lucene, Solr, katta, ElasticSearch or comparable solutions
  • NoSQL – like CouchDB, MongoDB, Jackrabbit, HBase and others
  • Large Data Processing – Hadoop itself, MapReduce, Cascading or Pig and relatives

Related topics not explicitly listed above are more than welcome. We are looking for presentations on the implementation of the systems themselves, technical talks, real world applications and case studies.

…(moved dates to top)…

High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters.

Here is your chance to experience summer in Berlin (Berlin Buzzwords 2012) and in Montreal (Balisage).

Seriously, both conferences are very strong and worth your attention.

January 24, 2012

LDIF – Linked Data Integration Framework (0.4)

Filed under: Hadoop,Heterogeneous Data,LDIF,Linked Data — Patrick Durusau @ 3:43 pm

LDIF – Linked Data Integration Framework (0.4)

Version 0.4 News:

Up till now, LDIF stored data purely in-memory which restricted the amount of data that could be processed. Version 0.4 provides two alternative implementations of the LDIF runtime environment which allow LDIF to scale to large data sets: 1. The new triple store backed implementation scales to larger data sets on a single machine with lower memory consumption at the expense of processing time. 2. The new Hadoop-based implementation provides for processing very large data sets on a Hadoop cluster, for instance within Amazon EC2. A comparison of the performance of all three implementations of the runtime environment is found on the LDIF benchmark page.

From the “About LDIF:”

The Web of Linked Data grows rapidly and contains data from a wide range of different domains, including life science data, geographic data, government data, library and media data, as well as cross-domain data sets such as DBpedia or Freebase. Linked Data applications that want to consume data from this global data space face the challenges that:

  1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.

This usage of different vocabularies as well as the usage of URI aliases makes it very cumbersome for an application developer to write SPARQL queries against Web data which originates from multiple sources. In order to ease using Web data in the application context, it is thus advisable to translate data to a single target vocabulary (vocabulary mapping) and to replace URI aliases with a single target URI on the client side (identity resolution), before starting to ask SPARQL queries against the data.

Up-till-now, there have not been any integrated tools that help application developers with these tasks. With LDIF, we try to fill this gap and provide an an open-source Linked Data Integration Framework that can be used by Linked Data applications to translate Web data and normalize URI while keeping track of data provenance.

With the addition of Hadoop based processing, definitely worth your time to download and see what you think of it.

Ironic that the problem it solves:

  1. data sources use a wide range of different RDF vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources.

already existed, prior to Linked Data as:

  1. data sources use a wide range of different vocabularies to represent data about the same type of entity.
  2. the same real-world entity, for instance a person or a place, is identified differently within different data sources.

So the Linked Data drill is to convert data, which already has these problems, into Linked Data, which will still have these problems, and then solve the problem of differing identifications.

Yes?

Did I miss a step?

January 19, 2012

All Your HBase Are Belong to Clojure

Filed under: Clojure,Hadoop,HBase — Patrick Durusau @ 7:41 pm

All Your HBase Are Belong to Clojure.

I’m sure you’ve heard a variation on this story before…

So I have this web crawler and it generates these super-detailed log files, which is great ‘cause then we know what it’s doing but it’s also kind of bad ‘cause when someone wants to know why the crawler did this thing but not that thing I have, like, literally gajigabytes of log files and I’m using grep and awk and, well, it’s not working out. Plus what we really want is a nice web application the client can use.

I’ve never really had a good solution for this. One time I crammed this data into a big Lucene index and slapped a web interface on it. One time I turned the data into JSON and pushed it into CouchDB and slapped a web interface on that. Neither solution left me with a great feeling although both worked okay at the time.

This time I already had a Hadoop cluster up and running, I didn’t have any experience with HBase but it looked interesting. After hunting around the internet, thought this might be the solution I had been seeking. Indeed, loading the data into HBase was fairly straightforward and HBase has been very responsive. I mean, very responsive now that I’ve structured my data in such a way that HBase can be responsive.

And that’s the thing: if you are loading literally gajigabytes of data into HBase you need to be pretty sure that it’s going to be able to answer your questions in a reasonable amount of time. Simply cramming it in there probably won’t work (indeed, that approach probably won’t work great for anything). I loaded and re-loaded a test set of twenty thousand rows until I had something that worked.

An excellent tutorial on Hadoop, HBase and Clojure!

First seen at myNoSQL but the URL is no longer working in my Google Reader.
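For readers who reach for Java rather than Clojure, the same write path through the stock HBase client (0.92-era API) looks roughly like this. The table name, column family and row-key scheme are my invention for illustration; the point, as in the post, is that the row key carries the query design.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Writes one crawler-log event; the composite row key is what keeps later scans cheap. */
public class CrawlLogWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from classpath
        HTable table = new HTable(conf, "crawl_log");     // hypothetical table name
        // Key by reversed host plus timestamp so one site's history sits contiguously.
        String rowKey = "com.example.www/" + System.currentTimeMillis();
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("event"), Bytes.toBytes("status"), Bytes.toBytes("200"));
        put.add(Bytes.toBytes("event"), Bytes.toBytes("url"),
                Bytes.toBytes("http://www.example.com/"));
        table.put(put);
        table.close();
    }
}
```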

January 18, 2012

Hadoop World 2011 Videos and Slides Available

Filed under: Cloudera,Conferences,Hadoop — Patrick Durusau @ 7:51 pm

Hadoop World 2011 Videos and Slides Available

From the post:

Last November in New York City, Hadoop World, the largest conference of Apache Hadoop practitioners, developers, business executives, industry luminaries and innovative companies took place. The enthusiasm for the possibilities in Big Data management and analytics with Hadoop was palpable across the conference. Cloudera CEO, Mike Olson, eloquently summarizes Hadoop World 2011 in these final remarks.

Those who attended Hadoop World know how difficult navigating a route between two days of five parallel tracks of compelling content can be—particularly since Hadoop World 2011 consisted of sixty-five informative sessions about Hadoop. Understanding that it is nearly impossible to obtain and/or retain all the valuable information shared live at the event, we have compiled all the Hadoop World presentation slides and videos for perusing, sharing and for reference at your convenience. You can turn to these resources for technical Hadoop help and real-world production Hadoop examples, as well as information about advanced data science analytics.

Comments if you attended, or suggestions on which ones to watch first?
