Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 2, 2012

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part III

Filed under: Bioinformatics,Biomedical,Hadoop,MapReduce — Patrick Durusau @ 9:23 pm

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part III by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the post:

Up to this point, we’ve described our reasons for using Hadoop and Hive on our neural recordings (Part I), the reasons why the analyses of these recordings are interesting from a scientific perspective, and detailed descriptions of our implementation of these analyses using Hadoop and Hive (Part II). The last part of this story cuts straight to the results and then discusses important lessons we learned along the way and future goals for improving the analysis framework we’ve built so far.

Biomedical researchers will be interested in the results but I am more interested in the observation that Hadoop makes it possible to retain results for ad hoc analysis.

August 1, 2012

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part II

Filed under: Bioinformatics,Biomedical,Hadoop,Signal Processing — Patrick Durusau @ 7:19 pm

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part II by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the post:

As mentioned in Part I, although Hadoop and other Big Data technologies are typically applied to I/O intensive workloads, where parallel data channels dramatically increase I/O throughput, there is growing interest in applying these technologies to CPU intensive workloads. In this work, we used Hadoop and Hive to digitally signal process individual neuron voltage signals captured from electrodes embedded in the rat brain. Previously, this processing was performed on a single Matlab workstation, a workload that was both CPU intensive and data intensive, especially for intermediate output data. With Hadoop/Hive, we were not only able to apply parallelism to the various processing steps, but had the additional benefit of having all the data online for additional ad hoc analysis. Here, we describe the technical details of our implementation, including the biological relevance of the neural signals and analysis parameters. In Part III, we will then describe the tradeoffs between the Matlab and Hadoop/Hive approach, performance results, and several issues identified with using Hadoop/Hive in this type of application.

Details of the setup for processing rat brain signals with Hadoop.

Looking back, I did not see any mention of the data sets. Perhaps in Part III?
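
To picture the kind of per-channel spectral step the series describes, here is a minimal sketch written as a Hadoop Streaming reducer in Python with numpy. It is not the authors' implementation (the series describes a Hadoop/Hive pipeline); the tab-separated field layout and the sampling rate are assumptions made for illustration only.

    #!/usr/bin/env python
    # Reducer sketch: compute a power spectrum per electrode channel.
    # Input lines, grouped by Hadoop on the first field:
    #   channel_id <tab> timestamp <tab> voltage
    # Not the authors' Hive-based implementation; the field layout and
    # sampling rate are assumptions for illustration.
    import sys
    import numpy as np

    SAMPLING_RATE_HZ = 2000.0  # assumed; use the recording system's actual rate

    def emit_spectrum(channel, samples):
        if not samples:
            return
        samples.sort()                                  # order by timestamp
        signal = np.array([v for _, v in samples])
        power = np.abs(np.fft.rfft(signal)) ** 2        # power spectrum
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / SAMPLING_RATE_HZ)
        for f, p in zip(freqs, power):
            print("%s\t%.2f\t%.6f" % (channel, f, p))

    current, samples = None, []
    for line in sys.stdin:
        channel, ts, voltage = line.rstrip("\n").split("\t")
        if channel != current and current is not None:
            emit_spectrum(current, samples)
            samples = []
        current = channel
        samples.append((float(ts), float(voltage)))
    emit_spectrum(current, samples)

In a real job the mapper would key each sample by channel (and probably by time window), and the frequency bands of interest would be aggregated downstream.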

July 31, 2012

Processing Rat Brain Neuronal Signals Using A Hadoop Computing Cluster – Part I

Filed under: Bioinformatics,Biomedical,Hadoop,Signal Processing — Patrick Durusau @ 4:54 pm

Processing Rat Brain Neuronal Signals Using A Hadoop Computing Cluster – Part I by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the introduction:

In this three-part series of posts, we will share our experiences tackling a scientific computing challenge that may serve as a useful practical example for those readers considering Hadoop and Hive as an option to meet their growing technical and scientific computing needs. This first part describes some of the background behind our application and the advantages of Hadoop that make it an attractive framework in which to implement our solution. Part II dives into the technical details of the data we aimed to analyze and of our solution. Finally, we wrap up this series in Part III with a description of some of our main results, and most importantly perhaps, a list of things we learned along the way, as well as future possibilities for improvements.

And:

Problem Statement

Prior to starting this work, Jadin had data previously gathered by himself and from neuroscience researchers who are interested in the role of the brain region called the hippocampus. In both rats and humans, this region is responsible for both spatial processing and memory storage and retrieval. For example, as a rat runs a maze, neurons in the hippocampus, each representing a point in space, fire in sequence. When the rat revisits a path, and pauses to make decisions about how to proceed, those same neurons fire in similar sequences as the rat considers the previous consequences of taking one path versus another. In addition to this binary-like firing of neurons, brain waves, produced by ensembles of neurons, are present in different frequency bands. These act somewhat like clock signals, and the phase relationships of these signals correlate to specific brain signal pathways that provide input to this sub-region of the hippocampus.

The goal of the underlying neuroscience research is to correlate the physical state of the rat with specific characteristics of the signals coming from the neural circuitry in the hippocampus. Those signal differences reflect the origin of signals to the hippocampus. Signals that arise within the hippocampus indicate actions based on memory input, such as reencountering previously encountered situations. Signals that arise outside the hippocampus correspond to other cognitive processing. In this work, we digitally signal process the individual neuronal signal output and turn it into spectral information related to the brain region of origin for the signal input.

If this doesn’t sound like a topic map related problem on your first read, what would you call the “…brain region of origin for the signal input[?]”

That is if you wanted to say something about it. Or wanted to associate information, oh, I don’t know, captured from a signal processing application with it?

Hmmm, that’s what I thought too.

Besides, it is a good opportunity for you to exercise your Hadoop skills. Never a bad thing to work on the unfamiliar.

July 28, 2012

The Coming Majority: Mainstream Adoption and Entrepreneurship [Cloud Gift Certificates?]

Filed under: Cloud Computing,Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 6:22 pm

The Coming Majority: Mainstream Adoption and Entrepreneurship by James Locus.

From the post:

Small companies, big data.

Big data is sometimes at odds with the business-savvy entrepreneur who wants to exploit its full potential. In essence, the business potential of big data is the massive (but promising) elephant in the room that remains invisible because the available talent necessary to take full advantage of the technology is difficult to obtain.

Inventing new technology for the platform is critical, but so too is making it easier to use.

The future of big data may not be a technological breakthrough by a select core of contributing engineers, but rather a platform that allows common, non-PhD holding entrepreneurs and developers to innovate. Some incredible progress has been made in Apache Hadoop with Hortonworks’ HDP (Hortonworks Data Platform) in minimizing the installation process required for full implementation. Further, the improved MapReduce v2 framework also greatly lowers the risk of adoption for businesses by expressly creating features designed to increase efficiency and usability (e.g. backward and forward compatibility). Finally, with HCatalog, the platform is opened up to integrate with new and existing enterprise applications.

What kinds of opportunities lie ahead when more barriers are eliminated?

You really do need a local installation of Hadoop for experimenting.

But at the same time, having a minimal cloud account where you can whistle up some serious computing power isn’t a bad idea either.

That would make an interesting “back to school” or “holiday present for your favorite geek” sort of gift: a “gift certificate” for so many hours/cycles a month on a cloud platform.

BTW, what projects would you undertake if barriers of access and capacity were diminished if not removed?

July 26, 2012

Understanding Apache Hadoop’s Capacity Scheduler

Filed under: Clustering (servers),Hadoop,MapReduce — Patrick Durusau @ 10:43 am

Understanding Apache Hadoop’s Capacity Scheduler by Arun Murthy

From the post:

As organizations continue to ramp the number of MapReduce jobs processed in their Hadoop clusters, we often get questions about how best to share clusters. I wanted to take the opportunity to explain the role of Capacity Scheduler, including covering a few common use cases.

Let me start by stating the underlying challenge that led to the development of Capacity Scheduler and similar approaches.

As organizations become more savvy with Apache Hadoop MapReduce and as their deployments mature, there is a significant pull towards consolidation of Hadoop clusters into a small number of decently sized, shared clusters. This is driven by the urge to consolidate data in HDFS, allow ever-larger processing via MapReduce and reduce operational costs & complexity of managing multiple small clusters. It is quite common today for multiple sub-organizations within a single parent organization to pool together Hadoop/IT budgets to deploy and manage shared Hadoop clusters.

Initially, Apache Hadoop MapReduce supported a simple first-in-first-out (FIFO) job scheduler that was insufficient to address the above use case.

Enter the Capacity Scheduler.

Shared Hadoop clusters?

So long as we don’t have to drop off our punch cards at the shared Hadoop cluster computing center I suppose that’s ok.

😉

Just teasing.

Shared Hadoop clusters are more cost-effective and make better use of your Hadoop specialists.

Why we build our platform on HDFS

Filed under: Cloudera,Hadoop,HDFS — Patrick Durusau @ 10:16 am

Why we build our platform on HDFS by Charles Zedlewski

Charles Zedlewski pushes the number of HDFS alternatives up to twelve:

It’s not often the case that I have a chance to concur with my colleague E14 over at Hortonworks but his recent blog post gave the perfect opportunity. I wanted to build on a few of E14’s points and add some of my own.

A recent GigaOm article presented 8 alternatives to HDFS. They actually missed at least 4 others. For over a year, Parascale marketed itself as an HDFS alternative (until it became an asset sale to Hitachi). Appistry continues to market its HDFS alternative. I’m not sure if it’s released yet but it is very evident that Symantec’s Veritas unit is proposing its Clustered Filesystem (CFS) as an alternative to HDFS as well. HP Ibrix has also supported the HDFS API for some years now.

The GigaOm article implies that the presence of twelve other vendors promoting alternatives must speak to some deficiencies in HDFS for what else would motivate so many offerings? This really draws the incorrect conclusion. I would ask this:

What can we conclude from the fact that there are: …

Best links I have for HDFS alternatives (for your convenience and additions):

  1. Appistry
  2. Cassandra (DataStax)
  3. Ceph (Inktank)
  4. Clustered Filesystem (CFS)
  5. Dispersed Storage Network (Cleversafe)
  6. GPFS (IBM)
  7. Ibrix
  8. Isilon (EMC)
  9. Lustre
  10. MapR File System
  11. NetApp Open Solution for Hadoop
  12. Parascale

July 25, 2012

Thinking about the HDFS vs. Other Storage Technologies

Filed under: Hadoop,HDFS,Hortonworks — Patrick Durusau @ 3:11 pm

Thinking about the HDFS vs. Other Storage Technologies by Eric Baldeschwieler.

Just to whet your interest (see Eric’s post for the details):

As Apache Hadoop has risen in visibility and ubiquity we’ve seen a lot of other technologies and vendors put forth as replacements for some or all of the Hadoop stack. Recently, GigaOM listed eight technologies that can be used to replace HDFS (Hadoop Distributed File System) in some use cases. HDFS is not without flaws, but I predict a rosy future for HDFS. Here is why…

To compare HDFS to other technologies one must first ask the question, what is HDFS good at:

  • Extreme low cost per byte….
  • Very high bandwidth to support MapReduce workloads….
  • Rock solid data reliability….

A lively storage competition is a good thing.

A good opportunity to experiment with different storage strategies.

July 16, 2012

Experimenting with MapReduce 2.0

Filed under: Hadoop,MapReduce 2.0 — Patrick Durusau @ 4:23 pm

Experimenting with MapReduce 2.0 by Ahmed Radwan.

In Building and Deploying MR2, we presented a brief introduction to MapReduce in Hadoop 0.23 and focused on the steps to setup a single-node cluster. In MapReduce 2.0 in Hadoop 0.23, we discussed the new architectural aspects of the MapReduce 2.0 design. This blog post highlights the main issues to consider when migrating from MapReduce 1.0 to MapReduce 2.0. Note that both MapReduce 1.0 and MapReduce 2.0 are included in CDH4.

It is important to note that, at the time of writing this blog post, MapReduce 2.0 is still Alpha, and it is not recommended to use it in production.

In the rest of this post, we shall first discuss the Client API, followed by configurations and testing considerations, and finally commenting on the new changes related to the Job History Server and Web Servlets. We will use the terms MR1 and MR2 to refer to MapReduce in Hadoop 1.0 and Hadoop 2.0, respectively.

How long MapReduce 2.0 remains in alpha is anyone’s guess. I suggest we start learning about it before that status passes.

Happy Birthday Hortonworks!

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 2:04 pm

Happy Birthday Hortonworks! by Eric Baldeschwieler.

From the post:

Last week was an important milestone for Hortonworks: our one year anniversary. Given all of the activity around Apache Hadoop and Hortonworks, it’s hard to believe it’s only been one year. In honor of our birthday, I thought I would look back to contrast our original intentions with what we delivered over the past year.

Hortonworks was officially announced at Hadoop Summit 2011. At that time, I published a blog on the Hortonworks Manifesto. This blog told our story, including where we came from, what motivated the original founders and what our plans were for the company. I wanted to address many of the important statements from this blog here:

Read the post in full to see Eric’s take on:

Hortonworks was formed to “accelerate the development and adoption of Apache Hadoop”. …

We are “committed to open source” and commit that “all core code will remain open source”. …

We will “make Apache Hadoop easier to install, manage and use”. …

We will “make Apache Hadoop more robust”. …

We will “make Apache Hadoop easier to integrate and extend”. …

We will “deliver an ever-increasing array of services aimed at improving the Hadoop experience and support in the growing needs of enterprises, systems integrators and technology vendors”. …

This has been a banner year for Hortonworks, the Hadoop ecosystem and everyone concerned with this rapidly developing area!

We are looking forward to the next year being more of the same, except more so!

July 13, 2012

Hadoop: A Powerful Weapon for Retailers

Filed under: Analytics,Data Science,Hadoop — Patrick Durusau @ 4:15 pm

Hadoop: A Powerful Weapon for Retailers

From the post:

With big data basking in the limelight, it is no surprise that large retailers have been closely watching its development… and more power to them! By learning to effectively utilize big data, retailers can significantly mold the market to their advantage, making themselves more competitive and increasing the likelihood that they will come out on top as a successful retailer. Now that there are open source analytical platforms like Hadoop, which allow for unstructured data to be transformed and organized, large retailers are able to make smart business decisions using the information they collect about customers’ habits, preferences, and needs.

As IT industry analyst Jeff Kelly explained on Wikibon, “Big Data combined with sophisticated business analytics have the potential to give enterprises unprecedented insights into customer behavior and volatile market conditions, allowing them to make data-driven business decisions faster and more effectively than the competition.” Predicting what customers want to buy, without a doubt, affects how many products they want to buy (especially if retailers add on a few of those wonderful customer discounts). Not only will big data analytics prove financially beneficial, it will also present the opportunity for customers to have a more individualized shopping experience.

This all sounds very promising but the difficulty lies in the fact that there are many channels in the consumer business now, such as online, in-store, call centers, mobile, social, etc., each with its own target-marketing advantage. In order for retailers to thrive in the market, they must learn to manage and hone in on all (or at least most) of these facets of business, which can be difficult if you keep in mind the amount of data that each channel generates. Sam Sliman, president at Optimal Solutions Integration, summarizes it perfectly: “Transparency rules the day. Inconsistency turns customers away. Retailer missteps can be glaring and costly.” By making fast market decisions, retailers can increase sales, win and maintain customers, improve margins, and boost market share, but this can really only be done with the right business analytics tools.

Interesting but I disagree with “…but the difficulty lies in the fact that there are many channels in the consumer business now, such as online, in-store, call centers, mobile, social, etc., each with its own target-marketing advantage.”

That can be a difficulty, if you are not technically capable of effectively using information from different channels.

But there is a more fundamental difficulty. Having the capacity to use multiple channels of information is no guarantee of effective use of those channels of information.

You could buy your programming department a Cray supercomputer but that doesn’t mean they can make good use of it.

The same is true for collecting “big data” or having the software to process it.

The real difficulty is the shortage of analytical skills to explore and exploit data. Computers and software can enhance but not create those analytical skills.

Analytical skills are powerful weapons for retailers.

July 10, 2012

The Hadoop Ecosystem, Visualized in Datameer

Filed under: Cloudera,Datameer,Hadoop,Visualization — Patrick Durusau @ 8:28 am

The Hadoop Ecosystem, Visualized in Datameer by Rich Taylor.

From the post:

In our last post, Christophe explained why Datameer uses D3.js to power our Business Infographic™ designer. I thought I would follow up his post showing how we visualized the Hadoop ecosystem connections. First using only D3.js, and second using Datameer 2.0.

Visualizations of the Hadoop Ecosystem are colorful, amusing, instructive, but probably not useful per se.

What is useful is the demonstration that using Datameer 2.0 can drastically reduce the time required to make a visualization.

That leaves you with more time to explore and find visualizations that are actually useful, as opposed to visualizations for the sake of visualization.

We can all think of network (“hairball” was the technical term used in a paper I read recently) visualizations that would be useful if we were super-boy/girl but otherwise, not so much.

I first saw this at Cloudera.

July 9, 2012

Hadoop Streaming Made Simple using Joins and Keys with Python

Filed under: Hadoop,Python,Stream Analytics — Patrick Durusau @ 10:48 am

Hadoop Streaming Made Simple using Joins and Keys with Python

From the post:

There are a lot of different ways to write MapReduce jobs!!!

Sample code for this post https://github.com/joestein/amaunet

I find streaming scripts a good way to interrogate data sets (especially when I have not worked with them yet or are creating new ones) and enjoy the lifecycle when the initial elaboration of the data sets lead to the construction of the finalized scripts for an entire job (or series of jobs as is often the case).

When doing streaming with Hadoop you do have a few library options. If you are a Ruby programmer then wukong is awesome! For Python programmers you can use dumbo and more recently released mrjob.

I like working under the hood myself and getting down and dirty with the data and here is how you can too.

Interesting post and good tips on data exploration. Can’t really query/process the unknown.

Suggestions of other data exploration examples? (Not so much processing the known but looking to “learn” about data sources.)
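
If you just want the shape of a streaming join before reading the post, here is a minimal reducer-side join sketch (this is not the code from the linked amaunet repo). It assumes the mapper has already tagged each record with its source and emitted tab-delimited key/tag/payload lines; Hadoop's sort and shuffle then deliver the lines grouped by key.

    #!/usr/bin/env python
    # Minimal reducer-side join for Hadoop Streaming (a sketch, not the code
    # from the post's repo). Assumes mapper output of the form:
    #   join_key <tab> source_tag <tab> payload
    # where source_tag is "A" (say, a small lookup table) or "B" (the fact
    # records). Hadoop groups lines by join_key before they reach the reducer.
    import sys

    def flush(key, a_rows, b_rows):
        # Emit the cross product of the two sources for this key.
        for a in a_rows:
            for b in b_rows:
                print("%s\t%s\t%s" % (key, a, b))

    current_key, a_rows, b_rows = None, [], []
    for line in sys.stdin:
        key, tag, payload = line.rstrip("\n").split("\t", 2)
        if key != current_key and current_key is not None:
            flush(current_key, a_rows, b_rows)
            a_rows, b_rows = [], []
        current_key = key
        (a_rows if tag == "A" else b_rows).append(payload)
    if current_key is not None:
        flush(current_key, a_rows, b_rows)

Wire it up with the streaming jar in the usual way (-input, -output, -mapper, -reducer), where the mapper is whatever script tags and keys your two inputs.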

July 3, 2012

Apache Flume Development Status Update

Filed under: Flume,Hadoop — Patrick Durusau @ 4:51 pm

Apache Flume Development Status Update by Hari Shreedharan.

From the post:

Apache Flume is a scalable, reliable, fault-tolerant, distributed system designed to collect, transfer, and store massive amounts of event data into HDFS. Apache Flume recently graduated from the Apache Incubator as a Top Level Project at Apache. Flume is designed to send data over multiple hops from the initial source(s) to the final destination(s). Click here for details of the basic architecture of Flume. In this article, we will discuss in detail some new components in Flume 1.x (also known as Flume NG), which is currently on the trunk branch, techniques and components that can be be used to route the data, configuration validation, and finally support for serializing events.

In the past several months, contributors have been busy adding several new sources, sinks and channels to Flume. Flume now supports Syslog as a source, where sources have been added to support Syslog over TCP and UDP.

Flume now has a high performance, persistent channel – the File Channel. This means that if the agent fails for any reason before events committed by the source have been removed by the sink and the sink’s transaction committed, the events will be reloaded from disk and can be taken when the agent starts up again. The events will only be removed from the channel when the transaction is committed by the sink. The File channel uses a Write Ahead Log to save events.

Among the other features that have been added to Flume is the ability to modify events “in flight.”

I would not construe “event” too narrowly.

Emails, tweets, arrivals, departures, temperatures, wind direction, speed, etc., can all be viewed as one or more “events.”

The merging and other implications of one or more event modifiers will be the subject of a future post.

July 2, 2012

Update on Apache Bigtop (incubating)

Filed under: Bigtop,Cloudera,Hadoop — Patrick Durusau @ 6:32 pm

Update on Apache Bigtop (incubating) by Charles Zedlewski.

If you are curious about Apache Bigtop or how Cloudera manages to distribute stable distributions of the Hadoop ecosystem, this is the post for you.

Just to whet your appetite:

From the post:

Ever since Cloudera decided to contribute the code and resources for what would later become Apache Bigtop (incubating), we’ve been answering a very basic question: what exactly is Bigtop and why should you or anyone in the Apache (or Hadoop) community care? The earliest and the most succinct answer (the one used for the Apache Incubator proposal) simply stated that “Bigtop is a project for the development of packaging and tests of the Hadoop ecosystem”. That was a nice explanation of how Bigtop relates to the rest of the Apache Software Foundation’s (ASF) Hadoop ecosystem projects, yet it doesn’t really help you understand the aspirations of Bigtop.

Building and supporting CDH taught us a great deal about what was required to be able to repeatedly assemble a truly integrated, Apache Hadoop based data management system. The build, testing and packaging cost was considerable, and we regularly observed that different projects made different design choices that made ongoing integration difficult. We also realized that more and more mission critical workload was running on CDH and the customer demand for stability, predictability and compatibility was increasing.

Apache Bigtop was part of our answer to solve these two different problems: initiate an Apache open source project that focused on creating the testing and integration infrastructure of an Apache Hadoop-based distribution. With it we hoped that:

  1. We could better collaborate within the extended Apache community to contribute to resolving test, integration & compatibility issues across projects
  2. We could create a kind of developer-focused distribution that would be able to release frequently, unencumbered by the enterprise expectations for long-term stability and compatibility.

See the post for details.

PS: The project is picking up speed and looking for developers/contributors.

July 1, 2012

HBase I/O – HFile

Filed under: Hadoop,HBase,HFile — Patrick Durusau @ 4:46 pm

HBase I/O – HFile by Matteo Bertozzi.

From the post:

Introduction

Apache HBase is the Hadoop open-source, distributed, versioned storage manager well suited for random, realtime read/write access.

Wait wait? random, realtime read/write access?

How is that possible? Is not Hadoop just a sequential read/write, batch processing system?

Yes, we’re talking about the same thing, and in the next few paragraphs, I’m going to explain to you how HBase achieves the random I/O, how it stores data and the evolution of the HBase’s HFile format.

Hadoop I/O file formats

Hadoop comes with a SequenceFile[1] file format that you can use to append your key/value pairs but due to the hdfs append-only capability, the file format cannot allow modification or removal of an inserted value. The only operation allowed is append, and if you want to lookup a specified key, you’ve to read through the file until you find your key.

As you can see, you’re forced to follow the sequential read/write pattern… but how is it possible to build a random, low-latency read/write access system like HBase on top of this?

To help you solve this problem Hadoop has another file format, called MapFile[1], an extension of the SequenceFile. The MapFile, in reality, is a directory that contains two SequenceFiles: the data file “/data” and the index file “/index”. The MapFile allows you to append sorted key/value pairs and every N keys (where N is a configurable interval) it stores the key and the offset in the index. This allows for quite a fast lookup, since instead of scanning all the records you scan the index which has less entries. Once you’ve found your block, you can then jump into the real data file.

A couple of important lessons:

First, file formats evolve. They shouldn’t be entombed by programming code, no matter how clever your code may be. That is what “versions” are for.

Second, the rapid evolution of the Hadoop ecosystem makes boundary observations strictly temporary. Wait a week or so, they will change!
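
To make the MapFile layout described above concrete, here is a toy Python model of the idea: sorted data plus a sparse index holding every Nth key and its position. It is an illustration only, not Hadoop's implementation (the real MapFile keeps its index in a separate HDFS file and records byte offsets rather than list positions); the interval and key format are arbitrary.

    import bisect

    class ToyMapFile:
        """Toy model of Hadoop's MapFile: sorted data plus a sparse index."""

        def __init__(self, sorted_pairs, interval=128):
            self.interval = interval
            self.data = list(sorted_pairs)   # stands in for the "/data" file
            self.index_keys = []             # every Nth key ...
            self.index_pos = []              # ... and its position in the data
            for pos in range(0, len(self.data), interval):
                self.index_keys.append(self.data[pos][0])
                self.index_pos.append(pos)

        def get(self, key):
            # Binary-search the small index, then scan at most `interval` entries.
            i = bisect.bisect_right(self.index_keys, key) - 1
            if i < 0:
                return None
            start = self.index_pos[i]
            for k, v in self.data[start:start + self.interval]:
                if k == key:
                    return v
            return None

    mf = ToyMapFile((("key%05d" % n, n) for n in range(10000)))
    print(mf.get("key04242"))   # -> 4242, after scanning at most 128 entries

The HFile format the post goes on to describe builds on the same look-up-a-small-index-then-seek idea.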

Apache Oozie (incubating) 3.2.0 release

Filed under: Hadoop,Oozie — Patrick Durusau @ 4:46 pm

Apache Oozie (incubating) 3.2.0 release

From the post:

This blog was originally posted on the Apache Blog for Oozie.

In June 2012, we released Apache Oozie (incubating) 3.2.0. Oozie is currently undergoing incubation at The Apache Software Foundation (see http://incubator.apache.org/oozie).

Oozie is a workflow scheduler system for Apache Hadoop jobs. Oozie Workflows are Directed Acyclical Graphs (DAGs), and they can be scheduled to run at a given time frequency and when data becomes available in HDFS.

Oozie 3.1.3 was the first incubating release. Oozie 3.1.3 added Bundle job capabilities to Oozie. A bundle job is a collection of coordinator jobs that can be managed as a single application. This is a key feature for power users that need to run complex data-pipeline applications.

Oozie 3.2.0 is the second incubating release, and the first one to include features and fixes done in the context of the Apache Community. The Apache Oozie Community is growing organically with more users, more contributors, and new committers. Speaking as one of the initial developers of Oozie, it is exciting and fulfilling to see the Apache Oozie project gaining traction and mindshare.

While Oozie 3.2.0 is a minor upgrade, it adds significant new features and fixes that make the upgrade worthwhile. Here are the most important new features:

  • Support for Hadoop 2 (YARN Map-Reduce)
  • Built in support for new workflow actions: Hive, Sqoop, and Shell
  • Kerberos SPNEGO authentication for Oozie HTTP REST API and Web UI
  • Support for proxy-users in the Oozie HTTP REST API (equivalent to Hadoop proxy users)
  • Job ACLs support (equivalent to Hadoop job ACLs)
  • Tool to create and upgrade Oozie database schema (works with Derby, MySQL, Oracle, and PostgreSQL databases)
  • Improved Job information over HTTP REST API
  • New Expression Language functions for Workflow and Coordinator applications
  • Share library per action (including only the JARs required for the specific action)

Oozie 3.2.0 also includes several improvements for performance and stability, as well as bug fixes. And, as with previous Oozie releases, we are ensuring 100% backwards compatibility with applications written for previous versions of Oozie.

For those of you who know Michael Sperberg-McQueen, these are Directed Acyclical Graphs (DAGs) put to a useful purpose in an information environment. (Yes, that is an “insider” joke.)

Another important part of the Hadoop ecosystem.

June 28, 2012

Flexible Indexing in Hadoop

Filed under: Hadoop,Indexing — Patrick Durusau @ 6:31 pm

Flexible Indexing in Hadoop by Dmitriy Ryaboy.

Summarized by Russell Jurney as:

There was much excitement about Dmitriy Ryaboy’s talk about Flexible Indexing in Hadoop (slides available). Twitter has created a novel indexing system atop Hadoop to avoid “Looking for needles in haystacks with snowplows,” that is, using MapReduce over lots of data to pick out a few records. Twitter Analytics’s new tool, Elephant Twin, goes beyond folder/subfolder partitioning schemes used by many, for instance bucketizing data by /year/month/week/day/hour. Elephant Twin is a framework for creating indexes in Hadoop using Lucene. This enables you to push filtering down into Lucene, to return a few records and to dramatically reduce the records streamed and the time spent on jobs that only parse a small subset of your overall data. A huge boon for the Hadoop Community from Twitter!

The slides plus a slide-by-slide transcript of the presentation are available.

This goes in the opposite direction from some national security efforts, which are creating bigger haystacks for the purpose of having larger haystacks.

There are a number of legitimately large haystacks in medicine, physics, astronomy, chemistry and any number of other disciplines. Grabbing all phone traffic to avoid saying you choose the < 5,000 potential subjects of interest is just bad planning.

Ambrose

Filed under: Ambrose,Hadoop,MapReduce,Pig — Patrick Durusau @ 6:31 pm

Ambrose

From the project page:

Twitter Ambrose is a platform for visualization and real-time monitoring of MapReduce data workflows. It presents a global view of all the map-reduce jobs derived from your workflow after planning and optimization. As jobs are submitted for execution on your Hadoop cluster, Ambrose updates its visualization to reflect the latest job status, polled from your process.

Ambrose provides the following in a web UI:

  • A chord diagram to visualize job dependencies and current state
  • A table view of all the associated jobs, along with their current state
  • A highlight view of the currently running jobs
  • An overall script progress bar

One of the items that Russell Jurney reports on in his summary of the Hadoop Summit 2012.

Limited to Pig at the moment but looks quite useful.

My Review of Hadoop Summit 2012

Filed under: Hadoop — Patrick Durusau @ 6:30 pm

My Review of Hadoop Summit 2012 by Russell Jurney.

I wasn’t present, but judging from Russell’s review, Hadoop Summit 2012 was a very exciting event.

I have been struggling with how to summarize an already concise post so I will just point you to Russell’s review of the conference.

There are a couple of items I will call out for special mention, but in the meantime, go read the review.

June 26, 2012

Hadoop Beyond MapReduce, Part 1: Introducing Kitten

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:01 pm

Hadoop Beyond MapReduce, Part 1: Introducing Kitten by Josh Wills

From the post:

This week, a team of researchers at Google will be presenting a paper describing a system they developed that can learn to identify objects, including the faces of humans and cats, from an extremely large corpus of unlabeled training data. It is a remarkable accomplishment, both in terms of the system’s performance (a 70% improvement over the prior state-of-the-art) and its scale: the system runs on over 16,000 CPU cores and was trained on 10 million 200×200 pixel images extracted from YouTube videos.

Doug Cutting has described Apache Hadoop as “the kernel of a distributed operating system.” Until recently, Hadoop has been an operating system that was optimized for running a certain class of applications: the ones that could be structured as a short sequence of MapReduce jobs. Although MapReduce is the workhorse programming framework for distributed data processing, there are many difficult and interesting problems– including combinatorial optimization problems, large-scale graph computations, and machine learning models that identify pictures of cats– that can benefit from a more flexible execution environment.

Hadoop 0.23 introduced a substantial re-design of the core resource scheduling and task tracking system that will allow developers to create entirely new classes of applications for Hadoop. Cloudera’s Ahmed Radwan has written an excellent overview of the architecture of the new resource scheduling system, known as YARN. Hadoop’s open-source foundation and its broad adoption by industry, academia, and government labs means that, for the first time in history, developers can assume that a common platform for distributed computing will be available at organizations all over the world, and that there will be a market for applications that take advantage of that common platform to solve problems at scales that have never been considered before.

I suppose it would not be fair to point out that a fertile human couple could duplicate this feat without 10 million images from YouTube. 😉

And while YARN is a remarkable achievement, in the United States it isn’t possible to get federal agencies to share data, much less time on computing platforms. Developers may be able to presume a common platform, but access, well, that may be a more difficult issue.

June 23, 2012

Big Data in Genomics and Cancer Treatment

Filed under: BigData,Genome,Hadoop,MapReduce — Patrick Durusau @ 6:48 pm

Big Data in Genomics and Cancer Treatment by Tanya Maslyanko.

From the post:

Why genomics?

Big data. These are two words the world has been hearing a lot lately and it has been in relevance to a wide array of use cases in social media, government regulation, auto insurance, retail targeting, etc. The list goes on. However, a very important concept that should receive the same (if not more) recognition is the presence of big data in human genome research.

Three billion base pairs make up the DNA present in humans. It’s probably safe to say that such a massive amount of data should be organized in a useful way, especially if it presents the possibility of eliminating cancer. Cancer treatment has been around since its first documented case in Egypt (1500 BC) when humans began distinguishing between malignant and benign tumors by learning how to surgically remove them. It is intriguing and scientifically helpful to take a look at how far the world’s knowledge of cancer has progressed since that time and what kind of role big data (and its management and analysis) plays in the search for a cure.

The most concerning issue with cancer, and the ultimate reason for why it still hasn’t been completely cured, is that it mutates differently for every individual and reacts in unexpected ways with people’s genetic make up. Professionals and researchers in the field of oncology have to assert the fact that each patient requires personalized treatment and medication in order to manage the specific type of cancer that they have. Elaine Mardis, PhD, co-director of the Genome Institute at the School of Medicine, believes that it is essential to identify mutations at the root of each tumor and to map their genetic evolution in order to make progress in the battle against cancer. “Genome analysis can play a role at multiple time points during a patient’s treatment, to identify ‘driver’ mutations in the tumor genome and to determine whether cells carrying those mutations have been eliminated by treatment.”

A not terribly technical but useful summary and pointers to the use of Hadoop in connection with genomics and cancer research/treatment. It may help give some substance to the buzz words “big data.”

June 21, 2012

Hortonworks Data Platform v1.0 Download Now Available

Filed under: Hadoop,HBase,HDFS,Hive,MapReduce,Oozie,Pig,Sqoop,Zookeeper — Patrick Durusau @ 3:36 pm

Hortonworks Data Platform v1.0 Download Now Available

From the post:

If you haven’t yet noticed, we have made Hortonworks Data Platform v1.0 available for download from our website. Previously, Hortonworks Data Platform was only available for evaluation for members of the Technology Preview Program or via our Virtual Sandbox (hosted on Amazon Web Services). Moving forward and effective immediately, Hortonworks Data Platform is available to the general public.

Hortonworks Data Platform is a 100% open source data management platform, built on Apache Hadoop. As we have stated on many occasions, we are absolutely committed to the Apache Hadoop community and the Apache development process. As such, all code developed by Hortonworks has been contributed back to the respective Apache projects.

Version 1.0 of Hortonworks Data Platform includes Apache Hadoop-1.0.3, the latest stable line of Hadoop as defined by the Apache Hadoop community. In addition to the core Hadoop components (including MapReduce and HDFS), we have included the latest stable releases of essential projects including HBase 0.92.1, Hive 0.9.0, Pig 0.9.2, Sqoop 1.4.1, Oozie 3.1.3 and Zookeeper 3.3.4. All of the components have been tested and certified to work together. We have also added tools that simplify the installation and configuration steps in order to improve the experience of getting started with Apache Hadoop.

I’m a member of the general public! And you probably are too! 😉

See the rest of the post for more goodies that are included with this release.

June 20, 2012

HBase Write Path

Filed under: Hadoop,HBase,HDFS — Patrick Durusau @ 4:41 pm

HBase Write Path by Jimmy Xiang.

From the post:

Apache HBase is the Hadoop database, and is based on the Hadoop Distributed File System (HDFS). HBase makes it possible to randomly access and update data stored in HDFS, but files in HDFS can only be appended to and are immutable after they are created. So you may ask, how does HBase provide low-latency reads and writes? In this blog post, we explain this by describing the write path of HBase — how data is updated in HBase.

The write path is how an HBase completes put or delete operations. This path begins at a client, moves to a region server, and ends when data eventually is written to an HBase data file called an HFile. Included in the design of the write path are features that HBase uses to prevent data loss in the event of a region server failure. Therefore understanding the write path can provide insight into HBase’s native data loss prevention mechanism.

Whether you intend to use Hadoop for topic map processing or not, this will be a good introduction to updating data in HBase. Not all applications using Hadoop are topic maps, so this may serve you in other contexts as well.
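
To fix the sequence in mind before (or after) reading Jimmy's post, here is a toy Python model of the write path it describes: a put is appended to a write-ahead log first, then applied to an in-memory store, which is flushed to an immutable file when it grows too large. This is a conceptual sketch only, not HBase code; real HBase adds regions, column families, the HLog format, compactions and much more, and the flush threshold here is arbitrary.

    import json

    class ToyRegionServer:
        """Toy model of the HBase write path: WAL -> memstore -> flushed files."""

        def __init__(self, flush_threshold=3):
            self.wal = []            # write-ahead log: replayed after a crash
            self.memstore = {}       # kept sorted in real HBase; a dict will do here
            self.hfiles = []         # immutable flushed files (HDFS is append-only)
            self.flush_threshold = flush_threshold

        def put(self, row, value):
            self.wal.append(json.dumps({"row": row, "value": value}))  # 1. log it
            self.memstore[row] = value                                 # 2. apply it
            if len(self.memstore) >= self.flush_threshold:             # 3. maybe flush
                self.hfiles.append(dict(self.memstore))
                self.memstore.clear()

        def get(self, row):
            # Check the memstore first, then the flushed files, newest first.
            if row in self.memstore:
                return self.memstore[row]
            for hfile in reversed(self.hfiles):
                if row in hfile:
                    return hfile[row]
            return None

    rs = ToyRegionServer()
    for i in range(5):
        rs.put("row-%d" % i, "v%d" % i)
    print(rs.get("row-1"))   # -> v1 (served from a flushed file)
    print(rs.get("row-4"))   # -> v4 (still in the memstore)

Because the log entry is written before the in-memory store is updated, a region server crash loses nothing that was acknowledged to the client, which is the data loss prevention mechanism the post walks through.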

June 17, 2012

MapR Now Available as an Option on Amazon Elastic MapReduce

Filed under: Amazon Web Services AWS,Hadoop,MapR,MapReduce — Patrick Durusau @ 3:59 pm

MapR Now Available as an Option on Amazon Elastic MapReduce

From the post:

MapR Technologies, Inc., the provider of the open, enterprise-grade distribution for Apache Hadoop, today announced the immediate availability of its MapR Distribution for Hadoop as an option within the Amazon Elastic MapReduce service. Customers can now provision dynamically scalable MapR clusters while taking advantage of the flexibility, agility and massive scalability of Amazon Web Services (AWS). In addition, AWS has made its own Hadoop enhancements available to MapR customers, allowing them to seamlessly use MapR with other AWS offerings such as Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB and Amazon CloudWatch.

“We’re excited to welcome MapR’s feature-rich distribution as an option for customers running Hadoop in the cloud,” said Peter Sirota, general manager of Amazon Elastic MapReduce, AWS. “MapR’s innovative high availability data protection and performance features combined with Amazon EMR’s managed Hadoop environment and seamless integration with other AWS services provides customers a powerful tool for generating insights from their data.”

Customers can provision MapR clusters on-demand and automatically terminate them after finishing data processing, reducing costs as they only pay for the resources they consume. Customers can augment their existing on-premise deployments with AWS-based clusters to improve disaster recovery and access additional compute resources as required.

“For many customers there is no longer a compelling business case for deploying an on-premise Hadoop cluster given the secure, flexible and highly cost effective platform for running MapR that AWS provides,” said John Schroeder, CEO and co-founder, MapR Technologies. “The combination of AWS infrastructure and MapR’s technology, support and management tools enables organizations to potentially lower their costs while increasing the flexibility of their data intensive applications.”

Are you doing topic maps in the cloud yet?

A rep from one of the “big iron” companies was telling me how much more reliable owning your own hardware running their software is than using the cloud.

True, but that has the same answer as the question: Who needs the capacity to process petabytes of data in real time?

Truth be told, there are a few companies and organizations that could benefit from that capability.

But the rest of us don’t have that much data or the talent to process it if we did.

Over the summer I am going to try the cloud out, both generally and for topic maps.

Suggestions/comments?

June 15, 2012

VMware’s Project Serengeti And What It Means For Enterprise Hadoop

Filed under: Hadoop,MapReduce,Serengeti — Patrick Durusau @ 3:11 pm

VMware’s Project Serengeti And What It Means For Enterprise Hadoop by Chuck Hollis.

From the post:

Virtualize something — anything — and you make it easier for everyone to consume: IT vendors, enterprise IT organizations — and, most importantly, business users. The vending machine analogy is a powerful and useful one.

At a macro level, cloud is transforming IT, and virtualization is playing a starring role.

Enterprise-enhanced flavors of Hadoop are starting to earn prized roles in an ever-growing variety of enterprise applications. At a macro level, big data is transforming business, and Hadoop is playing an important role.

The two megatrends intersect nicely in VMware’s recently announced Project Serengeti: an encapsulation of popular Hadoop distros that make big data analytics tools far easier to deploy and consume in enterprise — or service provider — settings.

And if you’re interested in big data, virtualization, cloud et. al. — you’ll want to take a moment to get more familiar with what’s going on here.

Chuck has some really nice graphics and illustrations, pitched to a largely non-technical audience.

If you want the full monty, see: Project Serengeti: There’s a Virtual Elephant in my Datacenter by Richard McDougall.

The main project page for Serengeti.

User mailing list for Serengeti.

June 14, 2012

Why My Soap Film is Better than Your Hadoop Cluster

Filed under: Algorithms,Hadoop,Humor — Patrick Durusau @ 6:56 pm

Why My Soap Film is Better than Your Hadoop Cluster

From the post:

The ever amazing slime mold is not the only way to solve complex compute problems without performing calculations. There is another: soap film. Unfortunately for soap film it isn’t nearly as photogenic as slime mold, all we get are boring looking pictures, but the underlying idea is still fascinating and ten times less spooky.

As a quick introduction we’ll lean on Long Ouyang, who has really straightforward explanation of how soap film works in Approaching P=NP: Can Soap Bubbles Solve The Steiner Tree Problem In Polynomial.

And no, this isn’t what I am writing about on Hadoop for next Monday. 😉

I point this out partially for humor.

But considering unconventional computational methods may give you ideas about more conventional things to try.

The Elephant in the Enterprise

Filed under: Hadoop — Patrick Durusau @ 6:14 pm

The Elephant in the Enterprise by Jon Zuanich.

From the post:

On Tuesday, June 12th The Churchill Club of Silicon Valley hosted a panel discussion on Hadoop’s evolution from an open-source project to becoming a standard component of today’s enterprise computing fabric. The lively and dynamic discussion was moderated by Cade Metz, Editor, Wired Enterprise.

Panelists included:

  • Michael Driscoll, CEO, Metamarkets
  • Andrew Mendelsohn, SVP, Oracle Server Technologies
  • Mike Olson, CEO, Cloudera
  • Jay Parikh, VP Infrastructure Engineering, Facebook
  • John Schroeder, CEO, MapR

By the end of the evening, this much was clear: Hadoop has arrived as a required technology. Whether provisioned in the cloud, on-premise, or using a hybrid approach, companies need Hadoop to harness the massive data volumes flowing through their organizations today and into the future. The power of Hadoop is due in part to the way it changes the economics of large-scale computing and storage, but — even more importantly — because it gives organizations a new platform to discover, analyze and ultimately monetize all of their data. To learn more about how market leaders view Hadoop and the reasons for its accelerated adoption into the heart of the enterprise, view the above video.

If you have time, try to watch this between now and next Monday.

I may be out most of the day but I have a post I will be working on over the weekend to post early Monday.

What is missing from this discussion of scale?

June 12, 2012

Introducing Hortonworks Data Platform v1.0

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 3:27 pm

Introducing Hortonworks Data Platform v1.0

John Kreisa writes:

I wanted to take this opportunity to share some important news. Today, Hortonworks announced version 1.0 of the Hortonworks Data Platform, a 100% open source data management platform based on Apache Hadoop. We believe strongly that Apache Hadoop, and therefore, Hortonworks Data Platform, will become the foundation for the next generation enterprise data architecture, helping companies to load, store, process, manage and ultimately benefit from the growing volume and variety of data entering into, and flowing throughout their organizations. The imminent release of Hortonworks Data Platform v1.0 represents a major step forward for achieving this vision.

You can read the full press release here. You can also read what many of our partners have to say about this announcement here. We were extremely pleased that industry leaders such as Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata and VMware all expressed their support and excitement for Hortonworks Data Platform.

Those who have followed Hortonworks since our initial launch already know that we are absolutely committed to open source and the Apache Software Foundation. You will be glad to know that our commitment remains the same today. We don’t hold anything back. No proprietary code is being developed at Hortonworks.

Hortonworks Data Platform was created to make it easier for organizations and solution providers to install, integrate, manage and use Apache Hadoop. It includes the latest stable versions of the essential Hadoop components in an integrated and tested package. Here is a diagram that shows the Apache Hadoop components included in Hortonworks Data Platform:

[diagram omitted]

And I thought this was going to be a slow news week. 😉

Excellent news!

Data distillation with Hadoop and R

Filed under: Data Mining,Data Reduction,Hadoop,R — Patrick Durusau @ 1:55 pm

Data distillation with Hadoop and R by David Smith.

From the post:

We’re definitely in the age of Big Data: today, there are many more sources of data readily available to us to analyze than there were even a couple of years ago. But what about extracting useful information from novel data streams that are often noisy and minutely transactional … aye, there’s the rub.

One of the great things about Hadoop is that it offers a reliable, inexpensive and relatively simple framework for capturing and storing data streams that just a few years ago we would have let slip though our grasp. It doesn’t matter what format the data comes in: without having to worry about schemas or tables, you can just dump unformatted text (chat logs, tweets, email), device “exhaust” (binary, text or XML packets), flat data files, network traffic packets … all can be stored in HDFS pretty easily. The tricky bit is making sense of all this unstructured data: the downside to not having a schema is that you can’t simply make an SQL-style query to extract a ready-to-analyze table. That’s where Map-Reduce comes in.

Think of unstructured data in Hadoop as being a bit like crude oil: it’s a valuable raw material, but before you can extract useful gasoline from Brent Sweet Light Crude or Dubai Sour Crude you have to put it through a distillation process in a refinery to remove impurities, and extract the useful hydrocarbons.

I may find this a useful metaphor because I grew up in Louisiana, where land-based oil wells were abundant and there was an oil refinery only a couple of miles from my home.

Not a metaphor that will work for everyone but one you should keep in mind.
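
To make the distillation metaphor concrete, here is a sketch of that first refining step as a Hadoop Streaming mapper in Python: raw, unstructured log lines go in, tab-delimited records ready for analysis come out. The post itself does this kind of work with R against Hadoop; the log format and field names below are invented for illustration.

    #!/usr/bin/env python
    # "Distillation" mapper sketch: extract structured fields from raw lines.
    # The input format is invented for illustration, e.g.:
    #   2012-06-12 13:55:02 user=alice action=login status=ok
    import re
    import sys

    LINE_RE = re.compile(
        r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\S+) "
        r"user=(?P<user>\S+) action=(?P<action>\S+)"
    )

    for line in sys.stdin:
        m = LINE_RE.match(line)
        if m is None:
            continue   # the impurities stay behind in the still
        print("\t".join([m.group("date"), m.group("user"), m.group("action")]))

The structured output can then be pulled into R (or anything else) for the actual analysis.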

June 9, 2012

Hadoop Streaming Support for MongoDB

Filed under: Hadoop,Javascript,MapReduce,MongoDB,Python,Ruby — Patrick Durusau @ 7:13 pm

Hadoop Streaming Support for MongoDB

From the post:

MongoDB has some native data processing tools, such as the built-in Javascript-oriented MapReduce framework, and a new Aggregation Framework in MongoDB v2.2. That said, there will always be a need to decouple persistance and computational layers when working with Big Data.

Enter MongoDB+Hadoop: an adapter that allows Apache’s Hadoop platform to integrate with MongoDB.

[graphic omitted]

Using this adapter, it is possible to use MongoDB as a real-time datastore for your application while shifting large aggregation, batch processing, and ETL workloads to a platform better suited for the task.

[graphic omitted]

Well, the engineers at 10gen have taken it one step further with the introduction of the streaming assembly for Mongo-Hadoop.

What does all that mean?

The streaming assembly lets you write MapReduce jobs in languages like Python, Ruby, and JavaScript instead of Java, making it easy for developers that are familiar with MongoDB and popular dynamic programing languages to leverage the power of Hadoop.

I like that, “…popular dynamic programming languages…” 😉

Any improvement that increases usability without requiring a religious conversion (using a programming language not your favorite) is a good move.
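
As a taste of what the streaming assembly looks like in practice, here is a sketch of a count-by-field mapper in Python. The BSONMapper wrapper below is how I recall the pymongo_hadoop helper that ships with the adapter working, so treat the interface as an assumption and check the adapter's documentation; the "category" field is hypothetical.

    #!/usr/bin/env python
    # Mapper sketch for the Mongo-Hadoop streaming assembly. Assumption: the
    # assembly's pymongo_hadoop helper exposes a BSONMapper wrapper that feeds
    # the function an iterator of BSON documents read from MongoDB -- check the
    # adapter's docs for the exact interface. "category" is a hypothetical field.
    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        for doc in documents:
            yield {"_id": doc.get("category", "unknown"), "count": 1}

    BSONMapper(mapper)

The matching reducer would use the corresponding BSONReducer wrapper to sum the counts per key and return one document per key.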
