Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

May 25, 2011

Hadoop Don’ts: What not to do to harvest Hadoop’s full potential

Filed under: Hadoop,Humor,NoSQL — Patrick Durusau @ 1:26 pm

Hadoop Don’ts: What not to do to harvest Hadoop’s full potential by Iwona Bialynicka-Birula.

From the post:

We’ve all heard this story. All was fine until one day your boss heard somewhere that Hadoop and No-SQL are the new black and mandated that the whole company switch over whatever it was doing to the Hadoop et al. technology stack, because that’s the only way to get your solution to scale to web proportions while maintaining reliability and efficiency.

So you threw away your old relational database back end and maybe all or part of your middle tier code, bought a couple of books, and after a few days of swearing got your first MapReduce jobs running. But as you finished re-implementing your entire solution, you found that not only is the system way less efficient than the old one, but it’s not even scalable or reliable and your meetings are starting more and more to resemble the Hadoop Downfall parody.

An excellent post on problems to avoid with Hadoop!

May 12, 2011

Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages

Filed under: Hadoop,MapReduce,R — Patrick Durusau @ 7:56 am

Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages

Antonio Piccolboni’s review has been summarized as:

  • Java Hadoop (mature and efficient, but verbose and difficult to program)
  • Cascading (brings an SQL-like flavor to Java programming with Hadoop)
  • Pipes/C++ (a C++ interface to programming on Hadoop)
  • Hive (a high-level SQL-like language for Hadoop, concise and expressive but limited in flexibility)
  • Pig (a new high-level language for Hadoop)
  • Rhipe (an R package for map-reduce programming with Hadoop)
  • Dumbo (a Hadoop library for python)
  • Cascalog (a powerful but obtuse lisp-based interface to Hadoop)

Read Piccolboni’s review for yourself and see what you think.
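To make the comparison concrete: most of these languages are judged by how much ceremony they remove from jobs like word count. As a hedged baseline sketch of my own (not from Piccolboni’s review), here is what the raw map and reduce steps look like when written for Hadoop Streaming in Python, roughly the level at which Dumbo and the other wrappers operate.

```python
#!/usr/bin/env python
# wordcount_streaming.py - a minimal Hadoop Streaming sketch: mapper and
# reducer in one file. Run the mapper with "map" and the reducer with
# "reduce" as the first argument; Hadoop Streaming pipes text through stdin.
import sys

def mapper():
    # Emit "word<TAB>1" for every word on every input line.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    # Hadoop sorts mapper output by key, so counts for a word arrive together.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current:
            total += int(count)
        else:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Submitted with something like hadoop jar hadoop-streaming.jar -input books -output counts -mapper "wordcount_streaming.py map" -reducer "wordcount_streaming.py reduce" -file wordcount_streaming.py (the jar name and paths vary by distribution). Pig or Hive express the same job in a line or two, which is exactly the trade-off the review weighs.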

May 10, 2011

Brisk: Simpler, More Reliable, High-Performance Hadoop Solution

Filed under: Brisk,Cassandra,Hadoop — Patrick Durusau @ 3:30 pm

DataStax Releases Dramatically Simpler, More Reliable, High-Performance Hadoop Solution

From NoSQLDatabases’ coverage of Brisk, a second-generation Hadoop solution from DataStax.

From the post:

Today, DataStax, the commercial leader in Apache Cassandra™, released DataStax’ Brisk – a second-generation open-source Hadoop distribution that eliminates the key operational complexities with deploying and running Hadoop and Hive in production. Brisk is powered by Cassandra and offers a single platform containing a low-latency database for extremely high-volume web and real-time applications, while providing tightly coupled Hadoop and Hive analytics.

Download Brisk -> Here.

May 2, 2011

HCatalog, tables and metadata for Hadoop

Filed under: Hadoop,HCatalog — Patrick Durusau @ 10:33 am

HCatalog, tables and metadata for Hadoop

HCatalog is described at its Apache site as:

Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

This includes:

  • Providing a shared schema and data type mechanism.
  • Providing a table abstraction so that users need not be concerned with where or how their data is stored.
  • Providing interoperability across data processing tools such as Pig, Map Reduce, Streaming, and Hive.

From the post:

Last month the HCatalog project (formerly known as Howl) was accepted into the Apache Incubator. We have already branched for a 0.1 release, which we hope to push in the next few weeks. Given all this activity, I thought it would be a good time to write a post on the motivation behind HCatalog, what features it will provide, and who is working on it.
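A hedged sketch of what the table abstraction in the bullets above buys you (my example, not from the post): register a table once through the HCatalog command-line tool and Pig, Hive, and MapReduce jobs can then refer to it by name instead of by HDFS path and file format. The hcat -e invocation and the column layout below are assumptions for illustration.

```python
# register_table.py - sketch: create an HCatalog-managed table from Python by
# shelling out to the hcat CLI (assumes hcat is on the PATH; the table and
# column names are made up for illustration).
import subprocess

DDL = """
CREATE TABLE web_logs (
  ip  STRING,
  ts  BIGINT,
  url STRING
)
PARTITIONED BY (dt STRING)
STORED AS RCFILE
"""

def main():
    # hcat -e runs a single DDL statement, much as hive -e does.
    subprocess.check_call(["hcat", "-e", DDL])
    # After this, a Pig script can LOAD 'web_logs' and a Hive query can
    # SELECT from web_logs without either one knowing where or how the data
    # is stored -- the "table abstraction" bullet above.
    print("web_logs registered with HCatalog")

if __name__ == "__main__":
    main()
```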

April 29, 2011

BigGarbage In -> BigGarbage Out

Filed under: Hadoop — Patrick Durusau @ 1:25 pm

Taking Hadoop Mainstream

This is just one example of any number of articles that lament how hard Hadoop is to explain to non-technical users.

Apparently a flood of applications with Hadoop “under the hood,” so to speak, is anticipated later this year.

While I don’t doubt that enormous amounts of data will be analyzed by those applications, without some underlying understanding of the data, will the results be meaningful?

Note that I said the data and not Hadoop.

Understanding Hadoop is just a technicality.

An important one, but whether one uses a cigar box with paper and pencil or the latest non-down cloud infrastructure running Hadoop, understanding the data and the operations to be performed on it is far more important.

Processing large amounts of data will not be cheap, and so the results will, of necessity, be seen as reliable. Yes? Or else we would not have spent all that money and you can see the answer to the problem is….

You can hear the future conversations as clearly as I can.

BigData simply means you have a big pile of data. (I forgo the use of the other term.)

Whether you can extract meaningful results depends on the same factors as before the advent of “BigData.”

The principal one is an understanding of the data and its limitations, which means human analysis of the data set and its gathering.

Data sets (large or not) are typically generated or used by staff, and capturing their insights into particular aspects of a data set can be done easily using a topic map.

A topic map can collect and coordinate multiple views on the use and limitations of data sets.

That way, subsequent users don’t discover too late that a particular data set is unreliable or limited in some unforeseen way.

Hadoop is an important emerging technology subject to the rule:

BigGarbage In -> BigGarbage Out.

April 28, 2011

The Rise of Hadoop…

Filed under: Hadoop — Patrick Durusau @ 3:21 pm

The Rise of Hadoop: How many Hadoop-related solutions exist?

Alex Popescu of myNoSQL enhances a CMSWire listing of fourteen (14) different Hadoop solutions by adding pointers to most of the solutions.

Thanks to Alex for that!

It always puzzles me when “content providers” refer to a site or software that can be reached online but don’t include a link.

Yes, it is easy enough for readers to search, and including the link takes time, but only once. Every reader is saved time by the presence of a link.


PS: From the CMSWire article:

Hadoop is Hard

Hadoop is not the most intuitive and easy-to-use technology. Many of the recent startups that have emerged to challenge Cloudera’s dominance have the exclusive value proposition that they make it easier to get answers from the software by abstracting the functions to higher-level products. But none of the companies has found the magic solution to bring the learning curve to a reasonable level.

Do you think topic maps could assist in “…bring[ing] the learning curve to a reasonable level”?

If so, how?

April 24, 2011

Hadoop2010: Hadoop and Pig at Twitter

Filed under: Hadoop,Pig — Patrick Durusau @ 5:33 pm

Hadoop2010: Hadoop and Pig at Twitter video of presentation by Kevin Weil.

From the description:

Apache Pig is a high-level framework built on top of Hadoop that offers a powerful yet vastly simplified way to analyze data in Hadoop. It allows businesses to leverage the power of Hadoop in a simple language readily learnable by anyone that understands SQL. In this presentation, Twitter’s Kevin Weil introduces Pig and shows how it has been used at Twitter to solve numerous analytics challenges that had become intractable with a former MySQL-based architecture.

I started to simply do the listing for the Hadoop Summit 2010 but the longer I watched Kevin’s presentation, the more I thought it needed to be singled out.

If you don’t already know Pig, you will be motivated to learn Pig after this presentation.

Hadoop Summit 2010

Filed under: Conferences,Hadoop — Patrick Durusau @ 5:33 pm

Hadoop Summit 2010

Slides and some videos from the Hadoop Summit 2010 meeting.

April 22, 2011

Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics

Filed under: Data Analysis,Hadoop,Indexing,MapReduce — Patrick Durusau @ 1:04 pm

Full-Text Indexing for Optimizing Selection Operations in Large-Scale Data Analytics by Jimmy Lin, Dmitriy Ryaboy, and Kevin Weil.

Abstract:

MapReduce, especially the Hadoop open-source implementation, has recently emerged as a popular framework for large-scale data analytics. Given the explosion of unstructured data begotten by social media and other web-based applications, we take the position that any modern analytics platform must support operations on free-text fields as first-class citizens. Toward this end, this paper addresses one ineffcient aspect of Hadoop-based processing: the need to perform a full scan of the entire dataset, even in cases where it is clearly not necessary to do so. We show that it is possible to leverage a full-text index to optimize selection operations on text fields within records. The idea is simple and intuitive: the full-text index informs the Hadoop execution engine which compressed data blocks contain query terms of interest, and only those data blocks are decompressed and scanned. Experiments with a proof of concept show moderate improvements in end-to-end query running times and substantial savings in terms of cumulative processing time at the worker nodes. We present an analytical model and discuss a number of interesting challenges: some operational, others research in nature.

I always hope when I see first-class citizen(s) in CS papers that it is going to be talking about data structures and/or metadata (hopefully both).

Alas, I was disappointed once again but the paper is an interesting one and will repay close study.

Oh, the reason I mention treating data structures and metadata as first class citizens is that it lets me avoid the my way, your way, or the highway sort of choices when it comes to metadata and formats.

Granted, some formats may be easier to use in some contexts, such as HDF5 (for space data), FITS (astronomical images), XML (for data and documents) or COBOL (for financial transactions), but if I can see formats as first class citizens, then I can map between them.

Not in a conversion sense; rather, I can treat them as though they were the format I prefer: extract data from them, write data to them, etc.
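Back to the paper’s actual contribution, the core idea is easy to sketch. The following is my own toy illustration of the strategy, not the authors’ code: a full-text index maps query terms to the compressed blocks that contain them, and the scan touches only those blocks instead of the whole dataset.

```python
# block_skip_sketch.py - toy illustration of index-informed selection:
# only decompress and scan the blocks the full-text index says contain the
# term. The block filenames and record format are stand-ins for this sketch.
import gzip
from collections import defaultdict

def build_index(block_paths):
    """Map each term to the set of block files that contain it."""
    index = defaultdict(set)
    for path in block_paths:
        with gzip.open(path, "rt") as block:
            for record in block:
                for term in record.split():
                    index[term].add(path)
    return index

def select(term, index, block_paths):
    """Scan only the blocks the index points at; skip the rest entirely."""
    candidates = index.get(term, set())
    skipped = len(block_paths) - len(candidates)
    print("skipping %d of %d blocks" % (skipped, len(block_paths)))
    for path in candidates:
        with gzip.open(path, "rt") as block:
            for record in block:
                if term in record.split():
                    yield record.rstrip("\n")

if __name__ == "__main__":
    blocks = ["block-0000.gz", "block-0001.gz"]  # stand-ins for HDFS blocks
    idx = build_index(blocks)
    for hit in select("hadoop", idx, blocks):
        print(hit)
```

In the paper the index is consulted on the Hadoop side to decide which compressed splits are handed to mappers, and the payoff the authors report is mainly in cumulative processing time at the worker nodes rather than end-to-end latency.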

Geo Analytics Tutorial – Where 2.0 2011

Filed under: Geo Analytics,Hadoop,Mechanical Turk,Pig — Patrick Durusau @ 1:04 pm

Geo Analytics Tutorial – Where 2.0 2011

Very cool set of slides on geo analytics from Pete Skomoroch.

Includes use of Hadoop, Pig, Mechanical Turk.

April 20, 2011

Adopting Apache Hadoop in the Federal Government

Filed under: Hadoop,Lucene,NoSQL,Solr — Patrick Durusau @ 2:17 pm

Adopting Apache Hadoop in the Federal Government

Background:

The United States federal government’s USASearch program provides hosted search services for government affiliate organizations, shares APIs and web services, and operates the government’s official search engine at Search.USA.gov. The USASearch affiliate program offers free search services to any federal, state, local, tribal, or territorial government agency. Several hundred websites make use of this service, ranging from the smallest municipality to larger federal sites like weather.gov and usa.gov. The USASearch program leverages the Bing API as the basis for its web results and then augments the user search experience by providing a variety of government-centric information such as related search topics and highlighted editorial content. The entire system is comprised of a suite of open-source tools and resources, including Apache Solr/Lucene, OpenCalais, and Apache Hadoop. Of these, our usage of Hadoop is the most recent. We began using Cloudera’s Distribution including Apache Hadoop (CDH3) for the first time in the Fall, and since then we’ve seen our usage grow every month— not just in scale, but in scope as well. But before highlighting everything the USASearch program is doing with Hadoop today, I should explain why we began using it in the first place.

Thoughts on how to relate topic maps to technologies that already have their foot in the door?

April 12, 2011

Cloudera Hadoop V3 Features

Filed under: Hadoop — Patrick Durusau @ 12:03 pm

Cloudera Hadoop V3 Features

There isn’t any point in copying the long feature list for V3; go have a look for yourself!

And/or bookmark the www.cloudera.com homepage.

I am particularly interested in this release because it includes support for 64-bit Ubuntu.

April 9, 2011

Graph Exploration with Hadoop MapReduce

Filed under: Graphs,Hadoop,MapReduce — Patrick Durusau @ 3:41 pm

Graph Exploration with Hadoop MapReduce

From the post:

Hi all,

sometimes you will have data where you don’t know how elements of these data are connected. This is a common usecase for graphs, this is because they are really abstract.

So if you don’t know how your data is looking like, or if you know how it looks like and you just want to determine various graph components, this post is a good chance for you to get the “MapReduce-way” of graph exploration. As already mentioned in my previous post, I ranted about message passing through DFS and how much overhead it is in comparison to BSP.

I will have to keep an eye out for the Apache Hama BSP post.

April 8, 2011

Strategies for Exploiting Large-scale Data in the Federal Government

Filed under: Hadoop,Marketing — Patrick Durusau @ 7:19 pm

Strategies for Exploiting Large-scale Data in the Federal Government

Yes, that federal government. The one in the United States that is purportedly going to shut down. Except that those responsible for the shutdown will still get paid. There’s logic in there somewhere, or so I have been told.

Nothing specifically useful but more the flavor of conversations that are taking place where people have large datasets.

April 5, 2011

Solr + Hadoop = Big Data Love

Filed under: Hadoop,Solr — Patrick Durusau @ 4:27 pm

Solr + Hadoop = Big Data Love

Interesting combination, using Solr as a key/value store.

The article mentions that it is for “smaller” data sets and later says that it handles approximately 200M “records” with reasonable response times.

That is something that gets overlooked in the rush to scale.

There are a lot of interesting data sets that are < 200M "records." The Library of Congress for example has 143 million items in its catalogs.

Perhaps your data set is < the Library of Congress?
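If you want to try the pattern yourself, here is a hedged sketch of Solr as a key/value store from Python using the pysolr client. The client choice, core name, and field names are my assumptions, not the article’s.

```python
# solr_kv_sketch.py - treating a Solr core as a simple key/value store.
# Assumes a Solr instance at localhost:8983 with a core named "kv" whose
# schema has an "id" unique key and a stored string field "value_s".
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/kv", timeout=10)

def put(key, value):
    # add() upserts on the unique key field, so repeated puts overwrite.
    solr.add([{"id": key, "value_s": value}])
    solr.commit()

def get(key):
    results = solr.search('id:"%s"' % key, rows=1)
    for doc in results:
        return doc.get("value_s")
    return None

if __name__ == "__main__":
    put("user:42", "Patrick")
    print(get("user:42"))
```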

April 4, 2011

Apache Hive 0.7.0 Released!

Filed under: Hadoop,Hive — Patrick Durusau @ 6:31 pm

Apache Hive 0.7.0 Released!

I count thirty-four (34) new features, so I am not going to list them here. There are improvements as well.

April 3, 2011

Introduction to Hadoop: Map Reduce

Filed under: Hadoop,MapReduce — Patrick Durusau @ 6:39 pm

Introduction to Hadoop: Map Reduce

Introduction to Hadoop by Steve Watt.

I may not have the first principle verbatim correct, but I liked:

Data must remain at rest.

The principle is that the work is moved to the data (due to the overhead of moving large amounts of data).

That would seem to also point in the direction of using functional programming principles as well.

April 2, 2011

Beyond MapReduce – Large Scale Graph Processing with GoldenOrb

Filed under: Graphs,Hadoop,MapReduce — Patrick Durusau @ 5:36 pm

Beyond MapReduce – Large Scale Graph Processing with GoldenOrb

While waiting for the open source release, I ran across this presentation about GoldenOrb by Zach Richardson, Co-Founder of Ravel Data.

Covers typical use cases, such as mining social networks and molecular modeling.

March 28, 2011

Do the Schimmy…

Filed under: Hadoop,MapReduce — Patrick Durusau @ 10:06 am

I first encountered the reference to the Do the Schimmy… posts at Alex Popescu’s myNoSQL site under Efficient Large-Scale Graph Analysis with Hadoop.

An excellent pair of articles on the use (and improvement of) Hadoop for graph processing.

Do the Schimmy: Efficient Large-Scale Graph Analysis with Hadoop

Question: What do PageRank, the Kevin Bacon game, and DNA sequencing all have in common?

As you might know, PageRank is one of the many features Google uses for computing the importance of a webpage based on the other pages that link to it. The intuition is that pages linked from many important pages are themselves important. In the Kevin Bacon game, we try to find the shortest path from Kevin Bacon to your favorite movie star based on who they were costars with. For example, there is a 2 hop path from Kevin Bacon to Jason Lee: Kevin Bacon starred in A Few Good Men with Tom Cruise, who also starred in Vanilla Sky with Jason Lee. In the case of DNA sequencing, we compute the full genome sequence of a person (~3 billion nucleotides) from many short DNA fragments (~100 nucleotides) by constructing and searching the genome assembly graph. The assembly graph connects fragments with the same or similar sequences, and thus long paths of a particular form can spell out entire genomes.

The common aspect for these and countless other important problems, including those in defense & intelligence, recommendation systems & machine learning, social networking analysis, and business intelligence, is the need to analyze enormous graphs: the Web consists of trillions of interconnected pages, IMDB has millions of movies and movie stars, and sequencing a single human genome requires searching for paths between billions of short DNA fragments. At this scale, searching or analyzing a graph on a single machine would be time-consuming at best and totally impossible at worst, especially when the graph cannot possibly be stored in memory on a single computer.

Do the Schimmy: Efficient Large-Scale Graph Analysis with Hadoop, Part 2

In part 1, we looked at how extremely large graphs can be represented and analyzed in Hadoop/MapReduce. Here in part 2 we will examine this design in more depth to identify inefficiencies, and present some simple solutions that can be applied to many Hadoop/MapReduce graph algorithms. The speedup using these techniques is substantial: as a prototypical example, we were able to reduce the running time of PageRank on a webgraph with 50.2 million vertices and 1.4 billion edges by as much as 69% on a small 20-core Hadoop cluster at the University of Maryland (full details available here). We expect that similar levels of improvement will carry over to many of the other problems we discussed before (the Kevin Bacon game, and DNA sequence assembly in particular).
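For readers who have not seen PageRank expressed as MapReduce, here is my own hedged, in-memory sketch of the baseline pattern the articles start from (not the Schimmy optimization itself): the mapper sends each vertex’s rank out along its edges together with its adjacency list, and the reducer sums the incoming contributions and reattaches the graph structure. Shuffling that structure on every iteration is one of the costs the Schimmy design trims.

```python
# pagerank_mr_sketch.py - one PageRank iteration in map/reduce style,
# simulated in memory. In Hadoop the same map() and reduce() would run as a
# job per iteration; the graph here is a toy adjacency list.
from collections import defaultdict

DAMPING = 0.85

def map_vertex(vertex, rank, neighbors):
    # Emit the structure so the reducer can rebuild the graph (the part the
    # Schimmy pattern avoids shuffling), plus a rank share per out-link.
    yield vertex, ("STRUCTURE", neighbors)
    for n in neighbors:
        yield n, ("RANK", rank / len(neighbors))

def reduce_vertex(vertex, values, num_vertices):
    neighbors, incoming = [], 0.0
    for kind, payload in values:
        if kind == "STRUCTURE":
            neighbors = payload
        else:
            incoming += payload
    new_rank = (1 - DAMPING) / num_vertices + DAMPING * incoming
    return vertex, new_rank, neighbors

def iterate(graph, ranks):
    grouped = defaultdict(list)              # stand-in for the shuffle phase
    for v, neighbors in graph.items():
        for key, value in map_vertex(v, ranks[v], neighbors):
            grouped[key].append(value)
    new_ranks = {}
    for v, values in grouped.items():
        v, rank, _ = reduce_vertex(v, values, len(graph))
        new_ranks[v] = rank
    return new_ranks

if __name__ == "__main__":
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    ranks = {v: 1.0 / len(graph) for v in graph}
    for _ in range(10):
        ranks = iterate(graph, ranks)
    print(ranks)
```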

March 20, 2011

Next Generation of Apache Hadoop MapReduce – The Scheduler

Filed under: Hadoop,MapReduce — Patrick Durusau @ 1:23 pm

Next Generation of Apache Hadoop MapReduce – The Scheduler

From the post:

The previous post in this series covered the next generation of Apache Hadoop MapReduce in a broad sense, particularly its motivation, high-level architecture, goals, requirements, and aspects of its implementation.

In the second post in a series unpacking details of the implementation, we’d like to present the protocol for resource allocation and scheduling that drives application execution on a Next Generation Apache Hadoop MapReduce cluster.

See also: The Next Generation of Apache Hadoop MapReduce

The Next Generation of Apache Hadoop MapReduce

Filed under: Hadoop,MapReduce — Patrick Durusau @ 1:22 pm

The Next Generation of Apache Hadoop MapReduce

From the post:

In the Big Data business running fewer larger clusters is cheaper than running more small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale high-availability is built-in from the beginning; as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Start of an important series of posts on the next generation of Apache Hadoop MapReduce.

March 10, 2011

Mahout/Hadoop on Amazon EC2 – part 1 – Installation

Filed under: Hadoop,Mahout — Patrick Durusau @ 8:11 am

Mahout/Hadoop on Amazon EC2 – part 1 – Installation

The first of 6 posts where Danny Bickson walks through use of Mahout/Hadoop on Amazon EC2.

Other posts in the series:

Mahout on Amazon EC2 – part 2 – Running Hadoop on a single node

Mahout on Amazon EC2 – part 3 – Debugging

Hadoop on Amazon EC2 – Part 4 – Running on a cluster

Mahout on Amazon EC2 – part 5 – installing Hadoop/Mahout on high performance instance (CentOS/RedHat)

Tuning Hadoop configuration for high performance – Mahout on Amazon EC2

While you are here, take some time to look around. Lots of other interesting material on “distributed/parallel large scale algorithms and applications.”

March 1, 2011

NoSQL Databases: Why, what and when

NoSQL Databases: Why, what and when by Lorenzo Alberton.

When I posted RDBMS in the Social Networks Age I did not anticipate returning the very next day with another slide deck from Lorenzo. But, after viewing this slide deck, I just had to post it.

It is a very good overview of NoSQL databases and their underlying principles, with useful graphics as well (as opposed to the other kind).

I am going to have to study his graphic technique in hopes of applying it to the semantic issues that are at the core of topic maps.

February 25, 2011

RHIPE: An Interface Between Hadoop and R for Large and Complex Data Analysis

Filed under: Hadoop,R,RHIPE — Patrick Durusau @ 4:00 pm

RHIPE: An Interface Between Hadoop and R for Large and Complex Data Analysis

Enables processing with R across data sets too large to load.

But you have to see the video to watch the retrieval from 14 GB of data that had been produced using RHIPE, or the 145 GB of SSH traffic from the Department of Homeland Security.

Very impressive.

February 22, 2011

Luke

Filed under: Hadoop,Lucene,Maps,Marketing,Search Engines — Patrick Durusau @ 1:34 pm

Luke

From the website:

Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.

Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways:

  • browse by document number, or by term
  • view documents / copy to clipboard
  • retrieve a ranked list of most frequent terms
  • execute a search, and browse the results
  • analyze search results
  • selectively delete documents from the index
  • reconstruct the original document fields, edit them and re-insert to the index
  • optimize indexes
  • open indexes consisting of multiple parts, and located on Hadoop filesystem
  • and much more…

Searching is interesting and I have several more search engines to report this week, but the real payoff is finding.

And recording the finding so that other users can benefit from it.

We could all develop our own maps of the London Underground, at the expense of repeating the effort of others.

Or, we can purchase a copy of the London Underground map.

Which one seems more cost effective for your organization?

February 19, 2011

Cascalog: Clojure-based Query Language for Hadoop – Post

Filed under: Cascalog,Hadoop — Patrick Durusau @ 4:34 pm

Cascalog: Clojure-based Query Language for Hadoop

From the post:

Cascalog, introduced in the linked article, is a query language for Hadoop featuring:

  • Simple – Functions, filters, and aggregators all use the same syntax. Joins are implicit and natural.
  • Expressive – Logical composition is very powerful, and you can run arbitrary Clojure code in your query with little effort.
  • Interactive – Run queries from the Clojure REPL.
  • Scalable – Cascalog queries run as a series of MapReduce jobs.
  • Query anything – Query HDFS data, database data, and/or local data by making use of Cascading’s “Tap” abstraction
  • Careful handling of null values – Null values can make life difficult. Cascalog has a feature called “non-nullable variables” that makes dealing with nulls painless.
  • First class interoperability with Cascading – Operations defined for Cascalog can be used in a Cascading flow and vice-versa
  • First class interoperability with Clojure – Can use regular Clojure functions as operations or filters, and since Cascalog is a Clojure DSL, you can use it in other Clojure code.

From Alex Popescu’s myNoSQL

There are a number of NoSQL query languages.

They should be considered alongside TMQL4J in TMQL discussions.

February 18, 2011

The Next Generation of Apache Hadoop MapReduce

Filed under: Algorithms,Hadoop,MapReduce,NoSQL,Topic Maps — Patrick Durusau @ 5:02 am

The Next Generation of Apache Hadoop MapReduce by Arun C Murthy (@acmurthy)

From the webpage:

In the Big Data business running fewer larger clusters is cheaper than running more small clusters. Larger clusters also process larger data sets and support more jobs and users.

The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale high-availability is built-in from the beginning; as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Since I posted the note about OpenStack and it is Friday, it seemed like a natural. Something to read over the weekend!

Saw this first at Alex Popescu’s myNoSQL – The Next Generation of Apache Hadoop MapReduce, which is sporting a new look!

February 10, 2011

Hadoop and MapReduce: Big Data Analytics (Gartner Report)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 1:58 pm

Hadoop and MapReduce: Big Data Analytics (Gartner Report)

I sacrificed my email address to view a copy of this “free” report from Gartner. Sponsored by Cloudera.

Care to guess what the second bulleted takeaway said?

Enterprises should consider adopting a packaged Hadoop distribution (e.g., Cloudera’s Distribution for Hadoop) to reduce the technical risk and increase speed of implementation of the Hadoop initiative.

The rest of it was “don’t use a hair dryer while sitting in a bathtub full of water” sort of advice.

Except tailored to Hadoop and MapReduce.

Save your email address for another day.

Spend your time at Cloudera, where you will find useful information about Hadoop and MapReduce.

January 26, 2011

Using Apache Avro – Repeatable/Shareable?

Filed under: Avro,Hadoop — Patrick Durusau @ 6:33 am

Using Apache Avro by Boris Lublinsky.

From the post:

Avro[1] is a recent addition to Apache’s Hadoop family of projects. Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages.

Avro provides functionality that is similar to the other marshalling systems such as Thrift, Protocol Buffers, etc. The main differentiators of Avro include[2]:

  • “Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
  • Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
  • No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.”
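As a concrete, hedged example of the “schema travels with the data” point (mine, not Lublinsky’s; the exact module layout of the Python avro bindings may differ by version):

```python
# avro_sketch.py - write and read an Avro container file from Python.
# The schema is embedded in the file, so the reader needs no generated code,
# and resolution across schema versions is done by field name.
import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

SCHEMA = avro.schema.parse(json.dumps({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["null", "int"], "default": None},
    ],
}))

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), SCHEMA)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": None})
writer.close()

# No schema argument needed on read: the writer's schema is in the file.
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()
```

The repeatability worry below is about the step this example glosses over: when a reader supplies a newer schema, Avro lines fields up by name, and nothing in the data records why a given mapping was considered correct.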

I wonder about the symbolic resolution of differences using field names?

At least whether it is repeatable and shareable.

By repeatable I mean that six months or even six weeks from now one understands the resolution. Not much use if the transformation is opaque to its author.

And shareable should mean that I can transfer the resolution to someone else who can then decide to follow, not follow or modify the resolution.

In another lifetime I was a sysadmin. I can count on less than one finger the number of times I would have followed a symbolic resolution that was not transparent. Simply not done.

Wait until the data folks, who must be incredibly trusting (anyone have some candy?), encounter someone who cares about critical systems and data.

Topic maps can help with that encounter.

January 25, 2011

Wukong, Bringing Ruby to Hadoop – Post

Filed under: Hadoop — Patrick Durusau @ 10:53 am

Wukong, Bringing Ruby to Hadoop

From the post:

Wukong is hands down the simplest (and probably the most fun) tool to use with hadoop. It especially excels at the following use case:

You’ve got a huge amount of data (let that be whatever size you think is huge). You want to perform a simple operation on each record. For example, parsing out fields with a regular expression, adding two fields together, stuffing those records into a data store, etc etc. These are called map only jobs. They do NOT require a reduce. Can you imagine writing a java map reduce program to add two fields together? Wukong gives you all the power of ruby backed by all the power (and parallelism) of hadoop streaming. Before we get into examples, and there will be plenty, let’s make sure you’ve got wukong installed and running locally.
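The same map-only shape in Python (a hedged Hadoop Streaming sketch rather than Wukong’s Ruby, just to show how little there is to it): parse fields, add two of them together, emit, no reducer. The tab-separated field layout is my assumption for illustration.

```python
#!/usr/bin/env python
# maponly_streaming.py - a map-only Hadoop Streaming job in Python, the kind
# of record-at-a-time task the Wukong post describes. Assumes tab-separated
# input lines of the form: user_id <TAB> clicks <TAB> impressions.
import sys

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        continue                             # skip malformed records
    user_id, clicks, impressions = parts
    total = int(clicks) + int(impressions)   # "adding two fields together"
    print("%s\t%d" % (user_id, total))
```

Run it with the streaming jar, something like -mapper maponly_streaming.py and -numReduceTasks 0, and Hadoop supplies the parallelism.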

Authoring a topic map is more than the final act of assembling the topic map. Any number of pre-assembly steps may be necessary before the final steps. Wukong is one more tool to assist in that process.

