Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

January 12, 2012

Introducing DataFu: an open source collection of useful Apache Pig UDFs

Filed under: DataFu,Hadoop,MapReduce,Pig — Patrick Durusau @ 7:34 pm

Introducing DataFu: an open source collection of useful Apache Pig UDFs

From the post:

At LinkedIn, we make extensive use of Apache Pig for performing data analysis on Hadoop. Pig is a simple, high-level programming language that consists of just a few dozen operators and makes it easy to write MapReduce jobs. For more advanced tasks, Pig also supports User Defined Functions (UDFs), which let you integrate custom code in Java, Python, and JavaScript into your Pig scripts.

Over time, as we worked on data intensive products such as People You May Know and Skills, we developed a large number of UDFs at LinkedIn. Today, I’m happy to announce that we have consolidated these UDFs into a single, general-purpose library called DataFu and we are open sourcing it under the Apache 2.0 license:

Check out DataFu on GitHub!

DataFu includes UDFs for common statistics tasks, PageRank, set operations, bag operations, and a comprehensive suite of tests. Read on to learn more.

This is way cool!

Read the rest of Matthew’s post (link above) or get thee to GitHub!
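
If you have not written a Pig UDF before, the shape of one is small. Here is a minimal EvalFunc sketch in Java (my own throwaway example, not a DataFu function), the kind of boilerplate DataFu saves you from writing for the common cases:

  import java.io.IOException;
  import org.apache.pig.EvalFunc;
  import org.apache.pig.data.Tuple;

  // Hypothetical UDF: returns the upper-cased form of the first field of a tuple.
  public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
      if (input == null || input.size() == 0 || input.get(0) == null) {
        return null;
      }
      return input.get(0).toString().toUpperCase();
    }
  }

In a Pig script you REGISTER the jar and then call the function like any built-in. DataFu ships a library of such functions, already written and tested, which is the real value here.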

MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

Filed under: Common Crawl,Hadoop,MapReduce — Patrick Durusau @ 7:32 pm

MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

From the post:

Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information into the Amazon cloud, the net total of 5 billion crawled pages. In this blog post, we’ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.

When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.

With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?

This is the very question we hope to answer with this blog post, and the example we’ll use to demonstrate how is a riff on the canonical Hadoop Hello World program, a simple word counter, but the twist is that we’ll be running it against the Internet.

When you’ve got a taste of what’s possible when open source meets open data, we’d like to whet your appetite by asking you to remix this code. Show us what you can do with Common Crawl and stay tuned as we feature some of the results!

Any takers?

It will have to wait for this weekend, but I will report back next Monday.
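
For anyone who has not seen the canonical Hadoop "Hello World" the post riffs on, here is the word counter in the new MapReduce API. This is a generic sketch of mine, not the Common Crawl example code; the twist in the post is pointing it at the crawl data instead of a local text file.

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class WordCount {
    // Emits (word, 1) for every whitespace-delimited token in a line of input.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (token.isEmpty()) continue;
          word.set(token);
          context.write(word, ONE);
        }
      }
    }

    // Sums the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
      }
    }
  }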

January 10, 2012

Oracle: “Open Source isn’t all that weird” (Cloudera)

Filed under: Cloudera,Hadoop,Oracle — Patrick Durusau @ 8:12 pm

OK, maybe that’s not an exact word-for-word quotation. 😉

Oracle selects CDH and Cloudera Manager as the Apache Hadoop Platform for the Oracle Big Data Appliance

Ed Albanese (Ed leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.) writes:

Summary: Oracle has selected Cloudera’s Distribution Including Apache Hadoop (CDH) and Cloudera Manager software as core technologies on the Oracle Big Data Appliance, a high performance “engineered system.” Oracle and Cloudera announced a multiyear agreement to provide CDH, Cloudera Manager, and support services in conjunction with Oracle Support for use on the Oracle Big Data Appliance.

Announced at Oracle Open World in October 2011, the Big Data Appliance was received with significant market interest. Oracle reported then that it would be released in the first half of 2012. Just 10 days into that period, Oracle has announced that the Big Data Appliance is available immediately.

The product itself is noteworthy. Oracle has combined Oracle hardware and software innovations with Cloudera technology to deliver what it calls an “engineered system.” Oracle has created several such systems over the past few years, including the Exadata, Exalogic, and Exalytics products. The Big Data Appliance combines Apache Hadoop with a purpose-built hardware platform and software that includes platform components such as Linux and Java, as well as data management technologies such as the Oracle NoSql database and Oracle integration software.

Read the post to get Ed’s take on what this will mean for both Cloudera and Oracle customers (positive).

I’m glad for Cloudera but also take this as validation of the overall Hadoop ecosystem. Not that it is appropriate for every application but where it is, it deserves serious consideration.

January 9, 2012

How Hadoop HDFS Works – Cartoon

Filed under: Hadoop — Patrick Durusau @ 1:38 pm

How Hadoop HDFS Works – Cartoon

Explanations don’t have to be complicated, they just have to be clear. Here is a clear one about HDFS. (click on the image for a full size view)

First saw this on myNoSQL.

An update on Apache Hadoop 1.0

Filed under: Hadoop — Patrick Durusau @ 1:35 pm

An update on Apache Hadoop 1.0 from Cloudera by Charles Zedlewski.

From the post:

Some users & customers have asked about the most recent release of Apache Hadoop, v1.0: what’s in it, what it followed and what it preceded. To explain this we should start with some basics of how Apache projects release software:

By and large, in Apache projects new features are developed on a main codeline known as “trunk.” Occasionally very large features are developed on their own branches with the expectation they’ll later merge into trunk. While new features usually land in trunk before they reach a release, there is not much expectation of quality or stability. Periodically, candidate releases are branched from trunk. Once a candidate release is branched it usually stops getting new features. Bugs are fixed and after a vote, a release is declared for that particular branch. Any member of the community can create a branch for a release and name it whatever they like.

About as clear an explanation of the Apache process and current state of Hadoop releases as is possible, given the facts Charles had to work with.

Still, for the average Cloudera release user I think something along the lines of:

There has been some confusion over the jump from 0.2* versions of Hadoop to a release of Hadoop 1.0 at Apache.

You have not missed various 0.3* and later releases!

Like political candidates, Apache releases can call themselves anything they like. The Hadoop project leaders decided to call a recent release Hadoop 1.0. Given the confusion this caused, maybe we will see more orderly naming in the future. Maybe not.

If you have CDH3, then you have all the features of the recent “Hadoop 1.0” and have had them for almost one year. (If you don’t have CDH3, you may wish to consider upgrading.)

(then conclude with)

[T]he CDH engineering team is comprised of more than 20 engineers that are committers and PMC members of the various Apache projects who can shape the innovation of the extended community into a single coherent system. It is why we believe demonstrated leadership in open source contribution is the only way to harness the open innovation of the Apache Hadoop ecosystem.

would have been better.

People are looking for a simple explanation with some reassurance that all is well.

January 8, 2012

Mr. MapR: A Xoogler

Filed under: Hadoop,MapR,MapReduce — Patrick Durusau @ 7:09 pm

Mr. MapR: A Xoogler

Cynthia Murrell of BeyondSearch writes:

Wired Enterprise gives us a glimpse into MapR, a new distribution for Apache Hadoop, in “Ex-Google Man Sells Search Genius to Rest of World.” The ex-Googler in this case is M.C. Srivas, who was so impressed with Google’s MapReduce platform that he decided to spread its concepts to the outside world.

Sounds great! So I head over to the MapR site and choose Unique Features of MapR Hadoop Distribution, where I find:

  • Finish small jobs quickly with MapR ExpressLane
  • Mount your Hadoop cluster with Direct Access NFS™
  • Enable realtime data flows
  • Use the MapR Heatmap™, alerts, and alarms to monitor your cluster
  • Manage your data easily with Volumes
  • Scale up and create an unlimited number of files
  • Get jobs done faster with half the hardware
  • Eliminate downtime and performance bottlenecks with Distributed NameNode HA
  • Eliminate lost jobs with HA Jobtracker
  • Enable Point-in-time Recovery with MapR Snapshots
  • Synchronize data across clusters with Mirroring
  • Let multiple jobs safely share your Hadoop cluster
  • Control data placement for improved performance, security or manageability

Maybe I am missing it. Do you see any Search Genius in that list?

MapR may have improved the usability/reliability of Hadoop, which is no small thing, but disappointing when looking for better search results.

Let’s represent the original Hadoop with this Wikipedia image:

[image: "Additor"]

and the MapR version of Hadoop with this Wikipedia image:

[image not shown]

It is true that the MapR version has more unique features but none of them appear to relate to search.

I am sure that Hadoop cluster managers and others will be interested in MapR (as will some of the rest of us), as managers.

As searchers, we may have to turn somewhere else. Do you disagree?

PS: Cloudera has made more contributions to the Hadoop and Apache communities than I can list in a very long post. Keep that in mind when you see ill-mannered and juvenile sniping at their approach to Hadoop.

January 7, 2012

Caching in HBase: SlabCache

Filed under: Cloud Computing,Hadoop,HBase — Patrick Durusau @ 4:06 pm

Caching in HBase: SlabCache by Li Pi.

From the post:

The amount of memory available on a commodity server has increased drastically in tune with Moore’s law. Today, it’s very feasible to have up to 96 gigabytes of RAM on a mid-end, commodity server. This extra memory is good for databases such as HBase which rely on in-memory caching to boost read performance.

However, despite the availability of high memory servers, the garbage collection algorithms available on production quality JDK’s have not caught up. Attempting to use large amounts of heap will result in the occasional stop-the-world pause that is long enough to cause stalled requests and timeouts, thus noticeably disrupting latency sensitive user applications.

Introduces management of the file system cache for those with loads and memory to justify and enable it.
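
The trick is that memory allocated off the Java heap is invisible to the garbage collector, so a large cache no longer drives the long pauses described above. A stripped-down illustration of the pattern using plain JDK direct buffers (my sketch, not the actual SlabCache code):

  import java.nio.ByteBuffer;

  // Off-heap "slab": the GC never scans this memory, only the small wrapper object.
  public class OffHeapSlab {
    private final ByteBuffer slab;

    public OffHeapSlab(int sizeInBytes) {
      this.slab = ByteBuffer.allocateDirect(sizeInBytes);
    }

    // Copy a block into the slab at a fixed offset.
    public void put(int offset, byte[] block) {
      ByteBuffer view = slab.duplicate();
      view.position(offset);
      view.put(block);
    }

    // Copy a block back out into an on-heap array for the caller.
    public byte[] get(int offset, int length) {
      byte[] out = new byte[length];
      ByteBuffer view = slab.duplicate();
      view.position(offset);
      view.get(out);
      return out;
    }
  }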

Quite interesting work, particularly if you are ignoring the nay-sayers about the adoption of Hadoop and the Cloud in the coming year.

What the nay-sayers are missing is that yes, unimaginative mid-level managers and admins have no interest in Hadoop or the Cloud. What Hadoop and the Cloud present are opportunities that imaginative re-packagers and re-processing startups are going to use to provide new data streams and services.

Can’t ask startups that don’t exist yet why they have chosen to go with Hadoop and the Cloud.

That goes unnoticed by unimaginative commentators, who reflect the opinions of uninformed managers, whose opinions are in turn confirmed when those same commentators publish their columns. One of those feedback loops I mentioned earlier today.

January 6, 2012

Katta – Lucene & more in the cloud

Filed under: Hadoop,Katta,Lucene — Patrick Durusau @ 11:39 am

Katta – Lucene & more in the cloud

From the webpage:

Katta is a scalable, failure tolerant, distributed, data storage for real time access.

Katta serves large, replicated, indices as shards to serve high loads and very large data sets. These indices can be of different type. Currently implementations are available for Lucene and Hadoop mapfiles.

  • Makes serving large or high load indices easy
  • Serves very large Lucene or Hadoop Mapfile indices as index shards on many servers
  • Replicate shards on different servers for performance and fault-tolerance
  • Supports pluggable network topologies
  • Master fail-over
  • Fast, lightweight, easy to integrate
  • Plays well with Hadoop clusters
  • Apache Version 2 License

Now that the “new” has worn off of your holiday presents, ;-), something to play with over the weekend.

January 4, 2012

Hadoop for Archiving Email – Part 2

Filed under: Hadoop,Indexing,Lucene,Solr — Patrick Durusau @ 9:40 am

Hadoop for Archiving Email – Part 2 by Sunil Sitaula.

From the post:

Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.

Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:

Continues Part 1 (my blog post) and mentions several applications and libraries that will be useful for indexing email.
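
If you have not driven Lucene directly, the indexing side is only a few lines. A minimal sketch against the Lucene 3.x API of the era (field names and values are mine; the real pipeline would feed documents out of the MapReduce job described in the posts):

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class MailIndexer {
    public static void main(String[] args) throws Exception {
      Directory dir = FSDirectory.open(new File("/tmp/mail-index"));
      IndexWriterConfig cfg =
          new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
      IndexWriter writer = new IndexWriter(dir, cfg);

      // One Lucene Document per archived email message.
      Document doc = new Document();
      doc.add(new Field("message_id", "1234@example.com", Field.Store.YES, Field.Index.NOT_ANALYZED));
      doc.add(new Field("subject", "Quarterly report", Field.Store.YES, Field.Index.ANALYZED));
      doc.add(new Field("body", "Please find attached ...", Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);

      writer.close();
    }
  }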

January 3, 2012

What’s New in Apache Sqoop 1.4.0-incubating

Filed under: Hadoop,Sqoop — Patrick Durusau @ 9:21 am

What’s New in Apache Sqoop 1.4.0-incubating

The post walks through the new features and improvements in the first incubating release, concluding:

If you are interested in learning more about the changes, a complete list for Sqoop 1.4.0-incubating can be found here.  You are also encouraged to give this new release a try.  Any help and feedback is more than welcome. For more information on how to report problems and to get involved, visit the Sqoop project website at http://incubator.apache.org/sqoop/.

BTW, “Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.” (From Apache Sqoop (incubating))

January 1, 2012

Gora Graduates!

Filed under: Cassandra,Hadoop,HBase,Hive,Lucene,MapReduce,Pig,Solr — Patrick Durusau @ 5:54 pm

Gora Graduates! (Incubator location)

Over Twitter I just saw a post announcing that Gora has graduated from the Apache Incubator!

Congratulations to all involved.

Oh, the project:

What is Gora?

Gora is an ORM framework for column stores such as Apache HBase and Apache Cassandra with a specific focus on Hadoop.

Why Gora?

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use ORM framework with data store specific mappings and built-in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

  • Data Persistence : Persisting objects to column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL, HSQLDB; and flat files in the local file system or Hadoop HDFS.
  • Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location.
  • Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
  • Analysis : Accessing the data and performing analysis through adapters for Apache Pig, Apache Hive and Cascading
  • MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.

December 30, 2011

Hadoop Hits 1.0!

Filed under: Hadoop,HBase — Patrick Durusau @ 6:13 pm

Hadoop Hits 1.0!

From the news:

After six years of gestation, Hadoop reaches 1.0.0! This release is from the 0.20-security code line, and includes support for:

  • security
  • HBase (append/hsynch/hflush, and security)
  • webhdfs (with full support for security)
  • performance enhanced access to local files for HBase
  • other performance enhancements, bug fixes, and features

Please see the complete Hadoop 1.0.0 Release Notes for details.

With the release prior to this one being 0.22.0, I was reminded of a publication by the Union of Concerned Scientists that had a clock on the cover, showing how close the world was to a nuclear "midnight." Always counting towards midnight, except for one or more occasions when more time was added. The explanation I remember was that these were nuclear scientists, not clock experts. 😉

I am sure there will be some explanation for the jump in revisions that will pass into folklore and then into publications about Hadoop.

In the meantime, I would suggest that we all download copies and see what 2012 holds with Hadoop 1.0 under our belts.

December 28, 2011

Apache HBase 0.90.5 is now available

Filed under: Hadoop,HBase — Patrick Durusau @ 9:31 pm

Apache HBase 0.90.5 is now available

From Jonathan Hsieh at Cloudera:

Apache HBase 0.90.5 is now available. This release of the scalable distributed data store inspired by Google’s BigTable is a fix release that covers 81 issues, including 5 considered blockers and 11 considered critical. It addresses several robustness and resource leakage issues, fixes rare data-loss scenarios having to do with splits and replication, and improves the atomicity of bulk loads. This version also includes some new supporting features, including improvements to hbck and an offline meta-rebuild disaster recovery mechanism.

The 0.90.5 release is backward compatible with 0.90.4. Many of the fixes in this release will be included as part of CDH3u3.

I like the HBase page:

Welcome to Apache HBase!

HBase is the Hadoop database. Think of it as a distributed scalable Big Data store.

When Would I Use HBase?

Use HBase when you need random, realtime read/write access to your Big Data. This project’s goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Concise, to the point, either you are interested or you are not. Doesn’t waste time on hand wringing about “big data,” “oh, what shall we do?,” or parades of data horrors.

Do you think something similar for topic maps would need an application area approach? That is to focus on a particular problem deeply rather than all the possible uses of topic maps?
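
Back to HBase itself: to make "random, realtime read/write access" concrete, here is roughly what the client side looks like with the 0.90-era Java API (the table, family and column names are invented):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseHello {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "mytable");

      // Write one cell: row "row1", family "cf", qualifier "greeting".
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
      table.put(put);

      // Read it back.
      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"));
      System.out.println(Bytes.toString(value));

      table.close();
    }
  }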

December 24, 2011

Mapreduce & Hadoop Algorithms in Academic Papers (5th update – Nov 2011)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 4:44 pm

Mapreduce & Hadoop Algorithms in Academic Papers (5th update – Nov 2011)

From the post:

The prior update of this posting was in May, and a lot has happened related to Mapreduce and Hadoop since then, e.g.

1) big software companies have started offering hadoop-based software (Microsoft and Oracle),

2) Hadoop-startups have raised record amounts, and

3) nosql-landscape becoming increasingly datawarehouse’ish and sql’ish with the focus on high-level data processing platforms and query languages.

Personally I have rediscovered Hadoop Pig and combine it with UDFs and streaming as my primary way to implement mapreduce algorithms here in Atbrox.

Best regards,

Amund Tveit (twitter.com/atveit)

Only new material in this post, which I think makes the listing easier to use.

But in any event, a very useful service to the community!

December 23, 2011

Machine Learning and Hadoop

Filed under: Hadoop,Machine Learning — Patrick Durusau @ 4:28 pm

Machine Learning and Hadoop

Interesting slide deck from Josh Wills,Tom Pierce, and Jeff Hammerbacher of Cloudera.

The mention of “Pareto optimization” reminds me of a debate tournament judge who had written his dissertation on that topic, and who carefully pointed out that it wasn’t possible to “know” how close (or far away) a society was from any optimal point. 😉 Oh well, it was a “case” that sounded good to opponents unfamiliar with economic theory, at any rate. An example of those “critical evaluation” skills I was talking about a day or so ago.

Not that you can’t benefit from machine learning and Hadoop. You can, but ask careful questions and persist until you are given answers that make sense. To you. With demonstrable results.

In other words, don’t be afraid to ask “stupid” questions and keep on asking them until you are satisfied with the answers. Or hire someone who is willing to play that role.

December 18, 2011

Introducing Hadoop in 20 pages

Filed under: Hadoop,MapReduce — Patrick Durusau @ 8:36 pm

Introducing Hadoop in 20 pages by Prashant Sharma.

Getting started in hadoop for a newbie is a non trivial task, with amount of knowledge base available a significant amount of effort is gone in figuring out, where and how should one start exploring this field. Introducing hadoop in 20 pages is a concise document to briefly introduce just the right information in right amount, before starting out in-depth in this field. This document is intended to be used as a first and shortest guide to both understand and use Map-Reduce for building distributed data processing applications.

Well, counting the annexes it’s 35 pages but still useful. Could use some copy editing.

Disappointing, because an introduction to the entire Hadoop ecosystem carrying a single example, even an inverted index of a text, would be a better exercise at this point in Hadoop’s development. Two versions would work: one with the code examples at the end, for people who want a high-level view, and one with the code inline and commented, for people who want code to follow along.
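
To be concrete about the exercise I have in mind, the mapper and reducer for an inverted index are barely longer than the word counter. A rough sketch (using the input file name as the document id, which is a simplification):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileSplit;

  public class InvertedIndex {
    // Emits (term, documentId) for every token in the input.
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
        for (String token : value.toString().toLowerCase().split("\\W+")) {
          if (!token.isEmpty()) {
            context.write(new Text(token), new Text(docId));
          }
        }
      }
    }

    // Collects the distinct document ids for each term into a postings list.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        StringBuilder postings = new StringBuilder();
        for (Text docId : values) {
          if (postings.indexOf(docId.toString()) < 0) {
            postings.append(docId).append(' ');
          }
        }
        context.write(key, new Text(postings.toString().trim()));
      }
    }
  }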

December 15, 2011

EMC Greenplum puts a social spin on big data

Filed under: BigData,Chorus,Facebook,Hadoop — Patrick Durusau @ 7:53 pm

EMC Greenplum puts a social spin on big data

From the post:

Greenplum, the analytics division of EMC, has announced new software that lets data analysts explore all their organization’s data and share interesting findings and data sets Facebook-style among their colleagues. The product is called Chorus, and it wraps around EMC’s Greenplum Database and Hadoop distribution, making all that data available for the data team to work with.

The pitch here is about unifying the analytic database and Hadoop environments and making it as easy and collaborative as possible to work with data, since EMC thinks a larger percentage of employees will have to figure out how to analyze business data. Plus, because EMC doesn’t have any legacy database or business intelligence products to protect, the entire focus of the Greenplum division is on providing the best big-data experience possible.

From the Chorus product page:

Greenplum Chorus enables Big Data agility for your data science team. The first solution of its kind, Greenplum Chorus provides an analytic productivity platform that enables the team to search, explore, visualize, and import data from anywhere in the organization. It provides rich social network features that revolve around datasets, insights, methods, and workflows, allowing data analysts, data scientists, IT staff, DBAs, executives, and other stakeholders to participate and collaborate on Big Data. Customers deploy Chorus to create a self-service agile analytic infrastructure; teams can create workspaces on the fly with self-service provisioning, and then instantly start creating and sharing insights.

Chorus breaks down the walls between all of the individuals involved in the data science team and empowers everyone who works with your data to more easily collaborate and derive insight from that data.

Note to EMC Greenplum: If you want people to at least consider products, don’t hide them so that searching is necessary to find them. Just an FYI.

The Resources page is pretty thin but better than the blah-blah “more information” page. It could have more details, perhaps a demo version?

A button that says “Contact Sales” makes me lose interest real quick. I don’t need some software sales person pinging me during an editing cycle to know if I have installed the “free” software yet and whether I am ready to order. Buying software really should be on my schedule, not his/hers. Yes?

December 14, 2011

Cloudera Manager 3.7 released

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:46 pm

Cloudera Manager 3.7 released

From the post:

Cloudera Manager 3.7, a major new version of Cloudera’s Management applications for Apache Hadoop, is now available. Cloudera Manager Free Edition is a free download, and the Enterprise edition of Cloudera Manager is available as part of the Cloudera Enterprise subscription.

Cloudera Manager 3.7 includes several new features and enhancements:

  • Automated Hadoop Deployment – Cloudera Manager 3.7 allows you to install the complete Hadoop stack in minutes. We’ve now upgraded Cloudera Manager with the easy installation we first introduced in version 3.6 of SCM Express. (SCM Express has now been replaced by Cloudera Manager Free Edition.)
  • Centralized Management UI – Version 3.5 of the Cloudera Management Suite included distinct modules for Resource Management, Activity Monitoring and Service and Configuration Management. In Cloudera Manager 3.7, all of these feature sets are now integrated into one centralized browser-based administration console.
  • Service & Configuration Management – We added several new configuration wizards to guide you in properly configuring HDFS and HBase host deployments, adding new hosts on demand, and adding/restarting services as needed. Cloudera Manager 3.7 now also manages Oozie and Hue.
  • Service Monitoring – Cloudera Manager monitors the health of your key Hadoop services—HDFS, HBase, MapReduce—and displays alerts on suspicious or bad health. For example, to determine the health of HDFS, Cloudera Manager measures the percentage of corrupt, missing, or under-replicated blocks. Cloudera Manager also checks if the NameNode is swapping memory or spending too much time in Garbage Collection, and whether HDFS has enough free space. Trends in relevant metrics can be visualized through time-series charts.
  • Log Search – You can search through all logs for Hadoop services across the whole cluster. You can also view results filtered by service, role, host, search phrase and log event severity.
  • Events and Alerts – Cloudera Manager proactively reports on important events such as the change in a service’s health, detection of a log message of appropriate severity, or the slowness (or failure) of a job. Cloudera Manager aggregates the events for easy filtering and viewing, and you can configure Cloudera Manager to send email alerts.
  • Global Time Control – You can view the state of your system for any time period in the past. Combined with health state, events and log information, this feature serves as a powerful diagnostic tool.
  • Role-based Administration – Cloudera Manager 3.7 supports two types of users: admin users, who can change configs and execute commands and workflows; and read-only users, who can only monitor the system.
  • Configuration versioning and Audit trails – You can view a complete history of configuration changes with user annotations. You can roll-back to previous configuration states.
  • Activity Monitoring – The Activity Monitoring feature includes several performance and scale improvements.
  • Operational Reports – The ‘Resource Manager’ feature in the Cloudera Management Suite 3.5 is now in Cloudera Manager’s ‘Reports’ feature. You can visualize disk usage by user, group, and directory; you can track MapReduce activity on the cluster by job, or by user.
  • Support Integration – We’ve improved the Cloudera support experience by adding a feature that lets you send a snapshot of your cluster state to our support team for expedited resolution.
  • Cloudera Manager Free Edition and 1-click Upgrade – The Free Edition of Cloudera Manager includes a subset of the features described above. After you install Cloudera Manager Free Edition, you can easily upgrade to the Enterprise edition by entering a license key. Your data will be preserved as the Cloudera Manager wizard guides you through the upgrade.

You can download the new Cloudera Manager 3.7 at: https://ccp.cloudera.com/display/SUPPORT/Downloads . Check it out. We look forward to your feedback.

P.S. : We’re hiring! Visit: http://www.cloudera.com/company/careers

Makes you wonder where we will be a year from now. Not just with Hadoop, but with algorithms for graphs, functional data structures, etc.

Care to make any forecasts?

December 10, 2011

Exploring Hadoop OutputFormat

Filed under: Hadoop — Patrick Durusau @ 8:06 pm

Exploring Hadoop OutputFormat by Jim.Blomo.

From the post:

Hadoop is often used as a part in a larger ecosystem of data processing. Hadoop’s sweet spot, batch processing large amounts of data, can best be put to use by integrating it with other systems. At a high level, Hadoop ingests input files, streams the contents through custom transformations (the Map-Reduce steps), and writes output files back to disk. Last month InfoQ showed how to gain finer control over the first step, ingestion of input files via the InputFormat class. In this article, we’ll discuss how to customize the final step, writing the output files. OutputFormats let you easily interoperate with other systems by writing the result of a MapReduce job in formats readable by other applications. To demonstrate the usefulness of OutputFormats, we’ll discuss two examples: how to split up the result of a job into different directories, and how to write files for a service providing fast key-value lookups.

One more set of tools to add to your Hadoop toolbox!
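
The first of the two examples, splitting job output across directories, is commonly handled with MultipleOutputs. A rough sketch of how that looks in a reducer (the routing rule and output path are mine; see the article for the full treatment, including job setup):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

  // Routes each record to a sub-directory based on the first letter of its key.
  public class RoutingReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> out;

    @Override
    protected void setup(Context context) {
      out = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      String k = key.toString();
      String bucket = k.isEmpty() ? "other" : k.substring(0, 1);
      // The third argument is the base output path, which becomes a sub-directory.
      out.write(key, new IntWritable(sum), bucket + "/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      out.close();
    }
  }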

Scheduling in Hadoop

Filed under: Hadoop — Patrick Durusau @ 8:04 pm

Scheduling in Hadoop: An introduction to the pluggable scheduler framework, by M. Tim Jones, Consultant Engineer and independent author.

Summary:

Hadoop implements the ability for pluggable schedulers that assign resources to jobs. However, as we know from traditional scheduling, not all algorithms are the same, and efficiency is workload and cluster dependent. Get to know Hadoop scheduling, and explore two of the algorithms available today: fair scheduling and capacity scheduling. Also, learn how these algorithms are tuned and in what scenarios they’re relevant.

Not all topic maps are going to need Hadoop but enough will to make knowing the internals of Hadoop a real plus!
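
For orientation, swapping in one of these schedulers is a small configuration change on the JobTracker. In the Hadoop 0.20/1.0 line the fair scheduler is enabled in mapred-site.xml along these lines (check the article and your distribution’s documentation for the exact property names and the jar to put on the classpath):

  <!-- mapred-site.xml on the JobTracker -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
  <!-- Optional: per-pool settings live in a separate allocation file. -->
  <property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/etc/hadoop/conf/fair-scheduler.xml</value>
  </property>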

December 9, 2011

Crunch for Dummies

Filed under: Crunch,Flow-Based Programming (FBP),Hadoop — Patrick Durusau @ 8:21 pm

Crunch for Dummies by Brock Noland

From the post:

This guide is intended to be an introduction to Crunch.

Introduction

Crunch is used for processing data. Crunch builds on top of Apache Hadoop to provide a simpler interface for Java programmers to process data. In Crunch you create pipelines, not unlike Unix pipelines, such as the command shown in the guide.

Interesting coverage of Crunch.

I don’t know that I agree with the characterization:

… using Hadoop …. require[s] learning a complex process called MapReduce or a higher level language such as Apache Hive or Apache Pig.

True, to use Hadoop means learning MapReduce, Hive, or Pig, but I don’t think of them as being all that complex. Besides, once you have learned them, the benefits are considerable.

But, to each his own.
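
For a flavor of the pipeline style the guide describes, here is roughly what a word count looks like in Crunch’s Java API as it stood at the time (the package names shifted when the project later moved to Apache, so treat this as a sketch rather than copy-and-paste code):

  import com.cloudera.crunch.DoFn;
  import com.cloudera.crunch.Emitter;
  import com.cloudera.crunch.PCollection;
  import com.cloudera.crunch.PTable;
  import com.cloudera.crunch.Pipeline;
  import com.cloudera.crunch.impl.mr.MRPipeline;
  import com.cloudera.crunch.lib.Aggregate;
  import com.cloudera.crunch.type.writable.Writables;

  public class CrunchWordCount {
    public static void main(String[] args) throws Exception {
      Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
      PCollection<String> lines = pipeline.readTextFile(args[0]);

      // Split each line into words, much like a stage in a Unix pipeline.
      PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
        @Override
        public void process(String line, Emitter<String> emitter) {
          for (String word : line.split("\\s+")) {
            emitter.emit(word);
          }
        }
      }, Writables.strings());

      // Count occurrences of each word and write the result out.
      PTable<String, Long> counts = Aggregate.count(words);
      pipeline.writeTextFile(counts, args[1]);
      pipeline.done();
    }
  }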

You might also be interested in: Introducing Crunch: Easy MapReduce Pipelines for Hadoop.

December 8, 2011

Cloudera Manager

Filed under: Hadoop — Patrick Durusau @ 7:59 pm

Cloudera Manager: End-to-End Administration for Apache Hadoop

From the post:

Cloudera Manager is the industry’s first end-to-end management application for Apache Hadoop. With Cloudera Manager, you can easily deploy and centrally operate a complete Hadoop stack. The application automates the installation process, reducing deployment time from weeks to minutes; gives you a cluster wide, real time view of nodes and services running; provides a single, central place to enact configuration changes across your cluster; and incorporates a full range of reporting and diagnostic tools to help you optimize cluster performance and utilization.

This looks very cool!

I need to get some shelving for commodity boxes this coming year so I can test this sort of thing. 😉

December 6, 2011

Introducing Shep

Filed under: Hadoop,Shep,Splunk — Patrick Durusau @ 7:57 pm

Introducing Shep

From the post:

These are exciting times at Splunk, and for Big Data. During the 2011 Hadoop World, we announced our initiative to combine Splunk and Hadoop in a new offering. The heart of this new offering is an open source component called Shep. Shep is what will enable seamless two-way data-flow across the systems, as well as opening up two-way compute operations across data residing in both systems.

Use Cases

The thing that intrigues us most is the synergy between Splunk and Hadoop. The ways to integrate are numerous, and as the field evolves and the project progresses, we can see more and more opportunities to provide powerful solutions to common problems.

Many of our customers are indexing terabytes per day, and have also spun up Hadoop initiatives in other parts of the business. Splunk integration with Hadoop is part of a broader goal at Splunk to break down barriers to data-silos, and open them up to availability across the enterprise, no matter what the source. To itemize some categories we’re focused on, listed here are some key use cases:

  • Query both Splunk and Hadoop data, using Splunk as a “single-pane-of-glass”
  • Data transformation utilizing Splunk search commands
  • Real-time analytics of data streams going to multiple destinations
  • Splunk as data warehouse/marts for targeted exploration of HDFS data
  • Data acquisition from logs and APIs via Splunk Universal Forwarder

Read the post to learn the features that are supported now or soon will be in Shep.

Now in private beta but it sounds worthy of a “heads up!”

November 25, 2011

Hadapt is moving forward

Filed under: Data Warehouse,Hadapt,Hadoop,Query Language — Patrick Durusau @ 4:27 pm

Hadapt is moving forward

A bullet-point type review, mostly a summary of information from the vendor. Not a bad thing; it can be useful. But you would think that when reviewing a vendor or their product, there would be a link to the vendor/product. Yes? Not one that I can find in that post.

Let me make it easy for you: Hadapt.com. How hard was that? Maybe 10 seconds of my time, and that only because I have gotten slow. The point of the WWW, at least as I understand it, is to make information more accessible to users. But it doesn’t happen by itself. Put in hyperlinks where appropriate.

There is a datasheet on the Adaptive Analytic Platform™.

You can follow the link for the technical report and register, but it is little more than a sales brochure.

More informative is: Efficient Processing of Data Warehousing Queries in a Split Execution Environment.

I don’t have a local setup that would exercise Hadapt. If you do or if you are using it in the cloud, would appreciate any comments or pointers you have.

November 24, 2011

Front-end view generation with Hadoop

Filed under: Hadoop,Web Applications — Patrick Durusau @ 3:44 pm

Front-end view generation with Hadoop by Pere Ferrera.

From the post:

One of the most common uses for Hadoop is building “views”. The usual case is that of websites serving data in a front-end that uses a search index. Why do we want to use Hadoop to generate the index being served by the website? There are several reasons:

  • Parallelism: When the front-end needs to serve a lot of data, it is a good idea to divide them into “shards”. With Hadoop we can parallelize the creation of each of these shards so that both the generation of the view and service of it will be scaled and efficient.
  • Efficiency: In order to maximize the efficiency and the speed of a front-end, it is convenient to separate the generation from the serving of the view. The generation will be done by a back-end process whereas the serving will be done by a front-end; in this way we are freeing the front-end from the load that can be generated while indexing.
  • Atomicity: It is often convenient to have a method for generating and deploying views atomically. In this way, if the deployment fails, we can always go back to previous complete versions (rollback) easily. If the generation went badly we can always generate a new full view where the error will be solved in all the registers. Hadoop allows us to generate views atomically because it is batch-oriented. Some search engines / databases allow atomic deployment by doing a hot-swap of their data.

Covers use of Solr and Voldemort by example.

Concludes by noting this isn’t a solution for real-time updating but one suspects that isn’t a universal requirement across the web.

Plus see the additional resources suggested at the end of the post. You won’t (shouldn’t be) disappointed.

November 23, 2011

Coming Attractions: Apache Hive 0.8.0

Filed under: Hadoop,Hive — Patrick Durusau @ 7:52 pm

Coming Attractions: Apache Hive 0.8.0 by Carl Steinbach.

Apache Hive 0.8.0 won’t arrive for several weeks yet, but Carl’s preview covers:

  • Bitmap Indexes
  • TIMESTAMP datatype
  • Plugin Developer Kit
  • JDBC Driver Improvements

Are you interested now? Wondering what else will be included? Could always visit the Apache Hive project to find out. 😉

Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events

Filed under: Cyc,Hadoop,MapReduce — Patrick Durusau @ 7:50 pm

Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events

From the post:

Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools – R, Gephi, and Cloudera’s Distribution Including Apache Hadoop (CDH) – to solve an old problem in a new way.

Background: Adverse Drug Events

An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with one study estimating that 9.7% of adverse drug events lead to permanent disability. Another study showed that each patient who experiences an ADE remains hospitalized for an additional 1-5 days and costs the hospital up to $9,000.

Some adverse drug events are caused by drug interactions, where two or more prescription or over-the-counter (OTC) drugs taken together leads to an unexpected outcome. As the population ages and more patients are treated for multiple health conditions, the risk of ADEs from drug interactions increases. In the United States, roughly 4% of adults older than 55 are at risk for a major drug interaction.

Because clinical trials study a relatively small number of patients, both regulatory agencies and pharmaceutical companies maintain databases in order to track adverse events that occur after drugs have been approved for market. In the United States, the FDA uses the Adverse Event Reporting System (AERS), where healthcare professionals and consumers may report the details of ADEs they experienced. The FDA makes a well-formatted sample of the reports available for download from their website, to the benefit of data scientists everywhere.

Methodology

Identifying ADEs is primarily a signal detection problem: we have a collection of events, where each event has multiple attributes (in this case, the drugs the patient was taking) and multiple outcomes (the adverse reactions that the patient experienced), and we would like to understand how the attributes correlate with the outcomes. One simple technique for analyzing these relationships is a 2×2 contingency table:

For All Drugs/Reactions:

                  Reaction = Rj   Reaction != Rj   Total
  Drug = Di       A               B                A + B
  Drug != Di      C               D                C + D
  Total           A + C           B + D            A + B + C + D

Based on the values in the cells of the tables, we can compute various measures of disproportionality to find drug-reaction pairs that occur more frequently than we would expect if they were independent.

For this project, we analyzed interactions involving multiple drugs, using a generalization of the contingency table method that is described in the paper, “Empirical bayes screening for multi-item associations” by DuMouchel and Pregibon. Their model computes a Multi-Item Gamma-Poisson Shrinkage (MGPS) estimator for each combination of drugs and outcomes, and gives us a statistically sound measure of disproportionality even if we only have a handful of observations for a particular combination of drugs. The MGPS model has been used for a variety of signal detection problems across multiple industries, such as identifying fraudulent phone calls, performing market basket analyses and analyzing defects in automobiles.

Apologies for the long setup:

Solving the Hard Problem with Apache Hadoop

At first glance, it doesn’t seem like we would need anything beyond a laptop to analyze ADEs, since the FDA only receives about one million reports a year. But when we begin to examine these reports, we discover a problem that is similar to what happens when we attempt to teach computers to play chess: a combinatorial explosion in the number of possible drug interactions we must consider. Even restricting ourselves to analyzing pairs of drugs, there are more than 3 trillion potential drug-drug-reaction triples in the AERS dataset, and tens of millions of triples that we actually see in the data. Even including the iterative Expectation Maximization algorithm that we use to fit the MGPS model, the total runtime of our analysis is dominated by the process of counting how often the various interactions occur.

The good news is that MapReduce running on a Hadoop cluster is ideal for this problem. By creating a pipeline of MapReduce jobs to clean, aggregate, and join our data, we can parallelize the counting problem across multiple machines to achieve a linear speedup in our overall runtime. The faster runtime for each individual analysis allows us to iterate rapidly on smaller models and tackle larger problems involving more drug interactions than anyone has ever looked at before.
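
As an aside, the counting step maps naturally onto a single MapReduce pass. A rough sketch in Java, emitting a count of 1 for every (drug pair, reaction) triple in a report and summing in the reducer (the input layout and field positions are invented; the real pipeline described above is more involved):

  import java.io.IOException;
  import java.util.Arrays;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class TripleCounter {
    // Input line (invented layout): reportId TAB drug1,drug2,... TAB reaction1,reaction2,...
    public static class TripleMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String[] drugs = fields[1].split(",");
        String[] reactions = fields[2].split(",");
        Arrays.sort(drugs); // so (A, B) and (B, A) count as the same pair
        for (int i = 0; i < drugs.length; i++) {
          for (int j = i + 1; j < drugs.length; j++) {
            for (String reaction : reactions) {
              context.write(new Text(drugs[i] + "|" + drugs[j] + "|" + reaction), ONE);
            }
          }
        }
      }
    }

    // Sums the observation count for each drug-drug-reaction triple.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) sum += v.get();
        context.write(key, new LongWritable(sum));
      }
    }
  }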

Where have I heard about combinatorial explosions before?

If you think about it, semantic environments (except for artificial ones) are inherently noisy and the signal we are looking for to trigger merging may be hard to find.

Semantic environments like Cyc are noise free, but they are also not the semantic environments in which most data exists and in which we have to make decisions.

Questions: To what extent are “clean” semantic environments artifacts of adapting to the capacities of existing hardware/software? What aspects of the then-current hardware/software would you point to in making that case?

Hadoop World 2011 Presentations

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:48 pm

Hadoop World 2011 Presentations

Slides from Hadoop World 2011 with videos of presentations following as quickly as possible.

A real treasure trove of Hadoop materials.

When the presentations are posted, look for annotated posts on some of them.

In the meantime, enjoy!

Building and Deploying MR2

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:45 pm

Building and Deploying MR2

From the post:

A number of architectural changes have been added to Hadoop MapReduce. The new MapReduce system is called MR2 (AKA MR.next). The first release version to include these changes will be Hadoop 0.23.

A key change in the new architecture is the disappearance of the centralized JobTracker service. Previously, the JobTracker was responsible for provisioning the resources across the whole cluster, in addition to managing the life cycle of all submitted MapReduce applications; this typically included starting, monitoring and retrying the application’s individual tasks. Throughout the years and from a practical perspective, the Hadoop community has acknowledged the problems that inherently exist in this functionally aggregated design (See MAPREDUCE-279).

In MR2, the JobTracker aggregated functionality is separated across two new components:

  1. Central Resource Manager (RM): Management of resources in the cluster.
  2. Application Master (AM): Management of the life cycle of an application and its tasks. Think of the AM as a per-application JobTracker.

The new design enables scaling Hadoop to run on much larger clusters, in addition to the ability to run non-mapreduce applications on the same Hadoop cluster. For more architecture details, the interested reader may refer to the design document at: https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf.

The objective of this blog is to outline the steps for building, configuring, deploying and running a single-node NextGen MR cluster.

…(see the post for the rest)

If you want to get a jump on experience with the next generation of Hadoop, here is a place to start!

Apache Hadoop 0.23 is Here!

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:38 pm

Apache Hadoop 0.23 is Here! by Arun Murthy.

Arun isolates two major improvements:

HDFS Federation

HDFS has undergone a transformation to separate out Namespace management from the Block (storage) management to allow for significant scaling of the filesystem – in the current architecture they are intertwined in the NameNode.

More details are available in the HDFS Federation release documentation or in the recent HDFS Federation talk by Suresh Srinivas, a Hortonworks co-founder at Hadoop World, 2011.

NextGen MapReduce aka YARN

MapReduce has undergone a complete overhaul in hadoop-0.23 with the fundamental change to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. Thus, Hadoop becomes a general purpose data-processing platform where we can support MapReduce and other application execution frameworks such as MPI etc.

More details are available in the YARN release documentation or in the recent YARN presentation by Mahadev Konar, a Hortonworks co-founder at Hadoop World, 2011.

Arun also notes that Hadoop 0.23 is an alpha release so don’t use this in a production environment (unless you are feeling lucky. Are you?)

More details are in the Hadoop World presentation.

So, in addition to a production quality Hadoop ecosystem you are going to need to set up a test Hadoop ecosystem. Well, winter is coming on and a couple more boxes to heat the office won’t be a bad thing. 😉
