Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 30, 2012

HBase Replication Overview

Filed under: Cloudera,HBase — Patrick Durusau @ 3:22 pm

HBase Replication Overview by Himanshu Vashishtha.

From the post:

HBase Replication is a way of copying data from one HBase cluster to a different and possibly distant HBase cluster. It works on the principle that the transactions from the originating cluster are pushed to another cluster. In HBase jargon, the cluster doing the push is called the master, and the one receiving the transactions is called the slave. This push of transactions is done asynchronously, and these transactions are batched in a configurable size (default is 64MB). Asynchronous mode incurs minimal overhead on the master, and shipping edits in a batch increases the overall throughput.
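To make the batching idea concrete, here is a minimal Python sketch. It is a toy model, not HBase's replication code: `EditBatcher`, `ship`, and the tiny 10-byte limit in the usage lines are invented for illustration, and the asynchronous shipping the post describes is left out. Only the 64MB default comes from the post.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Mirrors the 64MB default batch size mentioned in the post.
DEFAULT_BATCH_BYTES = 64 * 1024 * 1024


@dataclass
class EditBatcher:
    """Toy size-bounded batcher (hypothetical; not HBase's implementation)."""
    ship: Callable[[List[bytes]], None]  # called once per completed batch
    max_bytes: int = DEFAULT_BATCH_BYTES
    _pending: List[bytes] = field(default_factory=list)
    _pending_bytes: int = 0

    def add(self, edit: bytes) -> None:
        # Ship what we have first if this edit would push the batch over the limit.
        if self._pending and self._pending_bytes + len(edit) > self.max_bytes:
            self.flush()
        self._pending.append(edit)
        self._pending_bytes += len(edit)

    def flush(self) -> None:
        # Hand the pending edits to the shipper and start a fresh batch.
        if self._pending:
            self.ship(self._pending)
            self._pending = []
            self._pending_bytes = 0


# Usage with an artificially small limit so the batching is visible.
shipped: List[List[bytes]] = []
batcher = EditBatcher(ship=shipped.append, max_bytes=10)
for edit in [b"aaaa", b"bbbb", b"cccc", b"dd"]:
    batcher.add(edit)
batcher.flush()  # ship whatever is left over
```

Fewer, larger shipments are where the throughput gain the post mentions would come from; the trade-off is that edits sit on the master a little longer before the slave sees them.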

This blogpost discusses the possible use cases, underlying architecture and modes of HBase replication as supported in CDH4 (which is based on 0.92). We will discuss Replication configuration, bootstrapping, and fault tolerance in a follow up blogpost.

Use cases

HBase replication supports replicating data across datacenters. This can be used for disaster recovery scenarios, where we can have the slave cluster serve real time traffic in case the master site is down. Since HBase replication is not intended for automatic failover, the act of switching from the master to the slave cluster in order to start serving traffic is done by the user. Afterwards, once the master cluster is up again, one can do a CopyTable job to copy the deltas to the master cluster (by providing the start/stop timestamps) as described in the CopyTable blogpost.
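For the CopyTable step, a small Python sketch of how the start/stop timestamps might be turned into a command line. `--starttime`, `--endtime` and `--peer.adr` are CopyTable options; the table name, the ZooKeeper address and the outage window below are hypothetical, and the helper names are mine. It only builds the argument list, it does not run anything.

```python
from datetime import datetime, timezone
from typing import List


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return int(moment.timestamp() * 1000)


def copytable_command(table: str, start: datetime, end: datetime, peer: str) -> List[str]:
    """Build a CopyTable invocation that copies cells written in [start, end)."""
    if end <= start:
        raise ValueError("end must be later than start")
    return [
        "hbase", "org.apache.hadoop.hbase.mapreduce.CopyTable",
        f"--starttime={to_epoch_millis(start)}",
        f"--endtime={to_epoch_millis(end)}",
        f"--peer.adr={peer}",  # quorum:port:znode of the cluster receiving the deltas
        table,
    ]


# Hypothetical outage window and cluster address, for illustration only.
outage_start = datetime(2012, 7, 30, 0, 0, tzinfo=timezone.utc)
outage_end = datetime(2012, 7, 30, 6, 0, tzinfo=timezone.utc)
cmd = copytable_command(
    "usertable", outage_start, outage_end, "master-zk.example.com:2181:/hbase"
)
print(" ".join(cmd))
```

The window should cover the period during which the slave was taking writes, so that only the deltas go back to the master.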

Another replication use case is when a user wants to run load intensive MapReduce jobs on their HBase cluster; one can do so on the slave cluster while bearing a slight performance decrease on the master cluster.

So there is a non-romantic, sysadmin side to “big data.” I understand: no one ever speaks to a sysadmin unless something has gone wrong with the system. Sysadmins either get no contacts (a good thing) or pages, tweets, emails, phone calls and physical visits from irate users, managers, etc.

This post is a start towards always having the first case, no contacts. Leaves you more time for things that interest sysadmins. I won’t tell if you don’t.

July 26, 2012

Why we build our platform on HDFS

Filed under: Cloudera,Hadoop,HDFS — Patrick Durusau @ 10:16 am

Why we build our platform on HDFS by Charles Zedlewski

Charles Zedlewski pushes the number of Hadoop competitors up to twelve:

It’s not often the case that I have a chance to concur with my colleague E14 over at Hortonworks but his recent blog post gave the perfect opportunity. I wanted to build on a few of E14’s points and add some of my own.

A recent GigaOm article presented 8 alternatives to HDFS. They actually missed at least 4 others. For over a year, Parascale marketed itself as an HDFS alternative (until it became an asset sale to Hitachi). Appistry continues to market its HDFS alternative. I’m not sure if it’s released yet but it is very evident that Symantec’s Veritas unit is proposing its Clustered Filesystem (CFS) as an alternative to HDFS as well. HP Ibrix has also supported the HDFS API for some years now.

The GigaOm article implies that the presence of twelve other vendors promoting alternatives must speak to some deficiencies in HDFS for what else would motivate so many offerings? This really draws the incorrect conclusion. I would ask this:

What can we conclude from the fact that there are:

Best links I have for Hadoop competitors (for your convenience and additions):

  1. Appistry
  2. Cassandra (DataStax)
  3. Ceph (Inktank)
  4. Clustered Filesystem (CFS)
  5. Dispersed Storage Network (Cleversafe)
  6. GPFS (IBM)
  7. Ibrix
  8. Isilon (EMC)
  9. Lustre
  10. MapR File System
  11. NetApp Open Solution for Hadoop
  12. Parascale

July 20, 2012

Cloudera Manager 4.0.3 Released!

Filed under: Cloud Computing,Cloudera — Patrick Durusau @ 4:39 am

Cloudera Manager 4.0.3 Released! by Bala Venkatrao.

From the post:

We are pleased to announce the availability of Cloudera Manager 4.0.3. This is an enhancement release, with several improvements to configurability and usability. Some key enhancements include:

  • Configurable user/group settings for Oozie, HBase, YARN, MapReduce, and HDFS processes.
  • Support new configuration parameters for MapReduce services.
  • Auto configuration of reserved space for non-DFS use parameter for HDFS service.
  • Improved cluster upgrade process.
  • Support for LDAP users/groups that belong to more than one Organization Unit (OU).
  • Flexibility with distribution of key tabs when using existing Kerberos infrastructure (e.g. Active Directory).

Detailed release notes available at:

https://ccp.cloudera.com/display/ENT4DOC/New+Features+in+Cloudera+Manager+4.0

Cloudera Manager 4.0.3 is available to download from:

https://ccp.cloudera.com/display/SUPPORT/Downloads

Something for the weekend!

July 10, 2012

The Hadoop Ecosystem, Visualized in Datameer

Filed under: Cloudera,Datameer,Hadoop,Visualization — Patrick Durusau @ 8:28 am

The Hadoop Ecosystem, Visualized in Datameer by Rich Taylor.

From the post:

In our last post, Christophe explained why Datameer uses D3.js to power our Business Infographic™ designer. I thought I would follow up his post showing how we visualized the Hadoop ecosystem connections. First using only D3.js, and second using Datameer 2.0.

Visualizations of the Hadoop Ecosystem are colorful, amusing, instructive, but probably not useful per se.

What is useful is the demonstration that using Datameer 2.0 can drastically reduce the time required for you to make a visualization.

Which results in you having more time to explore and find visualizations that are useful as opposed to being visualizations for the sake of visualization.

We can all think of network (“hairball” was the technical term used in a paper I read recently) visualizations that would be useful if we were super-boy/girl but otherwise, not so much.

I first saw this at Cloudera.

July 2, 2012

Update on Apache Bigtop (incubating)

Filed under: Bigtop,Cloudera,Hadoop — Patrick Durusau @ 6:32 pm

Update on Apache Bigtop (incubating) by Charles Zedlewski.

If you are curious about Apache Bigtop or how Cloudera manages to distribute stable distributions of the Hadoop ecosystem, this is the post for you.

Just to whet your appetite:

From the post:

Ever since Cloudera decided to contribute the code and resources for what would later become Apache Bigtop (incubating), we’ve been answering a very basic question: what exactly is Bigtop and why should you or anyone in the Apache (or Hadoop) community care? The earliest and the most succinct answer (the one used for the Apache Incubator proposal) simply stated that “Bigtop is a project for the development of packaging and tests of the Hadoop ecosystem”. That was a nice explanation of how Bigtop relates to the rest of the Apache Software Foundation’s (ASF) Hadoop ecosystem projects, yet it doesn’t really help you understand the aspirations of Bigtop.

Building and supporting CDH taught us a great deal about what was required to be able to repeatedly assemble a truly integrated, Apache Hadoop based data management system. The build, testing and packaging cost was considerable, and we regularly observed that different projects made different design choices that made ongoing integration difficult. We also realized that more and more mission critical workload was running on CDH and the customer demand for stability, predictability and compatibility was increasing.

Apache Bigtop was part of our answer to solve these two different problems. Initiate an Apache open source project that focused on creating the testing and integration infrastructure of an Apache-Hadoop based distribution. With it we hoped that:

  1. We could better collaborate within the extended Apache community to contribute to resolving test, integration & compatibility issues across projects
  2. We could create a kind of developer-focused distribution that would be able to release frequently, unencumbered by the enterprise expectations for long-term stability and compatibility.

See the post for details.

PS: The project is picking up speed and looking for developers/contributors.

June 5, 2012

CDH4 and Cloudera Enterprise 4.0 Now Available

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 7:58 pm

CDH4 and Cloudera Enterprise 4.0 Now Available by Charles Zedlewski.

From the post:

I’m very pleased to announce the immediate General Availability of CDH4 and Cloudera Manager 4 (part of the Cloudera Enterprise 4.0 subscription). These releases are an exciting milestone for Cloudera customers, Cloudera users and the open source community as a whole.

Functionality

Both CDH4 and Cloudera Manager 4 are chock full of new features. Many new features will appeal to enterprises looking to move more important workloads onto the Hadoop platform. CDH4 includes high availability for the filesystem, ability to support multiple namespaces, HBase table and column level security, improved performance, HBase replication and greatly improved usability and browser support for the Hue web interface. Cloudera Manager 4 includes multi-cluster and multi-version support, automation for high availability and MapReduce2, multi-namespace support, cluster-wide heatmaps, host monitoring and automated client configurations.

Other features will appeal to developers and ISV’s looking to build applications on top of CDH and / or Cloudera Manager. HBase coprocessors enable the development of new kinds of real-time applications. MapReduce2 opens up Hadoop clusters to new data processing frameworks other than MapReduce. There are new REST API’s both for the Hadoop distributed filesystem and for Cloudera Manager.

Download and install. Which new features do you find the most interesting?

June 4, 2012

Cloudera Manager 3.7.6 released!

Filed under: Cloudera,Hadoop,HDFS,MapReduce — Patrick Durusau @ 4:34 pm

Cloudera Manager 3.7.6 released! by Jon Zuanich.

Jon writes:

We are pleased to announce that Cloudera Manager 3.7.6 is now available! The most notable updates in this release are:

  • Support for multiple Hue service instances
  • Separating RPC queue and processing time metrics for HDFS
  • Performance tuning of the Resource Manager components
  • Several bug fixes and performance improvements

The detailed Cloudera Manager 3.7.6 release notes are available at: https://ccp.cloudera.com/display/ENT/Cloudera+Manager+3.7.x+Release+Notes

Cloudera Manager 3.7.6 is available to download from: https://ccp.cloudera.com/display/SUPPORT/Downloads

Only fair since I mentioned the Cray earlier that I get a post about Cloudera out today as well.

May 17, 2012

Apache HBase 0.94 is now released

Filed under: Cloudera,HBase — Patrick Durusau @ 10:40 am

Apache HBase 0.94 is now released by Himanshu Vashishtha.

Some of the new features:

  • More powerful first aid box: The previous HBck tool did a good job of fixing inconsistencies related to region assignments but lacked some basic features like fixing orphaned regions, region holes, overlapping regions, etc. HBASE-5128: “Uber hbck”, adds these missing features to the first aid box.
  • Simplified Region Sizing: Deciding a region size is always tricky as it varies on a number of dynamic parameters such as data size, cluster size, workload, etc. HBASE-4365: “Heuristic for Region size” adds a heuristic where it increases the split size threshold of a table region as the data grows, thus limiting the number of region splits.
  • Smarter transaction semantics: Though HBase supports single row level transaction, if there are a number of updates (Puts/Deletes) to an individual row, it will lock the row for each of these operations. HBASE-3584: “Atomic Put & Delete in a single transaction” enhances the HBase single row locking semantics by allowing Puts and Deletes on a row to be executed in a single call. This feature is on by default.
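The region sizing heuristic can be sketched in a few lines of Python. This reflects my reading of the 0.94 split policy (split threshold grows with the square of the number of the table's regions on a server, capped at the configured maximum); treat the formula and both default values as assumptions, and check the HBase documentation before relying on them. The function and parameter names are mine.

```python
MB = 1024 * 1024


def split_threshold(
    regions_on_server: int,
    flush_size: int = 128 * MB,         # assumed memstore flush size
    max_file_size: int = 10 * 1024 * MB,  # assumed upper bound on region size
) -> int:
    """Size in bytes at which a region would be split (sketch, see assumptions above)."""
    if regions_on_server <= 0:
        return max_file_size
    # Grows quadratically with the region count, but never past the cap.
    return min(max_file_size, flush_size * regions_on_server ** 2)


# How the threshold grows as a table spreads across more regions on one server.
for count in (1, 2, 3, 9):
    print(count, split_threshold(count) // MB, "MB")
```

A new table splits early and often, which spreads load across the cluster quickly; as the region count climbs, the threshold rises toward the cap and splits become rare.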

BTW, also from the post:

Apache HBase 0.94.0 has been released! This is the first major release since the January 22nd HBase 0.92 release. In the HBase 0.94.0 release the main focuses were on performance enhancements and the addition of new features (Also, several major bug fixes).

Less than four (4) months as I count it between HBase 0.92 and 0.94.

Sounds like a lot of people have been working very hard.

And making serious progress.

May 14, 2012

Cloudera Manager 4.0 Beta released

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 8:49 am

Cloudera Manager 4.0 Beta released by Aparna Ramani

From the post:

We’re happy to announce the Beta release of Cloudera Manager 4.0.

This version of Cloudera Manager includes support for CDH4 Beta2 and several new features for both the Free edition and the Enterprise edition.

This is the last beta before the GA release.

The details are:

I’m pleased to inform our users and customers that we have released the Cloudera’s Distribution Including Apache Hadoop version 4 (CDH4) 2nd and final beta today. We received great feedback from the community from the first beta and this release incorporates that feedback as well as a number of new enhancements.

CDH4 has a great many enhancements compared to CDH3.

  • Availability – a high availability namenode, better job isolation, improved hard disk failure handling, and multi-version support
  • Utilization – multiple namespaces and a slot-less resource management model
  • Performance – improvements in HBase, HDFS, MapReduce, Flume and compression performance
  • Usability – broader BI support, expanded API options, a more responsive Hue with broader browser support
  • Extensibility – HBase co-processors enable developers to create new kinds of real-time big data applications, the new MapReduce resource management model enables developers to run new data processing paradigms on the same cluster resources and storage
  • Security – HBase table & column level security and Zookeeper authentication support

Some items of note about this beta:

This is the second (and final) beta for CDH4, and this version has all of the major component changes that we’ve planned to incorporate before the platform goes GA. The second beta:

  • Incorporates the Apache Flume, Hue, Apache Oozie and Apache Whirr components that did not make the first beta
  • Broadens the platform support back out to our normal release matrix of Red Hat, CentOS, SUSE, Ubuntu and Debian
  • Standardizes our release matrix of supported databases to include MySQL, PostgreSQL and Oracle
  • Includes a number of improvements to existing components like adding auto-failover support to HDFS’s high availability feature and adding multi-homing support to HDFS and MapReduce
  • Incorporates a number of fixes that were identified during the first beta period like removing a HBase performance regression

Not as romantic as your subject analysis activities but someone has to manage the systems that implement your analysis!

Not to mention skills here making you more attractive in any big data context.

April 30, 2012

Why Every NoSQL Deployment Should Be Paired with Hadoop (webinar)

Filed under: BigData,Cloudera,Couchbase,Hadoop,Humor,NoSQL — Patrick Durusau @ 3:18 pm

Why Every NoSQL Deployment Should Be Paired with Hadoop (webinar)

May 9, 2012 at 10am Pacific

From the webinar registration page:

In this webinar you will hear from Dr. Amr Awadallah, Co-Founder and CTO of Cloudera and James Phillips, Co-Founder and Senior VP of Products at Couchbase.

Frequently the terms NoSQL and Big Data are conflated – many view them as synonyms. It’s understandable – both technologies eschew the relational data model and spread data across clusters of servers, versus relational database technology which favors centralized computing. But the “problems” these technologies address are quite different. Hadoop, the Big Data poster child, is focused on data analysis – gleaning insights from large volumes of data. NoSQL databases are transactional systems – delivering high-performance, cost-effective data management for modern real-time web and mobile applications; this is the Big User problem. Of course, if you have a lot of users, you are probably going to generate a lot of data. IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that this number will double every two years. The proliferation of user-generated data from interactive web and mobile applications are key contributors to this growth. In this webinar, we will explore why every NoSQL deployment should be paired with a Big Data analytics solution.

In this session you will learn:

  • Why NoSQL and Big Data are similar, but different
  • The categories of NoSQL systems, and the types of applications for which they are best suited
  • How Couchbase and Cloudera’s Distribution Including Apache Hadoop can be used together to build better applications
  • Explore real-world use cases where NoSQL and Hadoop technologies work in concert

Have you ever wanted to suggest a survey to Gartner or the technology desk at the Wall Street Journal?

Asking c-suite types at Fortune 500 firms the following questions among others:

  • Is there a difference between NoSQL and Big Data?
  • What percentage of software projects failed at your company last year?

Could go a long way to explaining the persistent and high failure rate of software projects.

Catch the webinar. Always the chance you will learn how to communicate with c-suite types. Maybe.

April 25, 2012

Introducing CDH4 Beta 2

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 6:27 pm

Introducing CDH4 Beta 2

Charles Zedlewski writes:

I’m pleased to inform our users and customers that we have released the Cloudera’s Distribution Including Apache Hadoop version 4 (CDH4) 2nd and final beta today. We received great feedback from the community from the first beta and this release incorporates that feedback as well as a number of new enhancements.

CDH4 has a great many enhancements compared to CDH3.

  • Availability – a high availability namenode, better job isolation, improved hard disk failure handling, and multi-version support
  • Utilization – multiple namespaces and a slot-less resource management model
  • Performance – improvements in HBase, HDFS, MapReduce, Flume and compression performance
  • Usability – broader BI support, expanded API options, a more responsive Hue with broader browser support
  • Extensibility – HBase co-processors enable developers to create new kinds of real-time big data applications, the new MapReduce resource management model enables developers to run new data processing paradigms on the same cluster resources and storage
  • Security – HBase table & column level security and Zookeeper authentication support

Some items of note about this beta:

This is the second (and final) beta for CDH4, and this version has all of the major component changes that we’ve planned to incorporate before the platform goes GA. The second beta:

  • Incorporates the Apache Flume, Hue, Apache Oozie and Apache Whirr components that did not make the first beta
  • Broadens the platform support back out to our normal release matrix of Red Hat, CentOS, SUSE, Ubuntu and Debian
  • Standardizes our release matrix of supported databases to include MySQL, PostgreSQL and Oracle
  • Includes a number of improvements to existing components like adding auto-failover support to HDFS’s high availability feature and adding multi-homing support to HDFS and MapReduce
  • Incorporates a number of fixes that were identified during the first beta period like removing a HBase performance regression

Second (and final) beta?

Sounds like time to beat and beat hard on this one.

I suspect feedback will be appreciated!

April 12, 2012

Sqoop Graduation Meetup

Filed under: Cloudera,Sqoop — Patrick Durusau @ 9:23 am

Sqoop Graduation Meetup by Kathleen Ting.

From the post:

Cloudera hosted the Apache Sqoop Meetup last week at Cloudera HQ in Palo Alto. About 20 of the Meetup attendees had not used Sqoop before, but were interested enough to participate in the Meetup on April 4th. We believe this healthy interest in Sqoop will contribute to its wide adoption.

Not only was this Sqoop’s second Meetup but also a celebration for Sqoop’s graduation from the Incubator, cementing its status as a Top-Level Project in Apache Software Foundation. Sqoop’s come a long way since its beginnings three years ago as a contrib module for Apache Hadoop submitted by Aaron Kimball. As a result, it was fitting that Aaron gave the first talk of the night by discussing its history: “Sqoop: The Early Days.” From Aaron, we learned that Sqoop’s original name was “SQLImport” and that it was conceived out of his frustration from the inability to easily query both unstructured and structured data at the same time. (Emphasis added.)

I don’t think the extra 20 people were present because of Sqoop.

Did you see the picture of the cake?

My vote goes for the cake as explanation. Yours? 😉

Congratulations to Sqoop, Sqoop team and community!

Let’s make sure on its first birthday a bigger cake is required!

April 10, 2012

HBase Hackathon at Cloudera

Filed under: Cloudera,HBase — Patrick Durusau @ 6:45 pm

HBase Hackathon at Cloudera by David S. Wang

From the post:

Cloudera will be hosting an Apache HBase hackathon on May 23rd, 2012, the day after HBaseCon 2012. The overall theme of the event will be 0.96 stabilization. If you are in the area for HBaseCon, please come down to our offices in Palo Alto the next day to attend the hackathon. This is a great opportunity to contribute some code towards the project and hang out with other HBasers.

More details are on the hackathon’s Meetup page. Please RSVP so we can better plan lunch, room size, and other logistics for the event. See you there!

If you get the opportunity, attend.

Studies show (American Library Association) that building social relationships that are then continued helps to sustain virtual communities.

Here is your chance to get to know other HBase folks.

March 24, 2012

Cloudera Manager 3.7.4 released! (spurious alerts?)

Filed under: Cloudera,Hadoop — Patrick Durusau @ 7:36 pm

Cloudera Manager 3.7.4 released! by Bala Venkatrao.

From the post:

We are pleased to announce that Cloudera Manager 3.7.4 is now available! The most notable updates in this release are:

  • A fixed memory leak in supervisord
  • Compatibility with a scheduled refresh of CDH3u3
  • Significant improvements to the alerting functionality, and the rate of ‘false positive alerts’
  • Support for several new multi-homing features
  • Updates to the default heap sizes for the management daemons (these have been increased).

The detailed Cloudera Manager 3.7.4 release notes are available at: https://ccp.cloudera.com/display/ENT/Cloudera+Manager+3.7.x+Release+Notes

Cloudera Manager 3.7.4 is available to download from: https://ccp.cloudera.com/display/SUPPORT/Downloads

I admit to being curious (or is that suspicious?) and so when I read ‘false positive alerts’, I had to consult the release notes:

  • Some of the alerting behaviors have changed, including selected default settings. This has streamlined some of the alerting behavior and avoids spurious alerts in certain situations. These changes include:
    • The default alert values have been changed so that summary level alerts are disabled by default, to avoid unnecessary email alerts every time an individual health check alert email is sent.
    • The default behavior for DataNodes and TaskTrackers is now to never emit alerts.
    • The “Job Failure Ratio Thresholds” parameter has been disabled by default. The utility of this test very much depends on how the cluster is used. This parameter and the “Job Failure Ratio Minimum Failing Jobs” parameters can be used to alert when jobs fail.
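To see why the usefulness of such a test depends on how the cluster is used, here is a hypothetical Python sketch of a failure-ratio check. It is not Cloudera Manager's implementation: the function, its parameters, and the way the ratio threshold and the minimum number of failing jobs are combined are all my assumptions.

```python
def should_alert(
    failed_jobs: int,
    total_jobs: int,
    ratio_threshold: float,
    min_failing_jobs: int,
) -> bool:
    """Alert only when enough jobs failed AND they are a large enough share of all jobs."""
    if total_jobs <= 0 or failed_jobs < min_failing_jobs:
        return False
    return failed_jobs / total_jobs >= ratio_threshold


# A quiet development cluster: one failure out of two jobs is a 50% ratio,
# but the minimum-failing-jobs floor keeps it from paging anyone.
quiet = should_alert(failed_jobs=1, total_jobs=2, ratio_threshold=0.25, min_failing_jobs=5)

# A busy production cluster: 30 failures out of 100 jobs clears both bars.
busy = should_alert(failed_jobs=30, total_jobs=100, ratio_threshold=0.25, min_failing_jobs=5)
```

On a cluster that runs only a handful of jobs, a ratio alone fires on nearly every failure, which may be the kind of spurious alert the release notes have in mind.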

So, the alerts in question were not spurious alerts but alerts users of Cloudera Manager could not correctly configure?

Question: Can your Cloudera Manager users correctly configure alerts? (That could be a good Cloudera installation interview question. Use a machine disconnected from your network and the Internet for testing.)

Apache HBase 0.92.1 now available

Filed under: Cloudera,Hadoop,HBase — Patrick Durusau @ 7:35 pm

Apache HBase 0.92.1 now available by Shaneal Manek

From the post:

Apache HBase 0.92.1 is now available. This release is a marked improvement in system correctness, availability, and ease of use. It’s also backwards compatible with 0.92.0 — except for the removal of the rarely-used transform functionality from the REST interface in HBASE-5228.

Apache HBase 0.92.1 is a bug fix release covering 61 issues – including 6 blockers and 6 critical issues, such as:

March 6, 2012

Cloudera Manager | Activity Monitoring & Operational Reports Demo Video

Filed under: Cloud Computing,Cloudera,Hadoop — Patrick Durusau @ 8:10 pm

Cloudera Manager | Activity Monitoring & Operational Reports Demo Video by Jon Zuanich.

From the post:

In this demo video, Philip Zeyliger, a software engineer at Cloudera, discusses the Activity Monitoring and Operational Reports in Cloudera Manager.

Activity Monitoring

The Activity Monitoring feature in Cloudera Manager consolidates all Hadoop cluster activities into a single, real-time view. This capability lets you see who is running what activities on the Hadoop cluster, both at the current time and through historical activity views. Activities are either individual MapReduce jobs or those that are part of larger workflows (via Oozie, Hive or Pig).

Operational Reports

Operational Reports provide a visualization of current and historical disk utilization by user, user groups and directory. In addition, it tracks MapReduce activity on the Hadoop cluster by job, user, group or job ID. These reports are aggregated over selected time periods (hourly, daily, weekly, etc.) and can be exported as XLS or CSV files.

It is a sign of Hadoop’s maturity that professional management interfaces have started to appear.

Hadoop has always been manageable. The question was how to find someone to marry your cluster? And what happened in the case of a divorce?

Professional management tools enable a less intimate relationship between your cluster and its managers. Not to mention the availability of a larger pool of managers for your cluster.

One request, please avoid the default security options on vimeo videos. They should be embeddable and downloadable in all cases.

March 2, 2012

Indexing Files via Solr and Java MapReduce

Filed under: Cloudera,Indexing,MapReduce,Solr — Patrick Durusau @ 8:04 pm

Indexing Files via Solr and Java MapReduce by Adam Smieszny.

From the post:

Several weeks ago, I set about to demonstrate the ease with which Solr and Map/Reduce can be integrated. I was unable to find a simple, yet comprehensive, primer on integrating the two technologies. So I set about to write one.

What follows is my bare-bones tutorial on getting Solr up and running to index each word of the complete works of Shakespeare. Note: Special thanks to Sematext for looking over the Solr bits and making sure they are sane. Check them out if you’re going to be doing a lot of work with Solr, ElasticSearch, or search in general and want to bring in the experts.

Looks like a nice weekend (if you are married, long night if not) project!

If you have the time, look over this post and report back on your experiences.

Particularly if you learn something new or see something others need to know about (such as other resources).

February 23, 2012

Cloudera Manager | Log Management, Event Management and Alerting Demo Video

Filed under: Cloudera — Patrick Durusau @ 4:52 pm

Cloudera Manager | Log Management, Event Management and Alerting Demo Video by Jon Zuanich

From the post:

In this demo, Henry Robinson, a software engineer at Cloudera, discusses the Log Management, Event Management and Alerting features in Cloudera Manager that help make sense out of all the discrete events that take place across the Hadoop cluster. He demonstrates how to search the logs for valuable information, note important events that pertain to system health and create alerts to warn you when things go wrong.

As a once upon a time sysadmin, I know making a system transparent for users is the result of a lot of unseen work.

I think I have fewer than 50 (fifty) nodes, ;-), so I will have to take this for a spin. I need to find some cheap commodity boxes so I can set up a 3 or 4 node test bed. I can try it on a single node system just to get my feet wet.

If you are using Cloudera Manager or try it out, point me to a blog with your comments.

February 14, 2012

Cloudera Manager | Service and Configuration Management Demo Videos

Filed under: Cloudera,Hadoop,HBase,HDFS,MapReduce — Patrick Durusau @ 5:11 pm

Cloudera Manager | Service and Configuration Management Demo Videos by Jon Zuanich.

From the post:

Service and Configuration Management (Part I & II)

We’ve recently recorded a series of demo videos intended to highlight the extensive set of features and functions included with Cloudera Manager, the industry’s first end-to-end management application for Apache Hadoop. These demo videos showcase the newly enhanced Cloudera Manager interface and reveal how to use this powerful application to simplify the administration of Hadoop clusters, optimize performance and enhance the quality of service.

In the first two videos of this series, Philip Langdale, a software engineer at Cloudera, walks through Cloudera Manager’s Service and Configuration Management module. He demonstrates how simple it is to set up and configure the full range of Hadoop services in CDH (including HDFS, MR and HBase); enable security; perform configuration rollbacks; and add, delete and decommission nodes.

Interesting that Vimeo detects the “embedding” of these videos in my RSS reader and displays a blocked message. At the Cloudera site, all is well.

Management may not be as romantic as the latest graph algorithms but it is a pre-condition to widespread enterprise adoption.

Introducing CDH4

Filed under: Cloudera,Hadoop,HBase,HDFS,MapReduce — Patrick Durusau @ 5:10 pm

Introducing CDH4 by Charles Zedlewski.

From the post:

I’m pleased to inform our users and customers that Cloudera has released its 4th version of Cloudera’s Distribution Including Apache Hadoop (CDH) into beta today. This release combines the input from our enterprise customers, partners and users with the hard work of Cloudera engineering and the larger Apache open source community to create what we believe is a compelling advance for this widely adopted platform.

There are a great many improvements and new capabilities in CDH4 compared to CDH3. Here is a high level list of what’s available for you to test in this first beta release:

  • Availability – a high availability namenode, better job isolation, hard drive failure handling, and multi-version support
  • Utilization – multiple namespaces, co-processors and a slot-less resource management model
  • Performance – improvements in HBase, HDFS, MapReduce and compression performance
  • Usability – broader BI support, expanded API access, unified file formats & compression codecs
  • Security – scheduler ACL’s

Some items of note about this beta:

This is the first beta for CDH4. We plan to do a second beta some weeks after the first beta. The second beta will roll in updates to Apache Flume, Apache Sqoop, Hue, Apache Oozie and Apache Whirr that did not make the first beta. It will also broaden the platform support back out to our normal release matrix of Red Hat, Centos, Suse, Ubuntu and Debian. Our plan is for this second beta to have the last significant component changes before CDH goes GA.

Some CDH components are getting substantial revamps and we have transition plans for these. There is a significantly redesigned MapReduce (aka MR2) with a similar API to the old MapReduce but with new daemons, user interface and more. MR2 is part of CDH4, but we also decided it makes sense to ship with the MapReduce from CDH3 which is widely used, thoroughly debugged and stable. We will support both generations of MapReduce for the life of CDH4, which will allow customers and users to take advantage of all of the new CDH4 features while making the transition to the new MapReduce in a timeframe that makes sense for them.

The only better time to be in data mining, information retrieval, or data analysis is next week. 😉

January 18, 2012

Hadoop World 2011 Videos and Slides Available

Filed under: Cloudera,Conferences,Hadoop — Patrick Durusau @ 7:51 pm

Hadoop World 2011 Videos and Slides Available

From the post:

Last November in New York City, Hadoop World, the largest conference of Apache Hadoop practitioners, developers, business executives, industry luminaries and innovative companies took place. The enthusiasm for the possibilities in Big Data management and analytics with Hadoop was palpable across the conference. Cloudera CEO, Mike Olson, eloquently summarizes Hadoop World 2011 in these final remarks.

Those who attended Hadoop World know how difficult navigating a route between two days of five parallel tracks of compelling content can be—particularly since Hadoop World 2011 consisted of sixty-five informative sessions about Hadoop. Understanding that it is nearly impossible to obtain and/or retain all the valuable information shared live at the event, we have compiled all the Hadoop World presentation slides and videos for perusing, sharing and for reference at your convenience. You can turn to these resources for technical Hadoop help and real-world production Hadoop examples, as well as information about advanced data science analytics.

Comments if you attended or suggestions of which ones to watch first?

January 10, 2012

Oracle: “Open Source isn’t all that weird” (Cloudera)

Filed under: Cloudera,Hadoop,Oracle — Patrick Durusau @ 8:12 pm

OK, maybe that’s not an exact word-for-word quotation. 😉

Oracle selects CDH and Cloudera Manager as the Apache Hadoop Platform for the Oracle Big Data Appliance

Ed Albanese, who leads business development for Cloudera and is responsible for identifying new markets, revenue opportunities and strategic alliances for the company, writes:

Summary: Oracle has selected Cloudera’s Distribution Including Apache Hadoop (CDH) and Cloudera Manager software as core technologies on the Oracle Big Data Appliance, a high performance “engineered system.” Oracle and Cloudera announced a multiyear agreement to provide CDH, Cloudera Manager, and support services in conjunction with Oracle Support for use on the Oracle Big Data Appliance.

Announced at Oracle Open World in October 2011, the Big Data Appliance was received with significant market interest. Oracle reported then that it would be released in the first half of 2012. Just 10 days into that period, Oracle has announced that the Big Data Appliance is available immediately.

The product itself is noteworthy. Oracle has combined Oracle hardware and software innovations with Cloudera technology to deliver what it calls an “engineered system.” Oracle has created several such systems over the past few years, including the Exadata, Exalogic, and Exalytics products. The Big Data Appliance combines Apache Hadoop with a purpose-built hardware platform and software that includes platform components such as Linux and Java, as well as data management technologies such as the Oracle NoSql database and Oracle integration software.

Read the post to get Ed’s take, a positive one, on what this will mean for both Cloudera and Oracle customers.

I’m glad for Cloudera but also take this as validation of the overall Hadoop ecosystem. Hadoop is not appropriate for every application, but where it is, it deserves serious consideration.
