Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 7, 2012

Reducing Software Highway Friction

Filed under: Hadoop,Lucene,LucidWorks,Solr — Patrick Durusau @ 2:20 pm

Lucid Imagination Search Product Offered in Windows Azure Marketplace

From the post:

Ease of use and flexibility are two key business drivers that are fueling the rapid adoption of cloud computing. The ability to disconnect an application from its supporting architecture provides a new level of business agility that has never before been possible. To ease the move towards this new realm of computing, integrated platforms have begun to emerge that make cloud computing easier to adopt and leverage.

Lucid Imagination, a trusted name in Search, Discovery and Analytics, today announced that its LucidWorks Cloud product has been selected by Microsoft Corp. to be offered as a Search-as-a-Service product in Microsoft’s Windows Azure Marketplace. LucidWorks Cloud is a full cloud service version of its LucidWorks Enterprise platform. LucidWorks Cloud delivers full open source Apache Lucene/Solr community innovation with support and maintenance from the world’s leading experts in open source search. An extensible platform architected for developers, LucidWorks Cloud is the only Solr distribution that provides security, abstraction and pre-built connectors for essential enterprise data sources – along with dramatic ease of use advantages in a well-tested, integrated and documented package.

Example use cases for LucidWorks Cloud include Search-as-a-Service for websites, embedding search into SaaS product offerings, and prototyping and developing cloud-based search-enabled applications in general.

…..

Highlights of LucidWorks Cloud Search-as-a-Service

  • Sign up for a plan and start building your search application in minutes
  • Well-organized UI makes Apache Lucene/Solr innovation easier to consume and more adaptable to constant change
  • Create multiple search collections and manage them independently
  • Configure index and query settings, fields, stop words, synonyms for each collection
  • Built-in support for Hadoop, Microsoft SharePoint and traditional online content types
  • An open connector framework is available to customize access to other data sources
  • REST API automates and integrates search as a service with an application
  • Well-instrumented dashboard for infrastructure administration, monitoring and reporting
  • Monitored 24×7 by Lucid Development Operations, ensuring minimum downtime

Source: PR Newswire (http://s.tt/1dzre)

I find this deeply encouraging.

It is a step towards a diverse but reduced friction software highway.

The user community is not well served by uniform models for data, software or UIs.

The user community can be well served by a reduced friction software highway as they move data from application to application.

Microsoft has taken a large step towards a reduced friction software highway today. And it is appreciated!

June 5, 2012

CDH4 and Cloudera Enterprise 4.0 Now Available

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 7:58 pm

CDH4 and Cloudera Enterprise 4.0 Now Available by Charles Zedlewski.

From the post:

I’m very pleased to announce the immediate General Availability of CDH4 and Cloudera Manager 4 (part of the Cloudera Enterprise 4.0 subscription). These releases are an exciting milestone for Cloudera customers, Cloudera users and the open source community as a whole.

Functionality

Both CDH4 and Cloudera Manager 4 are chock full of new features. Many new features will appeal to enterprises looking to move more important workloads onto the Hadoop platform. CDH4 includes high availability for the filesystem, ability to support multiple namespaces, HBase table and column level security, improved performance, HBase replication and greatly improved usability and browser support for the Hue web interface. Cloudera Manager 4 includes multi-cluster and multi-version support, automation for high availability and MapReduce2, multi-namespace support, cluster-wide heatmaps, host monitoring and automated client configurations.

Other features will appeal to developers and ISVs looking to build applications on top of CDH and/or Cloudera Manager. HBase coprocessors enable the development of new kinds of real-time applications. MapReduce2 opens up Hadoop clusters to new data processing frameworks other than MapReduce. There are new REST APIs both for the Hadoop distributed filesystem and for Cloudera Manager.

Download and install. What new features do you find the most interesting?

June 4, 2012

Cloudera Manager 3.7.6 released!

Filed under: Cloudera,Hadoop,HDFS,MapReduce — Patrick Durusau @ 4:34 pm

Cloudera Manager 3.7.6 released! by Jon Zuanich.

Jon writes:

We are pleased to announce that Cloudera Manager 3.7.6 is now available! The most notable updates in this release are:

  • Support for multiple Hue service instances
  • Separating RPC queue and processing time metrics for HDFS
  • Performance tuning of the Resource Manager components
  • Several bug fixes and performance improvements

The detailed Cloudera Manager 3.7.6 release notes are available at: https://ccp.cloudera.com/display/ENT/Cloudera+Manager+3.7.x+Release+Notes

Cloudera Manager 3.7.6 is available to download from: https://ccp.cloudera.com/display/SUPPORT/Downloads

It's only fair, since I mentioned the Cray earlier, that I get a post about Cloudera out today as well.

May 27, 2012

Facebook-class social network analysis with R and Hadoop

Filed under: Graphs,Hadoop,R — Patrick Durusau @ 7:03 pm

Facebook-class social network analysis with R and Hadoop

From the post:

In computing, social networks are traditionally represented as graphs: a collection of nodes (people), pairs of which may be connected by edges (friend relationships). Visually, the social networks can then be represented like this:

[graphic omitted]

Social network analysis often amounts to calculating the statistics on a graph like this: the number of edges (friends) connected to a particular node (person), and the distribution of the number of edges connected to nodes across the entire graph. When the graph consists of up to 10 billion elements (nodes and edges), such computations can be done on a single server with dedicated graph software like Neo4j. But bigger networks — like Facebook’s social network, which is a graph with more than 60 billion elements — require a distributed solution.

Pointer to a Marko A. Rodriguez post that describes how to use R and Hadoop on networks at scale.

Worth your time.
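
To make the degree computation concrete, here is a minimal sketch (my own illustration, not code from the post) of counting each node's degree with plain Hadoop MapReduce, assuming the graph arrives as a text edge list with one tab-separated "source target" pair per line:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class DegreeCount {

    // Emit each endpoint of an edge with a count of 1.
    public static class EdgeMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text node = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] endpoints = value.toString().split("\t");
        if (endpoints.length != 2) {
          return; // skip malformed lines
        }
        for (String endpoint : endpoints) {
          node.set(endpoint);
          context.write(node, ONE);
        }
      }
    }

    // Sum the 1s per node to get its degree.
    public static class DegreeReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text node, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int degree = 0;
        for (IntWritable count : counts) {
          degree += count.get();
        }
        context.write(node, new IntWritable(degree));
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "degree-count");
      job.setJarByClass(DegreeCount.class);
      job.setMapperClass(EdgeMapper.class);
      job.setCombinerClass(DegreeReducer.class);
      job.setReducerClass(DegreeReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

A second job over this output, keyed on degree, gives the degree distribution; the R-plus-Hadoop approach the post points to expresses the same idea from R.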

The Search Is Over: Integrating Solr and Hadoop to Simplify Big Data Analytics

Filed under: Hadoop,MapR,Solr — Patrick Durusau @ 4:27 pm

The Search Is Over: Integrating Solr and Hadoop to Simplify Big Data Analytics

From MapR Technologies.

Show of hands. How many of you can name the solution found in these slides?

😉

Slides are great for entertainment.

Solutions require more, a great deal more.

For the “more” on MapR, see: Download Hadoop Software Datasheets, Product Documentation, White Papers

May 25, 2012

Apache Hadoop 2.0 (Alpha) Released

Filed under: Hadoop,HDFS,MapReduce — Patrick Durusau @ 6:15 pm

Apache Hadoop 2.0 (Alpha) Released by Arun Murthy.

From the post:

As the release manager for the Apache Hadoop 2.0 release, it gives me great pleasure to share that the Apache Hadoop community has just released Apache Hadoop 2.0.0 (alpha)! While only an alpha release (read: not ready to run in production), it is still an important step forward as it represents the very first release that delivers new and important capabilities, including:

In addition to these new capabilities, there are several planned enhancements that are on the way from the community, including HDFS Snapshots and auto-failover for HA NameNode, along with further improvements to the stability and performance with the next generation of MapReduce (YARN). There are definitely good times ahead.

Let the good times roll!

The Data Lifecycle, Part One: Avroizing the Enron Emails

Filed under: Avro,Data Source,Hadoop — Patrick Durusau @ 4:41 am

The Data Lifecycle, Part One: Avroizing the Enron Emails by Russell Jurney.

From the post:

Series Introduction

This is part one of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data. In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

The Berkeley Enron Emails

In this project we will convert a MySQL database of Enron emails into Avro document format for analysis on Hadoop with Pig. Complete code for this example is available here on github.

Email is a rich source of information for analysis by many means. During the investigation of the Enron scandal of 2001, 517,431 messages from 114 inboxes of key Enron executives were collected. These emails were published and have become a common dataset for academics to analyze document collections and social networks. Andrew Fiore and Jeff Heer at UC Berkeley have cleaned this email set and provided it as a MySQL archive.

We hope that this dataset can become a sort of common set for examples and questions, as anonymizing one’s own data in public forums can make asking questions and getting quality answers tricky and time consuming.

More information about the Enron Emails is available:

Covering the data lifecycle in any detail is a rare event.

To do so with a meaningful data set is even rarer.

You will get the maximum benefit from this series by “playing along” and posting your comments and observations.
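
If you want to play along before the next installment, here is a minimal sketch of what writing one record to an Avro data file looks like with the Avro Java API. The schema fields below are invented for illustration; the series' actual Enron schema will differ:

  import java.io.File;

  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;

  public class AvroizeEmail {
    public static void main(String[] args) throws Exception {
      // Hypothetical schema -- the real Enron schema in the series has more fields.
      Schema schema = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Email\",\"fields\":["
          + "{\"name\":\"message_id\",\"type\":\"string\"},"
          + "{\"name\":\"from\",\"type\":\"string\"},"
          + "{\"name\":\"subject\",\"type\":\"string\"},"
          + "{\"name\":\"body\",\"type\":\"string\"}]}");

      GenericRecord email = new GenericData.Record(schema);
      email.put("message_id", "12345");
      email.put("from", "someone@enron.com");
      email.put("subject", "Re: quarterly numbers");
      email.put("body", "...");

      // Write an Avro data file that Pig or MapReduce can read back later.
      DataFileWriter<GenericRecord> writer =
          new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
      writer.create(schema, new File("emails.avro"));
      writer.append(email);
      writer.close();
    }
  }

Pig can then load the resulting file with an Avro loader (such as the piggybank's AvroStorage) and the rest of the series' data-flow picks up from there.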

May 19, 2012

Popescu by > 1100 Words

Filed under: Hadoop — Patrick Durusau @ 6:49 pm

Possible Hadoop Trajectories

In the red corner, using 1245 words to trash Hadoop, are Michael Stonebraker and Jeremy Kepner.

In the blue corner, using 82 words to show the challengers need to follow Hadoop more closely, is Alex Popescu.

Sufficient ignorance can make any technology indistinguishable from hype.

Apache HCatalog 0.4.0 Released

Filed under: Hadoop,HCatalog — Patrick Durusau @ 5:02 pm

Apache HCatalog 0.4.0 Released by Alan Gates.

From the post:

In case you didn’t see the news, I wanted to share the announcement that HCatalog 0.4.0 is now available.

For those of you that are new to the project, HCatalog provides a metadata and table management system that simplifies data sharing between Apache Hadoop and other enterprise data systems. You can learn more about the project on the Apache project site.

From the HCatalog documentation (0.4.0):

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, or sequence files.

HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, HCatalog supports RCFile, CSV, JSON, and sequence file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.

Being curious about a reference to partitions having the capacity to be multidimensional, I set off looking for information on supported data types and found:

The table shows how Pig will interpret each HCatalog data type:

  HCatalog Data Type                                            Pig Data Type
  primitives (int, long, float, double, string)                 int, long, float, double; string to chararray
  map (key type should be string, value type must be string)    map
  List<any type>                                                 bag
  struct<any type fields>                                        tuple

The Hadoop ecosystem is evolving at a fast and furious pace!

May 14, 2012

Cloudera Manager 4.0 Beta released

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 8:49 am

Cloudera Manager 4.0 Beta released by Aparna Ramani.

From the post:

We’re happy to announce the Beta release of Cloudera Manager 4.0.

This version of Cloudera Manager includes support for CDH4 Beta2 and several new features for both the Free edition and the Enterprise edition.

This is the last beta before the GA release.

The details are:

I’m pleased to inform our users and customers that we have released the Cloudera’s Distribution Including Apache Hadoop version 4 (CDH4) 2nd and final beta today. We received great feedback from the community from the first beta and this release incorporates that feedback as well as a number of new enhancements.

CDH4 has a great many enhancements compared to CDH3.

  • Availability – a high availability namenode, better job isolation, improved hard disk failure handling, and multi-version support
  • Utilization – multiple namespaces and a slot-less resource management model
  • Performance – improvements in HBase, HDFS, MapReduce, Flume and compression performance
  • Usability – broader BI support, expanded API options, a more responsive Hue with broader browser support
  • Extensibility – HBase co-processors enable developers to create new kinds of real-time big data applications, the new MapReduce resource management model enables developers to run new data processing paradigms on the same cluster resources and storage
  • Security – HBase table & column level security and Zookeeper authentication support

Some items of note about this beta:

This is the second (and final) beta for CDH4, and this version has all of the major component changes that we’ve planned to incorporate before the platform goes GA. The second beta:

  • Incorporates the Apache Flume, Hue, Apache Oozie and Apache Whirr components that did not make the first beta
  • Broadens the platform support back out to our normal release matrix of Red Hat, CentOS, SUSE, Ubuntu and Debian
  • Standardizes our release matrix of supported databases to include MySQL, PostgreSQL and Oracle
  • Includes a number of improvements to existing components like adding auto-failover support to HDFS’s high availability feature and adding multi-homing support to HDFS and MapReduce
  • Incorporates a number of fixes that were identified during the first beta period like removing a HBase performance regression

Not as romantic as your subject analysis activities, but someone has to manage the systems that implement your analysis!

Not to mention that skills like these make you more attractive in any big data context.

May 12, 2012

CDH3 update 4 is now available

Filed under: Flume,Hadoop,HBase,MapReduce — Patrick Durusau @ 3:24 pm

CDH3 update 4 is now available by David Wang.

From the post:

We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements.

First, there have been a few notable HBase updates. In this release, we’ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some of the recent hbck changes were incorporated to better detect and handle various types of corruptions. Lastly, HDFS append support is now disabled by default in this release as it is no longer needed for HBase. Please see the CDH3 Known Issues and Workarounds page for details.

In addition to the HBase updates, CDH3 update 4 also includes the latest release of Apache Flume (incubating) – version 1.1.0. A detailed description of what it brings to the table is found in a previous Cloudera blog post describing its architecture. Please note that we will continue to ship Flume 0.9.4 as well.

May 10, 2012

Learn Hadoop and get a paper published

Filed under: Common Crawl,Hadoop,MapReduce — Patrick Durusau @ 6:47 pm

Learn Hadoop and get a paper published by Allison Domicone.

From the post:

We’re looking for students who want to try out the Hadoop platform and get a technical report published.

(If you’re looking for inspiration, we have some paper ideas below. Keep reading.)

Hadoop’s version of MapReduce will undoubtedly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data – about 5 billion web pages – and documentation to help you learn these tools.

So why not knock out a quick technical report on Hadoop and Common Crawl? Every grad student could use an extra item in the Publications section of his or her CV.

As an added bonus, you would be helping us out. We’re trying to encourage researchers to use the Common Crawl corpus. Your technical report could inspire others and provide citable papers for them to reference.

Leave a comment now if you’re interested! Then once you’ve talked with your advisor, follow up to your comment, and we’ll be available to help point you in the right direction technically.

How very cool!

Hurry, there are nineteen (19) comments already!

May 5, 2012

Big Data Analytics with R and Hadoop

Filed under: BigData,Hadoop,R — Patrick Durusau @ 6:56 pm

Big Data Analytics with R and Hadoop by David Smith.

From the post:

The open-source RHadoop project makes it easier to extract data from Hadoop for analysis with R, and to run R within the nodes of the Hadoop cluster — essentially, to transform Hadoop into a massively-parallel statistical computing cluster based on R. In yesterday’s webinar (the replay of which is embedded below), Data scientist and RHadoop project lead Antonio Piccolboni introduced Hadoop and explained how to write map-reduce statements in the R language to drive the Hadoop cluster.

Something to catch up on over the weekend.

BTW, do you know the difference between “massively-parallel” and “parallel?” I would think the “Connection Machine” was “massively-parallel” for its time, but that was really specialized hardware. Does “massively” mean anything now or is it just a holdover/marketing term?

May 2, 2012

Apache MRUnit 0.9.0-incubating has been released!

Filed under: Hadoop,MapReduce,MRUnit — Patrick Durusau @ 10:17 am

Apache MRUnit 0.9.0-incubating has been released! by Brock Noland.

The post reads in part:

We (the Apache MRUnit team) have just released Apache MRUnit 0.9.0-incubating (tarball, nexus, javadoc). Apache MRUnit is an Apache Incubator project that is a Java library which helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they’re deployed to a production system.

The MRUnit project is quite active; 0.9.0 is our fourth release since entering the incubator and we have added 4 new committers beyond the project’s initial charter! We are very interested in having new contributors and committers join the project! Please join our mailing list to find out how you can help!

The MRUnit build process has changed to produce mrunit-0.9.0-hadoop1.jar and mrunit-0.9.0-hadoop2.jar instead of mrunit-0.9.0-hadoop020.jar, mrunit-0.9.0-hadoop100.jar and mrunit-0.9.0-hadoop023.jar. The hadoop1 classifier is for all Apache Hadoop versions based off the 0.20.X line including 1.0.X. The hadoop2 classifier is for all Apache Hadoop versions based off the 0.23.X line including the unreleased 2.0.X.
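
To make the “small amount of code” point concrete, here is a minimal sketch of a map-side MRUnit test, assuming the new (org.apache.hadoop.mapreduce) API and therefore the hadoop1/hadoop2 artifacts mentioned above. The mapper is a made-up token counter, not code from the release:

  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mrunit.mapreduce.MapDriver;
  import org.junit.Before;
  import org.junit.Test;

  public class TokenCountMapperTest {

    // The mapper under test: emits (token, 1) for each whitespace-separated token.
    public static class TokenCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            context.write(new Text(token), ONE);
          }
        }
      }
    }

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
      mapDriver = MapDriver.newMapDriver(new TokenCountMapper());
    }

    @Test
    public void emitsOneCountPerToken() throws IOException {
      mapDriver
          .withInput(new LongWritable(0), new Text("hadoop mrunit"))
          .withOutput(new Text("hadoop"), new IntWritable(1))
          .withOutput(new Text("mrunit"), new IntWritable(1))
          .runTest();
    }
  }

ReduceDriver and MapReduceDriver work the same way for the reduce side and the full map-shuffle-reduce pipeline.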

I have been reading about JUnit recently, in part just to learn more about software testing, but also wondering what it would look like to test the semantics of integration. Or with opaque mappings is that even a meaningful question? Or is the lack of meaning to that question a warning sign?

Perhaps there is no “test” for the semantics of integration. You can specify integration of data and the results may be useful or not, meaningful (in some context) or not, but the question isn’t one of testing. The question is: Are these the semantics you want for integration?

Data has no semantics until someone “looks” at it, so the semantics of a proposed integration have to be specified and the client/user says yes or no.

Sorry, I digressed. I commend MRUnit to your attention.

April 30, 2012

Why Every NoSQL Deployment Should Be Paired with Hadoop (webinar)

Filed under: BigData,Cloudera,Couchbase,Hadoop,Humor,NoSQL — Patrick Durusau @ 3:18 pm

Why Every NoSQL Deployment Should Be Paired with Hadoop (webinar)

May 9, 2012 at 10am Pacific

From the webinar registration page:

In this webinar you will hear from Dr. Amr Awadallah, Co-Founder and CTO of Cloudera and James Phillips, Co-Founder and Senior VP of Products at Couchbase.

Frequently the terms NoSQL and Big Data are conflated – many view them as synonyms. It’s understandable – both technologies eschew the relational data model and spread data across clusters of servers, versus relational database technology which favors centralized computing. But the “problems” these technologies address are quite different. Hadoop, the Big Data poster child, is focused on data analysis – gleaning insights from large volumes of data. NoSQL databases are transactional systems – delivering high-performance, cost-effective data management for modern real-time web and mobile applications; this is the Big User problem. Of course, if you have a lot of users, you are probably going to generate a lot of data. IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that this number will double every two years. The proliferation of user-generated data from interactive web and mobile applications is a key contributor to this growth. In this webinar, we will explore why every NoSQL deployment should be paired with a Big Data analytics solution.

In this session you will learn:

  • Why NoSQL and Big Data are similar, but different
  • The categories of NoSQL systems, and the types of applications for which they are best suited
  • How Couchbase and Cloudera’s Distribution Including Apache Hadoop can be used together to build better applications
  • Explore real-world use cases where NoSQL and Hadoop technologies work in concert

Have you ever wanted to suggest a survey to Gartner or the technology desk at the Wall Street Journal?

Asking c-suite types at Fortune 500 firms the following questions, among others:

  • Is there a difference between NoSQL and Big Data?
  • What percentage of software projects failed at your company last year?

Could go a long way to explaining the persistent and high failure rate of software projects.

Catch the webinar. There is always the chance you will learn how to communicate with c-suite types. Maybe.

April 25, 2012

Introducing CDH4 Beta 2

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 6:27 pm

Introducing CDH4 Beta 2

Charles Zedlewski writes:

I’m pleased to inform our users and customers that we have released the Cloudera’s Distribution Including Apache Hadoop version 4 (CDH4) 2nd and final beta today. We received great feedback from the community from the first beta and this release incorporates that feedback as well as a number of new enhancements.

CDH4 has a great many enhancements compared to CDH3.

  • Availability – a high availability namenode, better job isolation, improved hard disk failure handling, and multi-version support
  • Utilization – multiple namespaces and a slot-less resource management model
  • Performance – improvements in HBase, HDFS, MapReduce, Flume and compression performance
  • Usability – broader BI support, expanded API options, a more responsive Hue with broader browser support
  • Extensibility – HBase co-processors enable developers to create new kinds of real-time big data applications, the new MapReduce resource management model enables developers to run new data processing paradigms on the same cluster resources and storage
  • Security – HBase table & column level security and Zookeeper authentication support

Some items of note about this beta:

This is the second (and final) beta for CDH4, and this version has all of the major component changes that we’ve planned to incorporate before the platform goes GA. The second beta:

  • Incorporates the Apache Flume, Hue, Apache Oozie and Apache Whirr components that did not make the first beta
  • Broadens the platform support back out to our normal release matrix of Red Hat, CentOS, SUSE, Ubuntu and Debian
  • Standardizes our release matrix of supported databases to include MySQL, PostgreSQL and Oracle
  • Includes a number of improvements to existing components like adding auto-failover support to HDFS’s high availability feature and adding multi-homing support to HDFS and MapReduce
  • Incorporates a number of fixes that were identified during the first beta period like removing a HBase performance regression

Second (and final) beta?

Sounds like time to beat and beat hard on this one.

I suspect feedback will be appreciated!

April 15, 2012

Constructing Case-Control Studies With Hadoop

Filed under: Bioinformatics,Biomedical,Giraph,Hadoop,Medical Informatics — Patrick Durusau @ 7:13 pm

Constructing Case-Control Studies With Hadoop by Josh Wills.

From the post:

San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: epidemiologists.

Case-Control Studies

A case-control study is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the ‘cases’) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the ‘controls’). They are useful for exploratory analysis because they are relatively cheap to perform, and have led to many important discoveries, most famously the link between smoking and lung cancer.

Epidemiologists and other researchers now have access to data sets that contain tens of millions of anonymized patient records. Tens of thousands of these patient records may include a particular disease that a researcher would like to analyze. In order to find enough unique control subjects for each case subject, a researcher may need to execute tens of thousands of queries against a database of patient records, and I have spoken to researchers who spend days performing this laborious task. Although they would like to parallelize these queries across multiple machines, there is a constraint that makes this problem a bit more interesting: each control subject may only be matched with at most one case subject. If we parallelize the queries across the case subjects, we need to check to be sure that we didn’t assign a control subject to multiple cases. If we parallelize the queries across the control subjects, we need to be sure that each case subject ends up with a sufficient number of control subjects. In either case, we still need to query the data an arbitrary number of times to ensure that the matching of cases and controls we come up with is feasible, let alone optimal.

Analyzing a case-control study is a problem for a statistician. Constructing a case-control study is a problem for a data scientist.

Great walk through on constructing a case-control study, including the use of the Apache Giraph library.
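
As a toy illustration of the constraint that makes this problem interesting (each control may be used at most once), here is a single-machine greedy sketch; it is my own simplification, not the Giraph-based matching the post actually builds:

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.HashSet;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  public class GreedyCaseControlMatcher {

    /**
     * Assign up to controlsPerCase controls to each case, never reusing a control.
     * candidates maps a case id to the controls that match it (age, gender, etc.).
     */
    public static Map<String, List<String>> match(
        Map<String, List<String>> candidates, int controlsPerCase) {
      Map<String, List<String>> assignment = new HashMap<String, List<String>>();
      Set<String> usedControls = new HashSet<String>();

      for (Map.Entry<String, List<String>> entry : candidates.entrySet()) {
        List<String> chosen = new ArrayList<String>();
        for (String control : entry.getValue()) {
          if (chosen.size() == controlsPerCase) {
            break;
          }
          if (usedControls.add(control)) { // add() returns false if already used
            chosen.add(control);
          }
        }
        assignment.put(entry.getKey(), chosen);
      }
      return assignment;
    }
  }

A greedy pass like this can leave later cases short of controls, which is exactly why the post treats construction as a graph matching problem and distributes it with Giraph.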

MongoDB Hadoop Connector Announced

Filed under: Hadoop,MongoDB — Patrick Durusau @ 7:13 pm

MongoDB Hadoop Connector Announced

From the post:

10gen is pleased to announce the availability of our first GA release of the MongoDB Hadoop Connector, version 1.0. This release was a long-term goal, and represents the culmination of over a year of work to bring our users a solid integration layer between their MongoDB deployments and Hadoop clusters for data processing. Available immediately, this connector supports many of the major Hadoop versions and distributions from 0.20.x and onwards.

The core feature of the Connector is to provide the ability to read MongoDB data into Hadoop MapReduce jobs, as well as writing the results of MapReduce jobs out to MongoDB. Users may choose to use MongoDB reads and writes together or separately, as best fits each use case. Our goal is to continue to build support for the components in the Hadoop ecosystem which our users find useful, based on feedback and requests.

For this initial release, we have also provided support for:

  • writing to MongoDB from Pig (thanks to Russell Jurney for all of his patches and improvements to this feature)
  • writing to MongoDB from the Flume distributed logging system
  • using Python to MapReduce to and from MongoDB via Hadoop Streaming.

Hadoop Streaming was one of the toughest features for the 10gen team to build. To that end, look for a more technical post on the MongoDB blog in the next week or two detailing the issues we encountered and how to utilize this feature effectively.
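
A rough sketch of the job wiring, as I read the connector documentation (the class names and the mongo.input.uri / mongo.output.uri property names are my recollection and should be checked against the 1.0 docs):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  import com.mongodb.hadoop.MongoInputFormat;
  import com.mongodb.hadoop.MongoOutputFormat;

  public class MongoJobSetup {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Collection to read from and collection to write results to (URIs are illustrative).
      conf.set("mongo.input.uri", "mongodb://localhost:27017/demo.events");
      conf.set("mongo.output.uri", "mongodb://localhost:27017/demo.event_counts");

      Job job = new Job(conf, "mongo-hadoop-example");
      job.setJarByClass(MongoJobSetup.class);
      job.setInputFormatClass(MongoInputFormat.class);   // mapper receives BSON documents
      job.setOutputFormatClass(MongoOutputFormat.class); // reducer output lands back in MongoDB
      // job.setMapperClass(...); job.setReducerClass(...); // your MapReduce logic as usual

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }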

Question: Is anyone working on a matrix of Hadoop connectors and their capabilities? A summary resource on Hadoop connectors might be of value.

April 9, 2012

First Look – Pervasive RushAnalyzer

Filed under: Hadoop,Knime,Pervasive RushAnalyzer — Patrick Durusau @ 4:31 pm

First Look – Pervasive RushAnalyzer

James Taylor writes:

Pervasive is best known for its data integration products but has recently been developing and releasing a series of products focused on analytics. RushAnalyzer is a combination of the KNIME data mining workbench (reviewed here) and Pervasive DataRush, a platform for parallelization and automatic scaling of data manipulation and analysis (reviewed here).

In the combined product, the base KNIME workbench has been extended for faster processing of larger data sets (big data) with a particular focus on use by analysts without any skills in parallelism or Hadoop programming. Pervasive has added parallelized KNIME nodes that include data access, data preparation and analytic modeling routines. KNIME’s support for extension means that KNIME’s interface is still what you use to define the modeling process but these processes can use the DataRush nodes to access and process larger volumes of data, read/write Hadoop-based data and automatically take full advantage of multi core, multi processor servers and clusters (including operations on Amazon’s EMR).

There is plenty of life left in closed source software but have you noticed the growing robustness of open source software?

I don’t know if that is attributable to the “open source” model as much as to commercial enterprises that find contributing professional software skills to “open source” projects a cost-effective way to get more programming bang for their buck.

Think about it. They can hire some of the best software talent around, who then associate with more world class programming talent than any one company is likely to have in house.

And, the resulting “product” is the result of all those world class programmers and not just the investment of one vendor. (So their investment is less than if they were creating a product on their own.)

Not to mention that any government or enterprise who wants to use the software will need support contracts from, you guessed it, the vendors who contributed to the creation of the software.

And we all know that the return on service contracts is an order of magnitude or greater than the return on software products.

Support your local open source project. Your local vendor will be glad you did.

April 5, 2012

Pegasus

Filed under: Hadoop,Pegasus,Spectral Graph Theory — Patrick Durusau @ 3:38 pm

Pegasus

I mentioned Pegasus on September 28th of 2010. It was at version 2.0 at that time.

It is at version 2.0 today.

With all the development, including in graph projects, over the last eighteen months, I expect to be reading about new capabilities and features.

There have been new publications, Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation, but it isn’t clear to what degree those have been incorporated into Pegasus.

April 4, 2012

Apache Bigtop 0.3.0 (incubating) has been released

Filed under: Bigtop,Flume,Hadoop,HBase,Hive,Mahout,Oozie,Sqoop,Zookeeper — Patrick Durusau @ 2:33 pm

Apache Bigtop 0.3.0 (incubating) has been released by Roman Shaposhnik.

From the post:

Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:

  • Apache Hadoop 1.0.1
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Thoughts on what is missing from this ecosystem?

What if you moved from the company where you wrote the scripts? And they needed new scripts?

Re-write? On what basis?

Is your “big data” big enough to need “big documentation?”

April 3, 2012

Apache Hadoop Versions: Looking Ahead

Filed under: Hadoop — Patrick Durusau @ 4:18 pm

Apache Hadoop Versions: Looking Ahead by Aaron Myers.

From the post:

A few months ago, my colleague Charles Zedlewski wrote a great piece explaining Apache Hadoop version numbering. The post can be summed up with the following diagram:

[graphic omitted]

While Charles’s post does a great job of explaining the history of Apache Hadoop version numbering, it doesn’t help users understand where Hadoop version numbers are headed.

A must read to avoid being confused yourself about future Hadoop development.

To say nothing of trying to explain future Hadoop development to others.

What we have gained from this self-inflicted travail remains unclear.

Apache Sqoop Graduates from Incubator

Filed under: Database,Hadoop,Sqoop — Patrick Durusau @ 4:18 pm

Apache Sqoop Graduates from Incubator by Arvind Prabhakar.

From the post:

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.

In its monthly meeting in March of 2012, the board of Apache Software Foundation (ASF) resolved to grant a Top-Level Project status to Apache Sqoop, thus graduating it from the Incubator. This is a significant milestone in the life of Sqoop, which has come a long way since its inception almost three years ago.

For moving data in and out of Hadoop, Sqoop is your friend. Drop by and say hello.

April 2, 2012

Scaling Solr Indexing with SolrCloud, Hadoop and Behemoth

Filed under: Behemoth,Hadoop,Solr,SolrCloud — Patrick Durusau @ 5:45 pm

Scaling Solr Indexing with SolrCloud, Hadoop and Behemoth

Grant Ingersoll writes:

We’ve been doing a lot of work at Lucid lately on scaling out Solr, so I thought I would blog about some of the things we’ve been working on recently and how it might help you handle large indexes with ease. First off, if you want a more basic approach using versions of Solr prior to what will be Solr4 and you don’t care about scaling out Solr indexing to match Hadoop or being fault tolerant, I recommend you read Indexing Files via Solr and Java MapReduce. (Note, you could also modify that code to handle these things. If you need do that, we’d be happy to help.)

Instead of doing all the extra work of making sure instances are up, etc., however, I am going to focus on using some of the new features of Solr4 (i.e. SolrCloud whose development effort has been primarily led by several of my colleagues: Yonik Seeley, Mark Miller and Sami Siren) which remove the need to figure out where to send documents when indexing, along with a convenient Hadoop-based document processing toolkit, created by Julien Nioche, called Behemoth that takes care of the need to write any Map/Reduce code and also handles things like extracting content from PDFs and Word files in a Hadoop friendly manner (think Apache Tika run in Map/Reduce) while also allowing you to output the results to things like Solr or Mahout, GATE and others as well as to annotate the intermediary results. Behemoth isn’t super sophisticated in terms of ETL (Extract-Transform-Load) capabilities, but it is lightweight, easy to extend and gets the job done on Hadoop without you having to spend time worrying about writing mappers and reducers.

If you are pushing the boundaries of your Solr 3.* installation or just want to know more about Solr4, this post is for you.
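
The “remove the need to figure out where to send documents” point is easy to see from the client side. Here is a minimal SolrJ sketch against a SolrCloud cluster (the collection name and ZooKeeper address are made up, and this is the plain SolrJ path, not the Behemoth/MapReduce path the post describes):

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class CloudIndexer {
    public static void main(String[] args) throws Exception {
      // Point at ZooKeeper; SolrCloud works out which shard/node gets each document.
      CloudSolrServer server = new CloudSolrServer("zk1.example.com:2181");
      server.setDefaultCollection("docs");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "example-1");
      doc.addField("title", "Scaling Solr Indexing with SolrCloud");
      server.add(doc);
      server.commit();
      server.shutdown();
    }
  }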

March 29, 2012

Why Hadoop MapReduce needs Scala

Filed under: Hadoop,MapReduce,Scalding,Scoobi — Patrick Durusau @ 6:40 pm

Why Hadoop MapReduce needs Scala – A look at Scoobi and Scalding DSLs for Hadoop by Age Mooij.

Fairly sparse slide deck, but enough to get you interested in investigating what Scoobi and Scalding have to offer.

It may just be me but I find it easier to download the PDFs if I want to view code. The font/color just isn’t readable with online slides. Suggestion: Always allow for downloads of your slides as PDF files.

FYI:

Scoobi

Scalding

March 26, 2012

Busting 10 Myths about Hadoop

Filed under: Hadoop — Patrick Durusau @ 6:36 pm

Busting 10 Myths about Hadoop – Hadoop is still misunderstood by many BI professionals by Philip Russom.

Philip lists ten facts that bust the myths (with explanations in his post):

  • Fact #1. Hadoop consists of multiple products.
  • Fact #2. Hadoop is open source but available from vendors, too.
  • Fact #3. Hadoop is an ecosystem, not a single product.
  • Fact #4. HDFS is a file system, not a database management system (DBMS).
  • Fact #5. Hive resembles SQL but is not standard SQL.
  • Fact #6. Hadoop and MapReduce are related but don’t require each other.
  • Fact #7. MapReduce provides control for analytics, not analytics per se.
  • Fact #8. Hadoop is about data diversity, not just data volume.
  • Fact #9. Hadoop complements a DW; it’s rarely a replacement.
  • Fact #10. Hadoop enables many types of analytics, not just Web analytics.

If you are unclear on any of these points, please see Philip’s post. (And/or sign up for Hadoop training.)

Measuring User Retention with Hadoop and Hive

Filed under: Hadoop,Hive,Marketing — Patrick Durusau @ 6:35 pm

Measuring User Retention with Hadoop and Hive by Daniel Russo.

From the post:

The Hadoop ecosystem is comprised of numerous technologies that can work together to provide a powerful and scalable mechanism for analyzing and deriving insight from large quantities of data.

In an effort to showcase the flexibility and raw power of queries that can be performed over large datasets stored in Hadoop, this post is written to demonstrate an example use case. The specific goal is to produce data related to user retention, an important metric for all product companies to analyze and understand.

Motivation: Why User Retention?

Broadly speaking, when equipped with the appropriate tools and data, we can enable our team and our customers to better understand the factors that drive user engagement and to ultimately make decisions that deliver better products to market.

User retention measures speak to the core of product quality by answering a crucial question about how the product resonates with users. In the case of apps (mobile or otherwise), that question is: “how many days does it take for users to stop using (or uninstall) the app?”.

Pinch Media (now Flurry) delivered a formative presentation early in the AppStore’s history. Among numerous insights collected from their dataset was the following slide, which detailed patterns in user retention across all apps implementing their tracking SDK:

I mention this example because:

  • User retention is the measure of an app’s success or failure.*
  • Hadoop and Hive skill sets are good ones to pick up.

* I have a pronounced fondness for requirements and the documenting of the same. Others prefer unit/user/interface/final tests. Still others prefer formal proofs of “correctness.” All pale beside the test of “user retention.” If users keep using an application, what other measure would be meaningful?

March 24, 2012

Cloudera Manager 3.7.4 released! (spurious alerts?)

Filed under: Cloudera,Hadoop — Patrick Durusau @ 7:36 pm

Cloudera Manager 3.7.4 released! by Bala Venkatrao.

From the post:

We are pleased to announce that Cloudera Manager 3.7.4 is now available! The most notable updates in this release are:

  • A fixed memory leak in supervisord
  • Compatibility with a scheduled refresh of CDH3u3
  • Significant improvements to the alerting functionality, and the rate of ‘false positive alerts’
  • Support for several new multi-homing features
  • Updates to the default heap sizes for the management daemons (these have been increased).

The detailed Cloudera Manager 3.7.4 release notes are available at: https://ccp.cloudera.com/display/ENT/Cloudera+Manager+3.7.x+Release+Notes

Cloudera Manager 3.7.4 is available to download from: https://ccp.cloudera.com/display/SUPPORT/Downloads

I admit to being curious (or is that suspicious?) and so when I read ‘false positive alerts’, I had to consult the release notes:

  • Some of the alerting behaviors have changed, including selected default settings. This has streamlined some of the alerting behavior and avoids spurious alerts in certain situations. These changes include:
    • The default alert values have been changed so that summary level alerts are disabled by default, to avoid unnecessary email alerts every time an individual health check alert email is sent.
    • The default behavior for DataNodes and TaskTrackers is now to never emit alerts.
    • The “Job Failure Ratio Thresholds” parameter has been disabled by default. The utility of this test very much depends on how the cluster is used. This parameter and the “Job Failure Ratio Minimum Failing Jobs” parameters can be used to alert when jobs fail.

So, the alerts in question were not spurious alerts but alerts that users of Cloudera Manager could not correctly configure?

Question: Can your Cloudera Manager users correctly configure alerts? (That could be a good Cloudera installation interview question. Use a machine disconnected from your network and the Internet for testing.)

Apache HBase 0.92.1 now available

Filed under: Cloudera,Hadoop,HBase — Patrick Durusau @ 7:35 pm

Apache HBase 0.92.1 now available by Shaneal Manek.

From the post:

Apache HBase 0.92.1 is now available. This release is a marked improvement in system correctness, availability, and ease of use. It’s also backwards compatible with 0.92.0 — except for the removal of the rarely-used transform functionality from the REST interface in HBASE-5228.

Apache HBase 0.92.1 is a bug fix release covering 61 issues – including 6 blockers and 6 critical issues, such as:

March 20, 2012

Authorization and Authentication In Hadoop

Filed under: Hadoop — Patrick Durusau @ 3:53 pm

Authorization and Authentication In Hadoop by Jon Natkins.

From the post:

One of the more confusing topics in Hadoop is how authorization and authentication work in the system. The first and most important thing to recognize is the subtle, yet extremely important, differentiation between authorization and authentication, so let’s define these terms first:

Authentication is the process of determining whether someone is who they claim to be.

Authorization is the function of specifying access rights to resources.

In simpler terms, authentication is a way of proving who I am, and authorization is a way of determining what I can do.

Let me see if I can summarize the authentication part: If you are responsible for the Hadoop cluster and unauthenticated users can access it, you need to have a backup job.

Hadoop doesn’t have authentication enabled by default, but authentication for access to the cluster could be performed by some other mechanism, such as controlling access to the network where the cluster resides.

There are any number of ways to do authentication, but lacking authentication for a network asset is a recipe for being fired upon its discovery.

Authorization regulates access and usage of cluster assets.
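
For the configuration side, here is a minimal sketch of what turning both knobs on looks like from client code, assuming Kerberos and a purely hypothetical principal and keytab path (in practice these properties live in core-site.xml and must be consistent across the cluster):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.security.UserGroupInformation;

  public class SecureClientLogin {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Authentication: who are you? "simple" (the default) trusts the client-supplied
      // identity; "kerberos" makes the cluster verify it.
      conf.set("hadoop.security.authentication", "kerberos");
      // Authorization: what may you do? Enables service-level access checks,
      // which are configured in hadoop-policy.xml.
      conf.set("hadoop.security.authorization", "true");

      UserGroupInformation.setConfiguration(conf);
      UserGroupInformation.loginUserFromKeytab(
          "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
      System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
    }
  }

The authorization half is then refined per service in hadoop-policy.xml and, for data, by HDFS file permissions.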

Here’s the test for authentication and authorization on your mission critical Hadoop cluster. While sitting in front of your cluster admin’s desk, ask for a copy of the authentication and authorization policies and settings for your cluster. If they can’t send it to a printer, you need another cluster admin. It is really that simple.
