Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

December 4, 2012

How to Contribute to Apache Hadoop Projects, in 24 Minutes

Filed under: Hadoop,Programming — Patrick Durusau @ 11:54 am

How to Contribute to Apache Hadoop Projects, in 24 Minutes by Justin Kestelyn.

From the webpage:

So, you want to report a bug, propose a new feature, or contribute code or doc to Apache Hadoop (or a related project), but you don’t know what to do and where to start? Don’t worry, you’re not alone.

Let us help: in this 24-minute screencast, Clouderan Jeff Bean (@jwfbean) offers a step-by-step tutorial that explains why and how to contribute. Apache JIRA ninjas need not view, but anyone else with slight (or less) familiarity with that curious beast will find this information very helpful.

I have mentioned a number of Hadoop ecosystem projects and this is a nice overview of how to contribute to those projects.

Other than the examples, the advice is generally useful for any Apache project (or other projects for that matter).

I first saw this in a tweet from Cloudera.

December 3, 2012

Cloudera – Videos from Strata + Hadoop World 2012

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 7:20 pm

Cloudera – Videos from Strata + Hadoop World 2012

The link is to the main resources page, where you can find many other videos and other materials.

If you want Strata + Hadoop World 2012 videos specifically, search on Hadoop World 2012.

As of today, that pulls up 41 entries. Should be enough to keep you occupied for a day or so. 😉

December 1, 2012

MOA Massive Online Analysis

Filed under: BigData,Data,Hadoop,Machine Learning,S4,Storm,Stream Analytics — Patrick Durusau @ 8:02 pm

MOA Massive Online Analysis: Real Time Analytics for Data Streams

From the homepage:

What is MOA?

MOA is an open source framework for data stream mining. It includes a collection of machine learning algorithms (classification, regression, and clustering) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.

What can MOA do for you?

MOA performs BIG DATA stream mining in real time, and large scale machine learning. MOA can be easily used with Hadoop, S4 or Storm, and extended with new mining algorithms, and new stream generators or evaluation measures. The goal is to provide a benchmark suite for the stream mining community. Details.

Short tutorials and a manual are available. Enough to get started, but you will need additional resources on machine learning if it isn’t already familiar ground.

A small niggle about documentation: many projects have files named “tutorial” or, in this case, “Tutorial1” or “Manual.” Those files are easier to discover (and save) if the project name, and ideally the version, is prepended to “tutorial” or “manual,” thus “Moa-2012-08-tutorial1” or “Moa-2012-08-manual.”

If data streams are in your present or future, definitely worth a look.
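
If you want a feel for the API before opening the manual, here is a minimal sketch of MOA-style prequential (test-then-train) evaluation. It assumes the HoeffdingTree classifier and RandomRBFGenerator stream generator that ship with MOA and follows the tutorial code; package and method names vary a bit between releases, so check them against your copy.

```java
import moa.classifiers.Classifier;
import moa.classifiers.trees.HoeffdingTree;
import moa.streams.generators.RandomRBFGenerator;
import weka.core.Instance;

public class PrequentialSketch {
    public static void main(String[] args) {
        // Synthetic data stream: radial basis function generator.
        RandomRBFGenerator stream = new RandomRBFGenerator();
        stream.prepareForUse();

        // Hoeffding tree: an incremental decision tree for streams.
        Classifier learner = new HoeffdingTree();
        learner.setModelContext(stream.getHeader());
        learner.prepareForUse();

        int seen = 0;
        int correct = 0;
        while (stream.hasMoreInstances() && seen < 100000) {
            Instance inst = stream.nextInstance();
            // Test first...
            if (learner.correctlyClassifies(inst)) {
                correct++;
            }
            // ...then train on the same instance.
            learner.trainOnInstance(inst);
            seen++;
        }
        System.out.println("Prequential accuracy: " + (100.0 * correct / seen) + "%");
    }
}
```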

November 28, 2012

Mortar [Public Launch, Python and Hadoop]

Filed under: Hadoop,Mortar,Usability — Patrick Durusau @ 9:59 am

Announcing our public launch

From the post:

Last week, we announced our $1.8 million fundraising. For those of you who follow big data startups, our blog post probably felt…underwhelming. Startups typically come out and make a huge publicity splash, jam-packed with buzzwords and vision galore. While we feel very fortunate to have what we need to help us grow, we know that VC funding is merely a means, and not an end.

But now you get to see us get really excited, because Mortar’s Hadoop PaaS and open source framework for big data is now publicly available. This means if you want to try it, you can activate your trial right now on our site without having to talk to anyone (unless you want to!).

You can get started on Mortar using Web Projects (using Mortar entirely online through the browser) or Git Projects (using Mortar locally on your own machine with the Mortar development framework). You can see more info about both here.

All trial accounts come with our full Hadoop PaaS, unlimited use of the Mortar framework, our site, and dev tools, and 10 free Hadoop node-hours. (You can get another 15 free node-hours per month and additional support at no cost by simply adding your credit card to the account.)

Mortar accepts Pig scripts and “real Python,” so you can use your favorite Python libraries with Hadoop.

I don’t know if there is any truth to the rumor that Mortar supports Python because Lars Marius Garshol and Steve Newcomb use it. So don’t ask me.

I first saw this in a tweet by David Fauth.

November 23, 2012

Combining Neo4J and Hadoop (part I)

Filed under: Hadoop,Neo4j — Patrick Durusau @ 11:29 am

Combining Neo4J and Hadoop (part I) by Kris Geusebroek.

From the post:

Why combine these two different things.

Hadoop is good for data crunching, but the end-results in flat files don’t present well to the customer, also it’s hard to visualize your network data in excel.

Neo4J is perfect for working with our networked data. We use it a lot when visualizing our different sets of data.
So we prepare our dataset with Hadoop and import it into Neo4J, the graph database, to be able to query and visualize the data.
We have a lot of different ways we want to look at our dataset so we tend to create a new extract of the data with some new properties to look at every few days.

This blog is about how we combined Hadoop and Neo4J and describes the phases we went through in our search for the optimal solution.

Mostly covers slow load speeds into Neo4j and attempts to improve them.

A future post will cover the use of a distributed batch importer process.

I first saw this at DZone.
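
For context on the load-speed problem, the usual alternative to pushing data through Neo4j’s transactional API is the embedded batch inserter, which skips transactions entirely. Below is a minimal sketch, assuming the Neo4j 1.8-era BatchInserter API and a hypothetical tab-separated edge list produced by a Hadoop job and copied to local disk; adjust class names to your Neo4j version.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class EdgeListImport {
    public static void main(String[] args) throws Exception {
        // Non-transactional inserter that writes directly to the store files.
        BatchInserter inserter = BatchInserters.inserter("target/graph.db");
        RelationshipType CONNECTED = DynamicRelationshipType.withName("CONNECTED");

        Map<String, Long> nodeIds = new HashMap<String, Long>();
        // Hypothetical Hadoop output: one "source<TAB>target" pair per line.
        BufferedReader in = new BufferedReader(new FileReader("part-r-00000.tsv"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.split("\t");
            long source = nodeId(inserter, nodeIds, cols[0]);
            long target = nodeId(inserter, nodeIds, cols[1]);
            inserter.createRelationship(source, target, CONNECTED, null);
        }
        in.close();
        inserter.shutdown(); // flushes the store; the database is only usable after this
    }

    private static long nodeId(BatchInserter inserter, Map<String, Long> seen, String name) {
        Long id = seen.get(name);
        if (id == null) {
            Map<String, Object> props = new HashMap<String, Object>();
            props.put("name", name);
            id = inserter.createNode(props);
            seen.put(name, id);
        }
        return id;
    }
}
```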

November 20, 2012

Hadoop on Azure: Introduction

Filed under: Azure Marketplace,Hadoop — Patrick Durusau @ 7:28 pm

Hadoop on Azure: Introduction by BrunoTerkaly.

From the post:

I am in complete awe on how this technology is resonating with today’s developers. If I invite developers for an evening event, Big Data is always a sellout.

This particular post is about getting everyone up to speed about what Hadoop is at a high level.

Big data is a technology that manages voluminous amount of unstructured and semi-structured data.

Due to its size and semi-structured nature, it is inappropriate for relational databases for analysis.

Big data is generally in the petabytes and exabytes of data.

A very high level view but a series to watch as the details emerge on using Hadoop on Azure.

November 19, 2012

The “Ask Bigger Questions” Contest!

Filed under: Cloudera,Contest,Hadoop — Patrick Durusau @ 7:17 pm

The “Ask Bigger Questions” Contest! by Ryan Goldman. (Deadline: Feb. 1, 2013)

From the post:

Have you helped your company ask bigger questions? Our mission at Cloudera University is to equip Hadoop professionals with the skills to manage, process, analyze, and monetize more data than they ever thought possible.

Over the past three years, we’ve heard many great stories from our training participants about faster cluster deployments, complex data workflows made simple, and superhero troubleshooting moments. And we’ve heard from executives in all types of businesses that staffing Cloudera Certified professionals gives them confidence that their Hadoop teams have the skills to turn data into breakthrough insights.

Now, it’s your turn to tell us your bigger questions story! Cloudera University is seeking tales of Hadoop success originating with training and certification. How has an investment in your education paid dividends for your company, team, customer, or career?

The most compelling stories chosen from all entrants will receive prizes like Amazon gift cards, discounted Cloudera University training, autographed copies of Hadoop books from O’Reilly Media, and Cloudera swag. We may even turn your story into a case study!

Sign up to participate here. Submissions must be received by Friday, Feb. 1, 2013 to qualify for a prize.

A good marketing technique that might bear imitation.

You don’t have to seek out success stories; the incentives bring them to you.

You get good marketing material that is likely to resonate with other users.

Something to think about.

BioInformatics: A Data Deluge with Hadoop to the Rescue

Filed under: Bioinformatics,Cloudera,Hadoop,Impala — Patrick Durusau @ 4:10 pm

BioInformatics: A Data Deluge with Hadoop to the Rescue by Marty Lurie.

From the post:

Cloudera Cofounder and Chief Scientist Jeff Hammerbacher is leading a revolutionary project with Mount Sinai School of Medicine to apply the power of Cloudera’s Big Data platform to critical problems in predicting and understanding the process and treatment of disease.

“We are at the cutting edge of disease prevention and treatment, and the work that we will do together will reshape the landscape of our field,” said Dennis S. Charney, MD, Anne and Joel Ehrenkranz Dean, Mount Sinai School of Medicine and Executive Vice President for Academic Affairs, The Mount Sinai Medical Center. “Mount Sinai is thrilled to join minds with Cloudera.” (Please see http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1747809 for more details.)

Cloudera is active in many other areas of BioInformatics. Due to Cloudera’s market leadership in Big Data, many DNA mapping programs have specific installation instructions for CDH (Cloudera’s 100% open-source, enterprise-ready distribution of Hadoop and related projects). But rather than just tell you about Cloudera let’s do a worked example of BioInformatics data – specifically FAERS.

A sponsored piece by Cloudera, but one that walks you through using Impala with the FDA’s adverse drug reaction data (FAERS).

Demonstrates that getting started with Impala isn’t hard, which is true.

What’s lacking is a measure of how hard it is to get good results.

Any old result, good or bad, probably isn’t of interest to most users.

November 15, 2012

Cloudera Glossary

Filed under: Cloudera,Hadoop — Patrick Durusau @ 3:47 pm

Cloudera Glossary

A goodly collection of terms used with Cloudera (Hadoop and related) technology.

I have a weakness for dictionaries, lexicons, grammars and the like so your mileage may vary.

Includes links to additional resources.

November 14, 2012

Kiji Project [Framework for HBase]

Filed under: Entities,Hadoop,HBase,Kiji Project — Patrick Durusau @ 1:22 pm

Kiji Project: An Open Source Framework for Building Big Data Applications with Apache HBase by Aaron Kimball.

From the post:

Our team at WibiData has been developing applications on Hadoop since 2010 and we’ve helped many organizations transform how they use data by deploying Hadoop. HBase in particular has allowed companies of all types to drive their business using scalable, high performance storage. Organizations have started to leverage these capabilities for various big data applications, including targeted content, personalized recommendations, enhanced customer experience and social network analysis.

While building many of these applications, we have seen emerging tools, design patterns and best practices repeated across projects. One of the clear lessons learned is that Hadoop and HBase provide very low-level interfaces. Each large-scale application we have built on top of Hadoop has required a great deal of scaffolding and data management code. This repetitive programming is tedious, error-prone, and makes application interoperability more challenging in the long run.

Today, we are proud to announce the launch of the Kiji project (www.kiji.org), as well as the first Kiji component: KijiSchema. The Kiji project was developed to host a suite of open source components built on top of Apache HBase and Apache Hadoop, that makes it easier for developers to:

  1. Use HBase as a real-time data storage and serving layer for applications
  2. Maximize HBase performance using data management best practices
  3. Get started building data applications quickly with easy startup and configuration

Kiji is open source and licensed under the Apache 2.0 license. The Kiji project is modularized into separate components to simplify adoption and encourage clean separation of functionality. Our approach emphasizes interoperability with other systems, leveraging the open source HBase, Avro and MapReduce projects, enabling you to easily fit Kiji into your development process and applications.

KijiSchema: Schema Management for HBase

The first component within the Kiji project is KijiSchema, which provides layout and schema management on top of HBase. KijiSchema gives developers the ability to easily store both structured and unstructured data within HBase using Avro serialization. It supports a variety of rich schema features, including complex, compound data types, HBase column key and time-series indexing, as well as cell-level evolving schemas that dynamically encode version information.

KijiSchema promotes the use of entity-centric data modeling, where all information about a given entity (user, mobile device, ad, product, etc.), including dimensional and transaction data, is encoded within the same row. This approach is particularly valuable for user-based analytics such as targeting, recommendations, and personalization.

This looks important!

Reading further about their “entity-centric” approach:

Entity-Centric Data Model

KijiSchema’s data model is entity-centric. Each row typically holds information about a single entity in your information scheme. As an example, a consumer e-commerce web site may have a row representing each user of their site. The entity-centric data model enables easier analysis of individual entities. For example, to recommend products to a user, information such as the user’s past purchases, previously viewed items, search queries, etc. all need to be brought together. The entity-centric model stores all of these attributes of the user in the same row, allowing for efficient access to relevant information.

The entity-centric data model stands in comparison to a more typical log-based approach to data collection. Many MapReduce systems import log files for analysis. Logs are action-centric; each action performed by a user (adding an item to a shopping cart, checking out, performing a search, viewing a product) generates a new log entry. Collecting all the data required for a per-user analysis thus requires a scan of many logs. The entity-centric model is a “pivoted” form of this same information. By pivoting the information as the data is loaded into KijiSchema, later analysis can be run more efficiently, either in a MapReduce job operating over all users, or in a more narrowly-targeted fashion if individual rows require further computation.

I’m already convinced about a single representative for an entity. 😉

Need to work through the documentation on capturing diverse information about a single entity in one row.

I suspect that the structures that capture data aren’t entities for purposes of this model.

Still, will be an interesting exploration.
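
For a rough picture of what an entity-centric row looks like underneath an abstraction like KijiSchema, here is a minimal sketch against the plain HBase 0.92/0.94 client API (not Kiji itself); the table and column family names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class EntityCentricRow {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable users = new HTable(conf, "users"); // hypothetical table, one row per user

        // Everything we know about user_42 lives in one row, spread across
        // column families; cell timestamps carry the time series.
        Put put = new Put(Bytes.toBytes("user_42"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("u42@example.com"));
        put.add(Bytes.toBytes("purchases"), Bytes.toBytes("sku-123"), Bytes.toBytes("2012-11-01"));
        put.add(Bytes.toBytes("searches"), Bytes.toBytes("q"), Bytes.toBytes("red shoes"));
        users.put(put);

        // A recommender reads one row instead of scanning many log files.
        Result row = users.get(new Get(Bytes.toBytes("user_42")));
        System.out.println("Cells stored for user_42: " + row.size());

        users.close();
    }
}
```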

Cloudera Impala – Fast, Interactive Queries with Hadoop

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 5:50 am

Cloudera Impala – Fast, Interactive Queries with Hadoop by Istvan Szegedi.

From the post:

As discussed in the previous post about Twitter’s Storm, Hadoop is a batch oriented solution that lacks support for ad-hoc, real-time queries. Many of the players in Big Data have realised the need for fast, interactive queries besides the traditional Hadoop approach. Cloudera, one of the key solution vendors in the Big Data/Hadoop domain, has just recently launched Cloudera Impala, which addresses this gap.

As the Cloudera engineering team described in their blog, their work was inspired by Google’s Dremel paper, which is also the basis for Google BigQuery. Cloudera Impala provides a HiveQL-like query language for a wide variety of SELECT statements with WHERE, GROUP BY, and HAVING clauses, with ORDER BY (though currently LIMIT is mandatory with ORDER BY), joins (LEFT, RIGHT, FULL, OUTER, INNER), UNION ALL, external tables, etc. It also supports arithmetic and logical operators and Hive built-in functions such as COUNT, SUM, LIKE, IN or BETWEEN. It can access data stored on HDFS, but it does not use MapReduce; instead it is based on its own distributed query engine.

The current Impala release (Impala 1.0beta) does not support DDL statements (CREATE, ALTER, DROP TABLE); all table creation/modification/deletion functions have to be executed via Hive and then refreshed in the Impala shell.

Cloudera Impala is open source under the Apache License; the code can be retrieved from GitHub. Its components are written in C++, Java and Python.

Will get you off to a good start with Impala.
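
To make the supported SQL subset concrete, here is a hedged sketch of running one such query from Java over JDBC. It assumes a HiveServer2-compatible JDBC driver on the classpath and a hypothetical host, port, and table; note the LIMIT accompanying ORDER BY, which the beta requires.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuerySketch {
    public static void main(String[] args) throws Exception {
        // Assumption: a HiveServer2-compatible JDBC driver is available and
        // Impala is listening on this host/port; adjust both for your cluster.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive2://impala-host:21050/;auth=noSasl");

        Statement stmt = conn.createStatement();
        // Hypothetical adverse-events table.
        ResultSet rs = stmt.executeQuery(
            "SELECT drug, COUNT(*) AS reports " +
            "FROM faers_events " +
            "WHERE outcome = 'HOSPITALIZATION' " +
            "GROUP BY drug " +
            "HAVING COUNT(*) > 100 " +
            "ORDER BY reports DESC LIMIT 20");

        while (rs.next()) {
            System.out.println(rs.getString("drug") + "\t" + rs.getLong("reports"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}
```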

November 9, 2012

Twitter Flies by Hadoop on Search Quest

Filed under: Hadoop,HDFS,Tweets — Patrick Durusau @ 4:38 pm

Twitter Flies by Hadoop on Search Quest by Ian Armas Foster.

From the post:

People who use Twitter may not give a second thought to the search bar at the top of the page. It’s pretty basic, right? You type something into the nifty little box and, like the marvel of efficient search that it is, it offers suggestions for things the user might wish to search during the typing process.

On the surface, it operates like any decent search engine. Except, of course, this is Twitter we’re talking about. There is no basic functionality at the core here. As it turns out, a significant amount of effort went into designing the Twitter search suggestion engine and the network is still just getting started refining this engine.

A recent Twitter-published scientific paper tells the tale of Twitter’s journey through their previously existing Hadoop infrastructure to a custom combined infrastructure. This connects the HDFS to a frontend cache (to deal with queries and responses) and a backend (which houses algorithms that rank relevance).

The latency of the Hadoop solution was too high.

Makes me think about topic map authoring with a real time “merging” interface, one that displays the results of the current topic, association or occurrence being authored on the map.

Or at least the option to see such a display with a reasonable response time.

Matching MDM with Hadoop: Think of the Possibilities [Danger! Danger! Will Robinson!]

Filed under: Hadoop,MDM — Patrick Durusau @ 4:08 pm

Matching MDM with Hadoop: Think of the Possibilities by Loraine Lawson.

From the post:

I’m always curious about use cases with Hadoop, mostly because I feel there’s a lot of unexplored potential still.

For example, could Hadoop make it easier to achieve master data management’s goal of a “single version of the customer” from large datasets? During a recent interview with IT Business Edge, Ciaran Dynes said the idea has a lot of potential, especially when you consider that customer records from, say, banks can have up to 150 different attributes.

Hadoop can allow you to explore as many dimensions and attributes as you want, he explained.

“They have every flavor of your address and duplications of your address, for that matter, in that same record,” Dynes, Talend’s senior director of product management and product marketing, said. “What Hadoop allows you to consider is, ‘Let’s put it all up there for the problems that they’re presenting like a single version of the customer.’”

Dynes also thinks we’re still exploring the edges of Hadoop’s potential to change information management.

“We genuinely think it is going to probably have a bigger effect on the industry than the cloud,” he said. “It’s opening up possibilities that we didn’t think we could look at in terms of analytics, would be one thing. But I think there’s so many applications for this technology and so many ways of thinking about how you integrate your entire company that I do think it’ll have a profound effect on the industry.”

When I hear the phrase “…single version of the customer…” I think of David Loshin’s “A Good Example of Semantic Inconsistency” (my pointer with comments)

David illustrates that “customer” is a term fraught with complexity.

Having a bigger gun doesn’t make a moving target easier to hit.

It can do more damage unintentionally than good.

Why not RAID-0? It’s about Time and Snowflakes

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 11:51 am

Why not RAID-0? It’s about Time and Snowflakes by Steve Loughran.

From the post:

A recurrent question on the various Hadoop mailing lists is “why does Hadoop prefer a set of separate disks to the same set managed as a RAID-0 disk array?”

Steve uses empirical data on disk storage to explain why to avoid RAID-0 when using Hadoop.

As nice a summary as you are likely to find.

November 8, 2012

What’s New in Apache Sqoop 1.4.2

Filed under: Hadoop,Sqoop — Patrick Durusau @ 6:53 pm

What’s New in Apache Sqoop 1.4.2 by Jarek Jarcec Cecho.

Jarek highlights the key features and fixes of this release of Apache Sqoop (its first as a top level project).

Those include:

  • Hadoop 2.0.0 Support
  • Compatibility with Old Connectors
  • Incremental Imports of Free Form Queries
  • Implicit and Explicit Connector Pickup Improvements
  • Exporting Only a Subset of Columns
  • Verbose Logging
  • Hive Imports

From the Apache Sqoop homepage:

Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Cascading 2.1

Filed under: Cascading,Hadoop — Patrick Durusau @ 6:41 pm

Cascading 2.1

Cascading 2.1 was released October 30, 2012. (Apologies for missing the release.)

If you don’t know Cascading, it describes itself as:

Big Data Application Development

Cascading is a Java application framework that enables typical developers to quickly and easily develop rich Data Analytics and Data Management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API compatible distributions.

Data Processing API

At its core, Cascading is a rich Java API for defining complex data flows and creating sophisticated data oriented frameworks. These frameworks can be Maven compatible libraries, or Domain Specific Languages (DSLs) for scripting.

Data Integration API

Cascading allows developers to create and test rich functionality before tackling complex integration problems. Thus integration points can be developed and tested before plugging them into a production data flow.

Process Scheduler API

The Process Scheduler coupled with the Riffle lifecycle annotations allows Cascading to schedule units of work from any third-party application.

Enterprise Development

Cascading was designed to fit into any Enterprise Java development environment. With its clear distinction between “data processing” and “data integration”, its clean Java API, and JUnit testing framework, Cascading can be easily tested at any scale. Even the core Cascading development team runs 1,500 tests daily on a Continuous Integration server and deploys all the tested Java libraries into our own public Maven repository, conjars.org.

Data Science

Because Cascading is Java based, it naturally fits into all of the JVM based languages available. Notably Scala, Clojure, JRuby, Jython, and Groovy. Within many of these languages, scripting and query languages have been created by the Cascading community to simplify ad-hoc and production ready analytics and machine learning applications. See the extensions page for more information.

Homepage link: http://www.cascading.org/
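
The “Data Processing API” description is easier to appreciate in code. Here is a minimal word-count sketch along the lines of Cascading’s own tutorials, assuming the 2.x API; the paths are placeholders and exact constructors may differ slightly between point releases.

```java
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCountFlow {
    public static void main(String[] args) {
        // Placeholder HDFS paths for source and sink taps.
        Tap docTap = new Hfs(new TextDelimited(true, "\t"), "input/docs.tsv");
        Tap wcTap = new Hfs(new TextDelimited(true, "\t"), "output/wordcount");

        // Split each "text" field into words, group by word, count each group.
        Pipe pipe = new Pipe("wordcount");
        pipe = new Each(pipe, new Fields("text"),
                new RegexSplitGenerator(new Fields("word"), "[ \\[\\]\\(\\),.]"));
        pipe = new GroupBy(pipe, new Fields("word"));
        pipe = new Every(pipe, Fields.ALL, new Count(), Fields.ALL);

        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, WordCountFlow.class);

        FlowDef flowDef = FlowDef.flowDef()
                .setName("wc")
                .addSource(pipe, docTap)
                .addTailSink(pipe, wcTap);

        // Plan and run the flow on Hadoop.
        new HadoopFlowConnector(properties).connect(flowDef).complete();
    }
}
```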

100% Big Data 0% Hadoop 0% Java

Filed under: BigData,Erjang,Hadoop,Python — Patrick Durusau @ 3:36 pm

100% Big Data 0% Hadoop 0% Java by Pavlo Baron.

If you are guessing Python and Erlang, take another cookie!

Not a lot of details (it’s slides) but take a look at: https://github.com/pavlobaron, Disco in particular.

Hadoop can only be improved by insights gained from alternative approaches.

Recall that we took the “one answer only” road in databases not long ago. Yes?

November 7, 2012

Impala: Real-time Queries in Hadoop [Recorded Webinar]

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 4:05 pm

Impala: Real-time Queries in Hadoop

From the description:

Learn how Cloudera Impala empowers you to:

  1. Perform interactive, real-time analysis directly on source data stored in Hadoop
  2. Interact with data in HDFS and HBase at the “speed of thought”
  3. Reduce data movement between systems & eliminate double storage

You can also grab the slides here.

Almost fifty-nine minutes.

The speedup over Hive on MapReduce is reported to be 4-30X.

I’m not sure about #2 above but then I lack a skull-jack. Maybe this next Christmas. 😉

November 5, 2012

The Week in Big Data Research [November 3, 2012]

Filed under: BigData,Hadoop,MapReduce — Patrick Durusau @ 11:22 am

The Week in Big Data Research from Datanami.

A new feature from Datanami that highlights academic research on big data.

In last Friday’s post you will find:

MapReduce-Based Data Stream Processing over Large History Data

Abstract:

With the development of Internet of Things applications based on sensor data, how to process high speed data stream over large scale history data brings a new challenge. This paper proposes a new programming model RTMR, which improves the real-time capability of traditional batch processing based MapReduce by preprocessing and caching, along with pipelining and localizing. Furthermore, to adapt the topologies to application characteristics and cluster environments, a model analysis based RTMR cluster constructing method is proposed. The benchmark built on the urban vehicle monitoring system shows RTMR can provide the real-time capability and scalability for data stream processing over large scale data.


Mastiff: A MapReduce-based System for Time-Based Big Data Analytics

Abstract:

Existing MapReduce-based warehousing systems are not specially optimized for time-based big data analysis applications. Such applications have two characteristics: 1) data are continuously generated and are required to be stored persistently for a long period of time, 2) applications usually process data in some time period so that typical queries use time-related predicates. Time-based big data analytics requires both high data loading speed and high query execution performance. However, existing systems including current MapReduce-based solutions do not solve this problem well because the two requirements are contradictory. We have implemented a MapReduce-based system, called Mastiff, which provides a solution to achieve both high data loading speed and high query performance. Mastiff exploits a systematic combination of a column group store structure and a lightweight helper structure. Furthermore, Mastiff uses an optimized table scan method and a column-based query execution engine to boost query performance. Based on extensive experiments results with diverse workloads, we will show that Mastiff can significantly outperform existing systems including Hive, HadoopDB, and GridSQL.


Fast Data Analysis with Integrated Statistical Metadata in Scientific Datasets

Abstract:

Scientific datasets, such as HDF5 and PnetCDF, have been used widely in many scientific applications. These data formats and libraries provide essential support for data analysis in scientific discovery and innovations. In this research, we present an approach to boost data analysis, namely Fast Analysis with Statistical Metadata (FASM), via data sub setting and integrating a small amount of statistics into datasets. We discuss how the FASM can improve data analysis performance. It is currently evaluated with the PnetCDF on synthetic and real data, but can also be implemented in other libraries. The FASM can potentially lead to a new dataset design and can have an impact on data analysis.


MapReduce Performance Evaluation on a Private HPC Cloud

Abstract:

The convergence of accessible cloud computing resources and big data trends have introduced unprecedented opportunities for scientific computing and discovery. However, HPC cloud users face many challenges when selecting valid HPC configurations. In this paper, we report a set of performance evaluations of data intensive benchmarks on a private HPC cloud to help with the selection of such configurations. More precisely, we study the effect of virtual machines core-count on the performance of 3 benchmarks widely used by the MapReduce community. We notice that depending on the computation to communication ratios of the studied applications, using higher core-counts virtual machines do not always lead to higher performance for data-intensive applications.


I manage to visit Datanami once or twice a day, though usually not for as long as I should. 😉 Visit; I think you will be pleasantly surprised.

PS: You will be seeing some of these articles in separate posts. Thought the cutting/bleeding edge types would like notice sooner rather than later.

October 31, 2012

One To Watch: Apache Crunch

Filed under: Apache Crunch,Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:37 pm

One To Watch: Apache Crunch by Chris Mayer.

From the post:

Over the past few years, the Apache Software Foundation has become the hub for big data-focused projects. An array of companies have recognised the worth of housing their latest innovative projects at the ASF, with Apache Hadoop and Apache Cassandra two shining examples.

Amongst the number of projects arriving in the Apache Incubator was Apache Crunch. Crunch is a Java library created to eliminate the tedium of writing a MapReduce pipeline. It aims to take hold of the entire process, making writing, testing, and running MapReduce pipelines more efficient and “even fun” (if this Cloudera blog post is to be believed).

That’s a tall order, to make MapReduce pipelines “even fun.” On the other hand, remarkable things have emerged from Apache for decades now.

A project to definitely keep in sight.

October 30, 2012

7 Symptoms you are turning into a Hadoop nerd

Filed under: Hadoop,Humor — Patrick Durusau @ 1:29 pm

7 Symptoms you are turning into a Hadoop nerd

Very funny!

Although, as you imagine, my answer for #2 differs. 😉

Enjoy!

October 29, 2012

Top 5 Challenges for Hadoop MapReduce… [But Semantics Isn’t One Of Them]

Filed under: Hadoop,MapReduce — Patrick Durusau @ 1:46 pm

Top 5 Challenges for Hadoop MapReduce in the Enterprise

IBM sponsored content at Datanami.com lists these challenges for Hadoop MapReduce in enterprise settings:

  • Lack of performance and scalability….
  • Lack of flexible resource management….
  • Lack of application deployment support….
  • Lack of quality of service assurance….
  • Lack of multiple data source support….

Who would know enterprise requirements better than IBM? They have been in the enterprise business long enough to be an enterprise themselves.

If IBM says these are the top 5 challenges for Hadoop MapReduce in enterprises, it’s a good list.

But I don’t see “semantics” in that list.

Do you?

Semantics make it possible to combine data from different sources, process it and report a useful answer.

Or rather understanding data semantics and mapping between them makes a useful answer possible.

Try pushing data from different sources together without understanding and mapping their semantics.

It won’t take long for you to decide which way you prefer.

If semantics are critical to any data operation, including combining data from diverse sources, why do they get so little attention?

I doubt your IBM representative would know, but you could ask them while trying out the IBM solution to the “top 5 challenges for Hadoop MapReduce”:

How should you discover and then map the semantics of diverse data sources?

Having mapped them once, can you re-use that mapping for future projects with the IBM solution?

October 27, 2012

Designing good MapReduce algorithms

Filed under: Algorithms,BigData,Hadoop,MapReduce — Patrick Durusau @ 6:28 pm

Designing good MapReduce algorithms by Jeffrey D. Ullman.

From the introduction:

If you are familiar with “big data,” you are probably familiar with the MapReduce approach to implementing parallelism on computing clusters [1]. A cluster consists of many compute nodes, which are processors with their associated memory and disks. The compute nodes are connected by Ethernet or switches so they can pass data from node to node.

Like any other programming model, MapReduce needs an algorithm-design theory. The theory is not just the theory of parallel algorithms—MapReduce requires we coordinate parallel processes in a very specific way. A MapReduce job consists of two functions written by the programmer, plus some magic that happens in the middle:

  1. The Map function turns each input element into zero or more key-value pairs. A “key” in this sense is not unique, and it is in fact important that many pairs with a given key are generated as the Map function is applied to all the input elements.
  2. The system sorts the key-value pairs by key, and for each key creates a pair consisting of the key itself and a list of all the values associated with that key.
  3. The Reduce function is applied, for each key, to its associated list of values. The result of that application is a pair consisting of the key and whatever is produced by the Reduce function applied to the list of values. The output of the entire MapReduce job is what results from the application of the Reduce function to each key and its list.

When we execute a MapReduce job on a system like Hadoop [2], some number of Map tasks and some number of Reduce tasks are created. Each Map task is responsible for applying the Map function to some subset of the input elements, and each Reduce task is responsible for applying the Reduce function to some number of keys and their associated lists of values. The arrangement of tasks and the key-value pairs that communicate between them is suggested in Figure 1. Since the Map tasks can be executed in parallel and the Reduce tasks can be executed in parallel, we can obtain an almost unlimited degree of parallelism—provided there are many compute nodes for executing the tasks, there are many keys, and no one key has an unusually long list of values

A very important feature of the Map-Reduce form of parallelism is that tasks have the blocking property [3]; that is, no Map or Reduce task delivers any output until it has finished all its work. As a result, if a hardware or software failure occurs in the middle of a MapReduce job, the system has only to restart the Map or Reduce tasks that were located at the failed compute node. The blocking property of tasks is essential to avoid restart of a job whenever there is a failure of any kind. Since Map-Reduce is often used for jobs that require hours on thousands of compute nodes, the probability of at least one failure is high, and without the blocking property large jobs would never finish.

There is much more to the technology of MapReduce. You may wish to consult a free online text that covers MapReduce and a number of its applications [4].

Warning: This article may change your interest in the design of MapReduce algorithms.

Ullman’s stories of algorithm tradeoffs provide motivation to evaluate (or reevaluate) your own design tradeoffs.
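
To ground the two-function model Ullman describes, here is the canonical word count written against the Hadoop mapreduce API: the Map function emits (word, 1) pairs, the system groups them by key, and the Reduce function sums each key’s list. A minimal sketch; the input and output paths come from the command line.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: each input line becomes zero or more (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: for each word, sum the list of counts grouped under it.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```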

October 25, 2012

DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight…(Hortonworks Inside)

Filed under: Hadoop,HDInsight,Hortonworks,Microsoft — Patrick Durusau @ 4:02 pm

DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight at Strata NYC (Hortonworks Inside) by Russell Jurney.

From the post:

You don’t see many demos like the one given by Shawn Bice (Microsoft) today in the Regent Parlor of the New York Hilton, at Strata NYC. “Drive Smarter Decisions with Microsoft Big Data,” was different.

For starters – everything worked like clockwork. Live demos of new products are notorious for failing on-stage, even if they work in production. And although Microsoft was presenting about a Java-based platform at a largely open-source event… it was standing room only, with the crowd overflowing out the doors.

Shawn demonstrated working with Apache Hadoop from Excel, through Power Pivot, to Hive (with sampling-driven early results!?) and out to import third party data-sets. To get the full effect of what he did, you’re going to have to view a screencast or try it out but to give you the idea of what the first proper interface on Hadoop feels like…

My thoughts on reading Russell’s post:

  • A live product demo that did not fail? Really?
  • Is that tattoo copyrighted?
  • Oh, yes, +1!, big data has become real for millions of users.

How’s that for a big data book, tutorial, consulting, semantic market explosion?

Why Microsoft is committed to Hadoop and Hortonworks

Filed under: BigData,Hadoop,Hortonworks,Microsoft — Patrick Durusau @ 2:53 pm

Why Microsoft is committed to Hadoop and Hortonworks (a guest post at Hortonworks by Microsoft’s Dave Campbell).

From the post:

Last February at Strata Conference in Santa Clara we shared Microsoft’s progress on Big Data, specifically working to broaden the adoption of Hadoop with the simplicity and manageability of Windows and enabling customers to easily derive insights from their structured and unstructured data through familiar tools like Excel.

Hortonworks is a recognized pioneer in the Hadoop Community and a leading contributor to the Apache Hadoop project, and that’s why we’re excited to announce our expanded partnership with Hortonworks to give customers access to an enterprise-ready distribution of Hadoop that is 100 percent compatible with Windows Server and Windows Azure. To provide customers with access to this Hadoop compatibility, yesterday we also released new previews of Microsoft HDInsight Server for Windows and Windows Azure HDInsight Service, our Hadoop-based solutions for Windows Server and Windows Azure.

With this expanded partnership, the Hadoop community will reap the following benefits of Hadoop on Windows:

  • Insights to all users from all data:….
  • Enterprise-ready Hadoop with HDInsight:….
  • Simplicity of Windows for Hadoop:….
  • Extend your data warehouse with Hadoop:….
  • Seamless Scale and Elasticity of the Cloud:….

This is a very exciting milestone, and we hope you’ll join us for the ride as we continue partnering with Hortonworks to democratize big data. Download HDInsight today at Microsoft.com/BigData.

See Dave’s post for the details on “benefits of Hadoop on Windows” and then like the man says:

Download HDInsight today at Microsoft.com/BigData.

Enabling Big Data Insight for Millions of Windows Developers [Your Target Audience?]

Filed under: Azure Marketplace,BigData,Hadoop,Hortonworks,Microsoft — Patrick Durusau @ 2:39 pm

Enabling Big Data Insight for Millions of Windows Developers by Shaun Connolly.

From the post:

At Hortonworks, we fundamentally believe that, in the not-so-distant future, Apache Hadoop will process over half the world’s data flowing through businesses. We realize this is a BOLD vision that will take a lot of hard work by not only Hortonworks and the open source community, but also software, hardware, and solution vendors focused on the Hadoop ecosystem, as well as end users deploying platforms powered by Hadoop.

If the vision is to be achieved, we need to accelerate the process of enabling the masses to benefit from the power and value of Apache Hadoop in ways where they are virtually oblivious to the fact that Hadoop is under the hood. Doing so will help ensure time and energy is spent on enabling insights to be derived from big data, rather than on the IT infrastructure details required to capture, process, exchange, and manage this multi-structured data.

So how can we accelerate the path to this vision? Simply put, we focus on enabling the largest communities of users interested in deriving value from big data.

You don’t have to wonder long what Shaun is reacting to:

Today Microsoft unveiled previews of Microsoft HDInsight Server and Windows Azure HDInsight Service, big data solutions that are built on Hortonworks Data Platform (HDP) for Windows Server and Windows Azure respectively. These new offerings aim to provide a simplified and consistent experience across on-premise and cloud deployment that is fully compatible with Apache Hadoop.

Enabling big data insight isn’t the same as capturing those insights for later use or re-use.

May just be me, but that sounds like a great opportunity for topic maps.

Bringing semantics to millions of Windows developers that is.

Cloudera’s Impala and the Semantic “Mosh Pit”

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 4:30 am

Cloudera’s Impala tool binds Hadoop with business intelligence apps by Christina Farr.

From the post:

In traditional circles, Hadoop is viewed as a bright but unruly problem child.

Indeed, it is still in the nascent stages of development. However the scores of “big data” startups that leverage Hadoop will tell you that it is here to stay.

Cloudera, the venture-backed startup that ushered the mainstream deployment of Hadoop, has unveiled a new technology at the Hadoop World, the data-focused conference in New York.

Its new product, known as “Impala”, addresses many of the concerns that large enterprises still have about Hadoop, namely that it does not integrate well with traditional business intelligence applications.

“We have heard this criticism,” said Charles Zedlewski, Cloudera’s VP of Product in a phone interview with VentureBeat. “That’s why we decided to do something about it,” he said.

Impala enables its users to store vast volumes of unwieldy data and run queries in HBase, Hadoop’s NoSQL database. What’s interesting is that it is built to maximise speed: it runs on top of Hadoop storage, but speaks to SQL and works with pre-existing drivers.

Legacy data is a well known concept.

Are we approaching the point of legacy applications? Applications that are too widely/deeply embedded in IT infrastructure to be replaced?

Or at least not replaced quickly?

The semantics of legacy data are known to be fair game for topic maps. Do the semantics of legacy applications offer the same possibilities?

Mapping the semantics of “legacy” applications, their ancestors and descendants, data, legacy and otherwise, results in a semantic mosh pit.

Some strategies for a semantic “mosh pit:”

  1. Prohibit it (we know the success rate on that option)
  2. Ignore it (costly but more “successful” than #1)
  3. Create an app on top of the legacy app (an error repeated isn’t an error, it’s following precedent)
  4. Sample it (but what are you missing?)
  5. Map it (being mindful of cost/benefit)

Which one are you going to choose?

October 22, 2012

HBase Futures

Filed under: Hadoop,HBase,Hortonworks,Semantics — Patrick Durusau @ 2:28 pm

HBase Futures by Devaraj Das.

From the post:

As we have said here, Hortonworks has been steadily increasing our investment in HBase. HBase’s adoption has been increasing in the enterprise. To continue this trend, we feel HBase needs investments in the areas of:

  1. Reliability and High Availability (all data always available, and recovery from failures is quick)
  2. Autonomous operation (minimum operator intervention)
  3. Wire compatibility (to support rolling upgrades across a couple of versions at least)
  4. Cross data-center replication (for disaster recovery)
  5. Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
  6. Monitoring and Diagnostics (which regionserver is hot or what caused an outage)

Probably just a personal prejudice but I would have mentioned semantics in that list.

You?

Searching Big Data’s Open Source Roots

Filed under: BigData,Hadoop,Lucene,LucidWorks,Mahout,Open Source,Solr — Patrick Durusau @ 1:56 pm

Searching Big Data’s Open Source Roots by Nicole Hemsoth.

Nicole talks to Grant Ingersoll, Chief Scientist at LucidWorks, about the open source roots of big data.

No technical insights but a nice piece to pass along to the c-suite. Investment in open source projects can pay rich dividends. So long as you don’t need them next quarter. 😉

And a snapshot of where we are now, which is on the brink of new tools and capabilities in search technologies.

HBase at Hortonworks: An Update [Features, Consumer Side?]

Filed under: Hadoop,HBase,Hortonworks — Patrick Durusau @ 3:37 am

HBase at Hortonworks: An Update by Devaraj Das.

From the post:

HBase is a critical component of the Apache Hadoop ecosystem and a core component of the Hortonworks Data Platform. HBase enables a host of low latency Hadoop use-cases; As a publishing platform, HBase exposes data refined in Hadoop to outside systems; As an online column store, HBase supports the blending of random access data read/write with application workloads whose data is directly accessible to Hadoop MapReduce.

The HBase community is moving forward aggressively, improving HBase in many ways. We are in the process of integrating HBase 0.94 into our upcoming HDP 1.1 refresh. This “minor upgrade” will include a lot of bug fixes (nearly 200 in number) and quite a few performance improvements and will be wire compatible with HBase 0.92 (in HDP 1.0).

The post concludes:

All of the above is just what we’ve been doing recently and Hortonworkers are only a small fraction of the HBase contributor base. When one factors in all the great contributions coming from across the Apache HBase community, we predict 2013 is going to be a great year for HBase. HBase is maturing fast, becoming both more operationally reliable and more feature rich.

When a technical infrastructure becomes “feature rich,” can “features” for consumer services/interfaces be far behind?

Delivering location-based coupons for lattes on a cellphone may seem like a “feature.” But we can do that with a man wearing a sandwich board.

A “feature” for the consumer needs to be more than digital imitation of an analog capability.

What consumer “feature(s)” would you offer based on new features in HBase?

