Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 28, 2012

Stratosphere

Filed under: BigData,MapReduce — Patrick Durusau @ 4:39 pm

Stratosphere

I saw a tweet from Stratosphere today saying: “20 secs per iteration for PageRank on a billion scale graph using #stratosphere’s iterative data flows.” Enough to get me to look further! 😉

Tracking the source of the tweet, I found the homepage of Stratosphere and there read:

Stratosphere is a DFG-funded research project investigating “Information Management on the Cloud” and creating the Stratosphere System for Big Data Analytics. The current openly released version is 0.2 with many new features and enhancements for usability, robustness, and performance. See the Change Log for a complete list of new features.

What is the Stratosphere System?

The Stratosphere System is an open-source cluster/cloud computing framework for Big Data analytics. It comprises a rich stack of components with different programming abstractions for complex analytics tasks:

  1. An extensible higher level language (Meteor) to quickly compose queries for common and recurring use cases. Internally, Meteor scripts are translated into Sopremo algebra and optimized.
  2. A parallel programming model (PACT, an extension of MapReduce) to run user-defined operations. PACT is based on second-order functions and features an optimizer that chooses parallelization strategies.
  3. An efficient massively parallel runtime (Nephele) for fault tolerant execution of acyclic data flows.

Stratosphere is open source under the Apache License, Version 2.0. Feel free to download it, try it out and give feedback or ask for help on our mailing lists.

Meteor Language

Meteor is a textual higher-level language for rapid composition of queries. It uses a JSON-like data model and features at its core typical operations for the analysis and transformation of (semi-)structured nested data.

The Meteor language is highly extensible and supports the addition of custom operations that integrate fluently with the syntax, in order to create problem-specific domain languages. Meteor queries are translated into Sopremo algebra, optimized, and transformed into PACT programs by the compiler.

PACT Programming Model

The PACT programming model is an extension of the well known MapReduce programming model. PACT features a richer set of second-order functions (Map/Reduce/Match/CoGroup/Cross) that can be flexibly composed as DAGs into programs. PACT programs use a generic schema-free tuple data model to ease composition of more complex programs.

PACT programs are parallelized by a cost-based compiler that picks data shipping and local processing strategies such that network- and disk I/O is minimized. The compiler incorporates user code properties (when possible) to find better plans; it thus alleviates the need for many manual optimizations (such as job merging) that one typically does to create efficient MapReduce programs. Compiled PACT programs are executed by the Nephele Data Flow Engine.
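To make the contrast with plain MapReduce concrete, here is a toy, single-process Python sketch of what the five second-order functions do to keyed tuples. It is my own illustration, not Stratosphere's actual Java API (see the PACT documentation for that):

    # Conceptual sketch only: NOT the Stratosphere/PACT Java API.
    # Plain-Python versions of the five second-order functions, to show
    # how Match, CoGroup and Cross extend the familiar Map/Reduce pair.
    from collections import defaultdict
    from itertools import product

    def pact_map(fn, records):
        # fn(record) -> iterable of output records
        return [out for rec in records for out in fn(rec)]

    def pact_reduce(fn, records):
        # group by key (first field), then fn(key, values) -> iterable of records
        groups = defaultdict(list)
        for key, value in records:
            groups[key].append(value)
        return [out for key, values in groups.items() for out in fn(key, values)]

    def pact_match(fn, left, right):
        # join: call fn once per pair of left/right records sharing a key
        index = defaultdict(list)
        for key, value in right:
            index[key].append(value)
        return [out for key, lval in left for rval in index.get(key, [])
                for out in fn(key, lval, rval)]

    def pact_cogroup(fn, left, right):
        # call fn once per key with *all* left values and *all* right values
        lgroups, rgroups = defaultdict(list), defaultdict(list)
        for key, value in left:
            lgroups[key].append(value)
        for key, value in right:
            rgroups[key].append(value)
        return [out for key in set(lgroups) | set(rgroups)
                for out in fn(key, lgroups[key], rgroups[key])]

    def pact_cross(fn, left, right):
        # Cartesian product of the two inputs
        return [out for a, b in product(left, right) for out in fn(a, b)]

    if __name__ == "__main__":
        clicks = [("u1", "pageA"), ("u2", "pageB"), ("u1", "pageC")]
        users = [("u1", "Alice"), ("u2", "Bob")]
        # join clicks with user names, then count clicks per user name
        joined = pact_match(lambda k, page, name: [(name, page)], clicks, users)
        counts = pact_reduce(lambda name, pages: [(name, len(pages))], joined)
        print(counts)  # [('Alice', 2), ('Bob', 1)]

Match and CoGroup are where the extension earns its keep: joins and grouped joins become first-class operations instead of hand-rolled reduce-side tricks, and the optimizer gets to pick how each one is parallelized.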

Nephele Data Flow Engine

Nephele is a massively parallel data flow engine dealing with resource management, work scheduling, communication, and fault tolerance. Nephele can run on top of a cluster and govern the resources itself, or directly connect to an IaaS cloud service to allocate computing resources on demand.

Another big data contender!

Computational Finance with Map-Reduce in Scala [Since Quants Have Funding]

Filed under: Finance Services,MapReduce,Scala — Patrick Durusau @ 5:48 am

Computational Finance with Map-Reduce in Scala by Ron Coleman, Udaya Ghattamaneni, Mark Logan, and Alan Labouseur. (PDF)

Assuming the computations performed by quants are semantically homogeneous (a big assumption), the sources of their data and the applications of the outcomes are not.

The clients of quants aren’t interested in you humming “…it’s a big world after all…,” etc. They are interested in the furtherance of their financial operations.

Using topic maps to make an already effective tool more effective is the most likely way to capture their interest. (Short of taking hostages.)

I first saw this in a tweet by Data Science London.

November 11, 2012

Cloudant Labs on Foundational MapReduce Literature

Filed under: MapReduce — Patrick Durusau @ 8:22 pm

Cloudant Labs on Foundational MapReduce Literature by Mike Miller.

From the post:

MapReduce is an incredibly powerful algorithm, especially when used to process large amounts of data using distributed systems of commodity hardware. It makes data processing in big, distributed, fault prone systems approachable for the typical developer. Its recent renaissance was one of the inspirations for the Cloudant product line. In fact, it helped inspire the creation of the company itself. MapReduce has also come under recent scrutiny. There are multiple implementations with specific strengths and weaknesses. It is not the best tool for all jobs, and I’ve been outspoken on that exact issue. However, it is an incredibly powerful tool if applied judiciously. It is also an extremely simple concept and a fantastic first step into the world of distributed computing. I am therefore surprised at how few people have read a small selection of enlightening publications on MapReduce. I provide below a subjective selection that I have found particularly informative, as well as the main concepts I drew from each selection.

A nice selection of readings on MapReduce.

Enjoy!

November 5, 2012

The Week in Big Data Research [November 3, 2012]

Filed under: BigData,Hadoop,MapReduce — Patrick Durusau @ 11:22 am

The Week in Big Data Research from Datanami.

A new feature from Datanami that highlights academic research on big data.

In last Friday’s post you will find:

MapReduce-Based Data Stream Processing over Large History Data

Abstract:

With the development of Internet of Things applications based on sensor data, how to process high-speed data streams over large-scale history data poses a new challenge. This paper proposes a new programming model, RTMR, which improves the real-time capability of traditional batch-processing-based MapReduce by preprocessing and caching, along with pipelining and localizing. Furthermore, to adapt the topologies to application characteristics and cluster environments, a model-analysis-based RTMR cluster constructing method is proposed. The benchmark built on the urban vehicle monitoring system shows RTMR can provide the real-time capability and scalability needed for data stream processing over large-scale data.


Mastiff: A MapReduce-based System for Time-Based Big Data Analytics

Abstract:

Existing MapReduce-based warehousing systems are not specially optimized for time-based big data analysis applications. Such applications have two characteristics: 1) data are continuously generated and are required to be stored persistently for a long period of time, 2) applications usually process data in some time period so that typical queries use time-related predicates. Time-based big data analytics requires both high data loading speed and high query execution performance. However, existing systems including current MapReduce-based solutions do not solve this problem well because the two requirements are contradictory. We have implemented a MapReduce-based system, called Mastiff, which provides a solution to achieve both high data loading speed and high query performance. Mastiff exploits a systematic combination of a column group store structure and a lightweight helper structure. Furthermore, Mastiff uses an optimized table scan method and a column-based query execution engine to boost query performance. Based on extensive experimental results with diverse workloads, we will show that Mastiff can significantly outperform existing systems including Hive, HadoopDB, and GridSQL.


Fast Data Analysis with Integrated Statistical Metadata in Scientific Datasets

Abstract:

Scientific datasets, such as HDF5 and PnetCDF, have been used widely in many scientific applications. These data formats and libraries provide essential support for data analysis in scientific discovery and innovations. In this research, we present an approach to boost data analysis, namely Fast Analysis with Statistical Metadata (FASM), via data subsetting and integrating a small amount of statistics into datasets. We discuss how the FASM can improve data analysis performance. It is currently evaluated with the PnetCDF on synthetic and real data, but can also be implemented in other libraries. The FASM can potentially lead to a new dataset design and can have an impact on data analysis.


MapReduce Performance Evaluation on a Private HPC Cloud

Abstract:

The convergence of accessible cloud computing resources and big data trends has introduced unprecedented opportunities for scientific computing and discovery. However, HPC cloud users face many challenges when selecting valid HPC configurations. In this paper, we report a set of performance evaluations of data intensive benchmarks on a private HPC cloud to help with the selection of such configurations. More precisely, we study the effect of virtual machine core-count on the performance of 3 benchmarks widely used by the MapReduce community. We notice that depending on the computation to communication ratios of the studied applications, using higher core-count virtual machines does not always lead to higher performance for data-intensive applications.


I manage to visit Datanami once or twice a day. Usually not for as long as I should. 😉 Visit; I think you will be pleasantly surprised.

PS: You will be seeing some of these articles in separate posts. Thought the cutting/bleeding edge types would like notice sooner rather than later.

November 4, 2012

Atepassar Recommendations [social network recommender]

Filed under: MapReduce,Python,Recommendation — Patrick Durusau @ 3:56 pm

Atepassar Recommendations: Recommending friends with MapReduce and Python by Marcel Caraciolo.

From the post:

In this post I will present one of the techniques used at Atépassar, a Brazilian social network that helps students around Brazil prepare for civil service exams: our recommender system.

(graphic omitted)

I will describe some of the data models that we use and discuss our approach to algorithmic innovation that combines offline machine learning with online testing. For this task we use distributed computing since we deal with over 140 thousand users. MapReduce is a powerful technique and we use it by writing Python code with the MrJob framework. I recommend you read further about it in my last post here.

One of our recommender techniques is the simple ‘people you might know’ recommender algorithm. Indeed, there are several components behind the algorithm since at Atépassar, users can follow other people as well as be followed by other people. In this post I will talk about the basic idea of the algorithm, which can be adapted for those other components. The idea of the algorithm is that if person A and person B do not know each other but have a lot of mutual friends, then the system should recommend that they connect with each other.
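To make the mutual-friends idea concrete, here is a toy, single-process Python sketch of my own (invented names; not the Atépassar/MrJob code):

    # Minimal simulation of the "people you might know" idea described above:
    # count mutual friends, then recommend pairs that are not yet connected.
    # Toy data and names; not the Atepassar implementation.
    from collections import defaultdict
    from itertools import combinations

    friends = {
        "ana": {"bia", "carlos", "duda"},
        "bia": {"ana", "carlos"},
        "carlos": {"ana", "bia", "duda"},
        "duda": {"ana", "carlos"},
    }

    # "map" step: every pair of one user's friends shares that user as a mutual friend
    mutual = defaultdict(int)
    for user, their_friends in friends.items():
        for a, b in combinations(sorted(their_friends), 2):
            mutual[(a, b)] += 1

    # "reduce" step: keep pairs that are not already connected, ranked by count
    recommendations = sorted(
        ((pair, count) for pair, count in mutual.items()
         if pair[1] not in friends[pair[0]]),
        key=lambda item: -item[1])
    print(recommendations)  # [(('bia', 'duda'), 2)]

In an MrJob version, the first loop would become the mapper (emitting candidate pairs) and the second the reducer (summing counts and filtering existing connections), so the same logic can scale out across a cluster.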

Is there a presumption in social recommendation programs that there are no duplicate people in the network? Using different names? If two people have exactly the same friends, is there some chance they could be the same person?

How many “same” friends would you require? 20? 30? 50? Some other number?

Curious because determining personal identity and identity of the people behind two or more entries, may be a matter of pattern matching.

BTW, this is an interesting-looking blog. You may want to browse older entries or even subscribe.

October 31, 2012

One To Watch: Apache Crunch

Filed under: Apache Crunch,Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:37 pm

One To Watch: Apache Crunch by Chris Mayer.

From the post:

Over the past few years, the Apache Software Foundation has become the hub for big data-focused projects. An array of companies have recognised the worth of housing their latest innovative projects at the ASF, with Apache Hadoop and Apache Cassandra two shining examples.

Amongst the number of projects arriving in the Apache Incubator was Apache Crunch. Crunch is a Java library created to eliminate the tedium of writing a MapReduce pipeline. It aims to take hold of the entire process, making writing, testing, and running MapReduce pipelines more efficient and “even fun” (if this Cloudera blog post is to be believed).

That’s a tall order, to make MapReduce pipelines “even fun.” On the other hand, remarkable things have emerged from Apache for decades now.

A project to definitely keep in sight.

October 29, 2012

Top 5 Challenges for Hadoop MapReduce… [But Semantics Isn’t One Of Them]

Filed under: Hadoop,MapReduce — Patrick Durusau @ 1:46 pm

Top 5 Challenges for Hadoop MapReduce in the Enterprise

IBM sponsored content at Datanami.com lists these challenges for Hadoop MapReduce in enterprise settings:

  • Lack of performance and scalability….
  • Lack of flexible resource management….
  • Lack of application deployment support….
  • Lack of quality of service assurance….
  • Lack of multiple data source support….

Who would know enterprise requirements better than IBM? They have been in the enterprise business long enough to be an enterprise themselves.

If IBM says these are the top 5 challenges for Hadoop MapReduce in enterprises, it’s a good list.

But I don’t see “semantics” in that list.

Do you?

Semantics make it possible to combine data from different sources, process it and report a useful answer.

Or rather understanding data semantics and mapping between them makes a useful answer possible.

Try pushing data from different sources together without understanding and mapping their semantics.

It won’t take long for you to decide which way you prefer.
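A toy Python contrast, with invented field names, of what that looks like in practice:

    # Two sources describing the same customer with different vocabularies.
    # Field names and values are invented for illustration.
    source_a = [{"cust_id": "C-17", "zip": "30303", "spend": 120.0}]
    source_b = [{"customer_number": "C-17", "postal_code": "30303", "total": 80.0}]

    # Pushed together without mapping their semantics, the records just sit
    # side by side as two unrelated-looking rows:
    naive = source_a + source_b

    # With an explicit mapping of source B's vocabulary onto source A's,
    # the records can be merged on the shared subject:
    field_map = {"customer_number": "cust_id", "postal_code": "zip", "total": "spend"}
    normalized_b = [{field_map[k]: v for k, v in row.items()} for row in source_b]

    merged = {}
    for row in source_a + normalized_b:
        entry = merged.setdefault(row["cust_id"], {"cust_id": row["cust_id"], "spend": 0.0})
        entry["zip"] = row["zip"]
        entry["spend"] += row["spend"]

    print(len(naive), len(merged))  # 2 unmerged rows vs. 1 merged customer
    print(merged)  # {'C-17': {'cust_id': 'C-17', 'spend': 200.0, 'zip': '30303'}}

The mapping dictionary is the part that takes human judgment; everything after it is routine.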

If semantics are critical to any data operation, including combining data from diverse sources, why do they get so little attention?

I doubt your IBM representative would know, but you could ask them while trying out the IBM solution to the “top 5 challenges for Hadoop MapReduce”:

How should you discover and then map the semantics of diverse data sources?

Having mapped them once, can you re-use that mapping for future projects with the IBM solution?

October 27, 2012

Designing good MapReduce algorithms

Filed under: Algorithms,BigData,Hadoop,MapReduce — Patrick Durusau @ 6:28 pm

Designing good MapReduce algorithms by Jeffrey D. Ullman.

From the introduction:

If you are familiar with “big data,” you are probably familiar with the MapReduce approach to implementing parallelism on computing clusters [1]. A cluster consists of many compute nodes, which are processors with their associated memory and disks. The compute nodes are connected by Ethernet or switches so they can pass data from node to node.

Like any other programming model, MapReduce needs an algorithm-design theory. The theory is not just the theory of parallel algorithms; MapReduce requires we coordinate parallel processes in a very specific way. A MapReduce job consists of two functions written by the programmer, plus some magic that happens in the middle:

  1. The Map function turns each input element into zero or more key-value pairs. A “key” in this sense is not unique, and it is in fact important that many pairs with a given key are generated as the Map function is applied to all the input elements.
  2. The system sorts the key-value pairs by key, and for each key creates a pair consisting of the key itself and a list of all the values associated with that key.
  3. The Reduce function is applied, for each key, to its associated list of values. The result of that application is a pair consisting of the key and whatever is produced by the Reduce function applied to the list of values. The output of the entire MapReduce job is what results from the application of the Reduce function to each key and its list.

When we execute a MapReduce job on a system like Hadoop [2], some number of Map tasks and some number of Reduce tasks are created. Each Map task is responsible for applying the Map function to some subset of the input elements, and each Reduce task is responsible for applying the Reduce function to some number of keys and their associated lists of values. The arrangement of tasks and the key-value pairs that communicate between them is suggested in Figure 1. Since the Map tasks can be executed in parallel and the Reduce tasks can be executed in parallel, we can obtain an almost unlimited degree of parallelism, provided there are many compute nodes for executing the tasks, there are many keys, and no one key has an unusually long list of values.

A very important feature of the Map-Reduce form of parallelism is that tasks have the blocking property [3]; that is, no Map or Reduce task delivers any output until it has finished all its work. As a result, if a hardware or software failure occurs in the middle of a MapReduce job, the system has only to restart the Map or Reduce tasks that were located at the failed compute node. The blocking property of tasks is essential to avoid restart of a job whenever there is a failure of any kind. Since Map-Reduce is often used for jobs that require hours on thousands of compute nodes, the probability of at least one failure is high, and without the blocking property large jobs would never finish.

There is much more to the technology of MapReduce. You may wish to consult a free online text that covers MapReduce and a number of its applications [4].

Warning: This article may change your interest in the design of MapReduce algorithms.

Ullman’s stories of algorithm tradeoffs provide motivation to evaluate (or reevaluate) your own design tradeoffs.
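If you want to see Ullman's three steps in miniature, here is a pure-Python simulation (toy data, maximum temperature per station; not code from the article):

    # Single-process simulation of the three steps listed above:
    # Map, group-by-key, Reduce. Toy data, not from Ullman's article.
    from collections import defaultdict

    readings = ["station1,20", "station2,35", "station1,31", "station2,28"]

    def map_fn(element):
        # Step 1: each input element -> zero or more key-value pairs
        station, temp = element.split(",")
        yield station, int(temp)

    def reduce_fn(key, values):
        # Step 3: (key, list of values) -> an output pair
        return key, max(values)

    # Step 2: the system groups all values by key
    grouped = defaultdict(list)
    for element in readings:
        for key, value in map_fn(element):
            grouped[key].append(value)

    output = [reduce_fn(key, values) for key, values in grouped.items()]
    print(output)  # [('station1', 31), ('station2', 35)]

Step 2 is the "magic in the middle": in Hadoop it is the shuffle and sort, and its communication cost is usually where the interesting design tradeoffs live.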

October 19, 2012

Situational Aware Mappers with JAQL

Filed under: Hadoop,JAQL,MapReduce — Patrick Durusau @ 3:32 pm

Situational Aware Mappers with JAQL

From the post:

Adapting MapReduce for higher performance has been one of the popular discussion topics. Let’s continue with our series on Adaptive MapReduce and explore the feature available via JAQL in the IBM BigInsights commercial offering. This implementation also points to a much more vital corollary: that enterprise offerings of Apache Hadoop are not just mere packaging and reselling but have a bigger research initiative going on beneath the covers.

Two papers are explored by the post:

[1] Rares Vernica, Andrey Balmin, Kevin S. Beyer, Vuk Ercegovac: Adaptive MapReduce using situation-aware mappers. EDBT 2012: 420-431

Abstract:

We propose new adaptive runtime techniques for MapReduce that improve performance and simplify job tuning. We implement these techniques by breaking a key assumption of MapReduce that mappers run in isolation. Instead, our mappers communicate through a distributed meta-data store and are aware of the global state of the job. However, we still preserve the fault-tolerance, scalability, and programming API of MapReduce. We utilize these “situation-aware mappers” to develop a set of techniques that make MapReduce more dynamic: (a) Adaptive Mappers dynamically take multiple data partitions (splits) to amortize mapper start-up costs; (b) Adaptive Combiners improve local aggregation by maintaining a cache of partial aggregates for the frequent keys; (c) Adaptive Sampling and Partitioning sample the mapper outputs and use the obtained statistics to produce balanced partitions for the reducers. Our experimental evaluation shows that adaptive techniques provide up to 3x performance improvement, in some cases, and dramatically improve performance stability across the board.

[2] Andrey Balmin, Vuk Ercegovac, Rares Vernica, Kevin S. Beyer: Adaptive Processing of User-Defined Aggregates in Jaql. IEEE Data Eng. Bull. 34(4): 36-43 (2011)

Abstract:

Adaptive techniques can dramatically improve performance and simplify tuning for MapReduce jobs. However, their implementation often requires global coordination between map tasks, which breaks a key assumption of MapReduce that mappers run in isolation. We show that it is possible to preserve fault-tolerance, scalability, and ease of use of MapReduce by allowing map tasks to utilize a limited set of high-level coordination primitives. We have implemented these primitives on top of an open source distributed coordination service. We expose adaptive features in a high-level declarative query language, Jaql, by utilizing unique features of the language, such as higher-order functions and physical transparency. For instance, we observe that maintaining a small amount of global state could help improve performance for a class of aggregate functions that are able to limit the output based on a global threshold. Such algorithms arise, for example, in Top-K processing, skyline queries, and exception handling. We provide a simple API that facilitates safe and efficient development of such functions.

The bar for excellence in the use of Hadoop keeps getting higher!
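For a concrete picture of one of those techniques, here is a minimal Python sketch of the "Adaptive Combiner" idea from abstract [1]: a bounded mapper-side cache of partial aggregates, flushed downstream at the end. Illustrative only, not the BigInsights/Jaql implementation:

    # Sketch of the "Adaptive Combiner" idea from [1]: keep a bounded cache of
    # partial aggregates for frequent keys inside the mapper and flush it at
    # the end. Illustrative only; not the BigInsights/Jaql code.
    class AdaptiveCombiner:
        def __init__(self, emit, cache_limit=1000):
            self.emit = emit              # downstream emit(key, value) callback
            self.cache_limit = cache_limit
            self.cache = {}

        def add(self, key, value):
            if key in self.cache:
                self.cache[key] += value  # combine locally for frequent keys
            elif len(self.cache) < self.cache_limit:
                self.cache[key] = value
            else:
                self.emit(key, value)     # cache full: pass through uncombined

        def flush(self):
            for key, value in self.cache.items():
                self.emit(key, value)
            self.cache.clear()

    if __name__ == "__main__":
        out = []
        combiner = AdaptiveCombiner(emit=lambda k, v: out.append((k, v)), cache_limit=2)
        for word in "to be or not to be".split():
            combiner.add(word, 1)
        combiner.flush()
        print(out)  # [('or', 1), ('not', 1), ('to', 2), ('be', 2)]

In the paper the cache targets the frequent keys specifically; the fixed-size cache here is just the simplest way to show the shape of the idea.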

October 4, 2012

Adapting MapReduce for realtime apps

Filed under: Hadoop,MapReduce — Patrick Durusau @ 4:28 pm

Adapting MapReduce for realtime apps

From the post:

As popular as MapReduce is, there is just as much discussion about making it even better, moving from a generalized approach to a more performance-oriented one. We will be discussing a few frameworks which have tried to adapt MapReduce for higher performance.

The first post in this series will discuss AMREF, an Adaptive MapReduce Framework designed for real time data intensive applications. (published in the paper Fan Zhang, Junwei Cao, Xiaolong Song, Hong Cai, Cheng Wu: AMREF: An Adaptive MapReduce Framework for Real Time Applications. GCC 2010: 157-162.)

If you are interested in squeezing more performance out of MapReduce, this looks like a good starting place.

October 1, 2012

Disco [Erlang/Python – MapReduce]

Filed under: Disco,Erlang,MapReduce,Python — Patrick Durusau @ 9:16 am

Disco

From the webpage:

Disco is a distributed computing framework based on the MapReduce paradigm. Disco is open-source; developed by Nokia Research Center to solve real problems in handling massive amounts of data.

Disco is powerful and easy to use, thanks to Python. Disco distributes and replicates your data, and schedules your jobs efficiently. Disco even includes the tools you need to index billions of data points and query them in real-time.

Install Disco on your laptop, cluster or cloud of choice and become a part of the Disco community!

I rather like the MapReduce graphic you will see at About.

I first saw this in Guido Kollerie’s post on the recent Python users meeting in the Netherlands. Guido details his 5 minute presentation on Disco.
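To give a flavor of the Python API, here is a word-count job along the lines of the Disco project's own introductory example (it assumes a running Disco master, and signatures may vary between versions, so check the current docs):

    # Word count with Disco, adapted from the project's introductory example.
    # Assumes a Disco master is running and reachable from this machine.
    from disco.core import Job, result_iterator

    def fun_map(line, params):
        # emit (word, 1) for every word on the input line
        for word in line.split():
            yield word, 1

    def fun_reduce(iter, params):
        # sum the counts for each word
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == "__main__":
        job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                        map=fun_map,
                        reduce=fun_reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print(word, count)

The pleasant part is that the map and reduce functions are ordinary Python; Disco handles distributing the work and collecting the results.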

September 29, 2012

Hadoop as Java Ecosystem “MVP”

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:12 pm

Apache Hadoop Wins Duke’s Choice Award, is a Java Ecosystem “MVP” by Justin Kestelyn.

From the post:

For those of you new to it, the Duke’s Choice Awards program was initiated by Sun Microsystems in 2002 in an effort to “celebrate extreme innovation in the world of Java technology” – in essence, it’s the “MVP” of the Java ecosystem. Since it acquired Sun in 2009, Oracle has continued the tradition of bestowing the award, and in fact has made the process more community-oriented by accepting nominations from the public and involving Java User Groups in the judging effort.

For the 2012 awards, I’m happy to report that Apache Hadoop is among the awardees - which also include the United Nations High Commission for Refugees, Liquid Robotics, and Java cloud company Jelastic Inc., among others.

Very cool!

Kudos to the Apache Hadoop project!

September 27, 2012

Searching and Accessing Data in Riak (overview slides)

Filed under: MapReduce,Riak — Patrick Durusau @ 3:22 pm

Searching and Accessing Data in Riak by Andy Gross and Shanley Kane.

From the description:

An overview of methods for searching and aggregating data in Riak, covering Riak Search, secondary indexes and MapReduce. Reviews use cases and features for each method, when to use which, and the limitations and advantages of each approach. In addition, it covers query examples and the high-level architecture of each method.

If you are already familiar with search/access to data in Riak, you won’t find anything new here.

It would be useful to have some topic map specific examples written using Riak.

Sing out if you decide to pursue that train of thought.

September 26, 2012

Using information retrieval technology for a corpus analysis platform

Filed under: Corpora,Corpus Linguistics,Information Retrieval,Lucene,MapReduce — Patrick Durusau @ 3:57 pm

Using information retrieval technology for a corpus analysis platform by Carsten Schnober.

Abstract:

This paper describes a practical approach to use the information retrieval engine Lucene for the corpus analysis platform KorAP, currently being developed at the Institut für Deutsche Sprache (IDS Mannheim). It presents a method to use Lucene’s indexing technique and to exploit it for linguistically annotated data, allowing full flexibility to handle multiple annotation layers. It uses multiple indexes and MapReduce techniques in order to keep KorAP scalable.

The support for multiple annotation layers is of particular interest to me because the “subjects” of interest in a text may vary from one reader to another.

Being mindful that for topic maps, the annotation layers and annotations themselves may be subjects for some purposes.

September 25, 2012

Location Sensitive Hashing in Map Reduce

Filed under: Hadoop,MapReduce — Patrick Durusau @ 3:23 pm

Location Sensitive Hashing in Map Reduce by Ricky Ho.

From the post:

Inspired by Dr. Gautam Shroff, who teaches the class Web Intelligence and Big Data on coursera.org: there are many scenarios where we want to compute similarity between large amounts of items (e.g. photos, products, persons, resumes, etc.). I want to add another algorithm to my Map/Reduce algorithm catalog.

For background on the Map/Reduce implementation on Hadoop, I have a previous post that covers the details.

“Location” here is not used in the geographic sense but as a general measure of distance. Could be geographic, but could be some other measure of location as well.
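For the curious, here is a minimal single-machine Python sketch of the minhash-plus-banding flavor of LSH (my own toy example, not Ricky Ho's Map/Reduce implementation):

    # Toy sketch of locality/location sensitive hashing via minhash and banding:
    # items whose sets overlap heavily tend to collide in at least one band.
    # Illustrative only; not Ricky Ho's Map/Reduce version.
    import random
    import zlib

    random.seed(42)
    NUM_HASHES, BAND_SIZE = 20, 2             # 10 bands of 2 minhash values each
    MASKS = [random.getrandbits(32) for _ in range(NUM_HASHES)]

    def minhash_signature(items):
        # one minimum per masked hash function
        return [min(zlib.crc32(x.encode()) ^ mask for x in items) for mask in MASKS]

    def lsh_buckets(named_sets):
        # "map" step: emit (band index, band signature) -> item name
        buckets = {}
        for name, items in named_sets.items():
            sig = minhash_signature(items)
            for band in range(0, NUM_HASHES, BAND_SIZE):
                key = (band, tuple(sig[band:band + BAND_SIZE]))
                buckets.setdefault(key, set()).add(name)
        return buckets

    if __name__ == "__main__":
        profiles = {
            "resume_a": {"python", "hadoop", "mapreduce", "hive", "pig"},
            "resume_b": {"python", "hadoop", "mapreduce", "hive", "sql"},
            "resume_c": {"golf", "sailing", "cooking"},
        }
        # "reduce" step: any bucket holding 2+ names is a candidate similar pair
        candidates = {frozenset(names) for names in lsh_buckets(profiles).values()
                      if len(names) > 1}
        print(candidates)  # almost always {frozenset({'resume_a', 'resume_b'})}

In a Map/Reduce version the bucket key becomes the intermediate key, so candidate pairs for the expensive exact comparison land on the same reducer.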

September 15, 2012

MapReduce is Good Enough?… [How to Philosophize with a Hammer?]

Filed under: BigData,Hadoop,MapReduce — Patrick Durusau @ 2:50 pm

MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! by Jimmy Lin.

Abstract:

Hadoop is currently the large-scale data analysis “hammer” of choice, but there exist classes of algorithms that aren’t “nails”, in the sense that they are not particularly amenable to the MapReduce programming model. To address this, researchers have proposed MapReduce extensions or alternative programming models in which these algorithms can be elegantly expressed. This essay espouses a very different position: that MapReduce is “good enough”, and that instead of trying to invent screwdrivers, we should simply get rid of everything that’s not a nail. To be more specific, much discussion in the literature surrounds the fact that iterative algorithms are a poor fit for MapReduce: the simple solution is to find alternative non-iterative algorithms that solve the same problem. This essay captures my personal experiences as an academic researcher as well as a software engineer in a “real-world” production analytics environment. From this combined perspective I reflect on the current state and future of “big data” research.

Following the abstract:

Author’s note: I wrote this essay specifically to be controversial. The views expressed herein are more extreme than what I believe personally, written primarily for the purposes of provoking discussion. If after reading this essay you have a strong reaction, then I’ve accomplished my goal 🙂

The author needs to work on being “controversial.” He gives away the pose “throw away everything not a nail” far too early and easily.

Without the warnings, flashing lights, etc, the hyperbole might be missed, but not by anyone who would benefit from the substance of the paper.

The paper reflects careful thought on MapReduce and its limitations. Merits a careful and close reading.

I first saw this mentioned by John D. Cook.

September 12, 2012

Cloudera Enterprise in Less Than Two Minutes

Filed under: Cloud Computing,Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:10 pm

Cloudera Enterprise in Less Than Two Minutes by Justin Kestelyn.

I had to pause “Born Under A Bad Sign” by Cream to watch the video but it was worth it!

Good example of selling technique too!

Focused on common use cases and user concerns. Promises a solution without all the troublesome details.

Time enough for that after a prospect is interested. And even then, ease them into the details.

September 11, 2012

Analyzing Big Data with Twitter

Filed under: BigData,CS Lectures,MapReduce,Pig — Patrick Durusau @ 3:39 am

Analyzing Big Data with Twitter

Not really with Twitter but with tools sponsored/developed/used by Twitter. Lecture series at the UC Berkeley School of Information.

Videos of lectures are posted online.

Check out the syllabus for assignments and current content.

Four (4) lectures so far!

  • Big Data Analytics with Twitter – Marti Hearst & Gilad Mishne. Introduction to Twitter in general.
  • Twitter Philosophy and Software Architecture – Othman Laraki & Raffi Krikorian.
  • Introduction to Hadoop – Bill Graham.
  • Apache Pig – Jon Coveney
  • … more to follow.

September 10, 2012

Automating Your Cluster with Cloudera Manager API

Filed under: Clustering (servers),HDFS,MapReduce — Patrick Durusau @ 3:01 pm

Automating Your Cluster with Cloudera Manager API

From the post:

API access was a new feature introduced in Cloudera Manager 4.0 (download the free edition here). Although not visible in the UI, this feature is very powerful, providing programmatic access to cluster operations (such as configuration and restart) and monitoring information (such as health and metrics). This article walks through an example of setting up a 4-node HDFS and MapReduce cluster via the Cloudera Manager (CM) API.

Cloudera Manager API Basics

The CM API is an HTTP REST API, using JSON serialization. The API is served on the same host and port as the CM web UI, and does not require an extra process or extra configuration. The API supports HTTP Basic Authentication, accepting the same users and credentials as the Web UI. API users have the same privileges as they do in the web UI world.

You can read the full API documentation here.

We are nearing mid-September so the holiday season will be here before long. It isn’t too early to start planning on price/hardware break points.

This will help you configure an HDFS and MapReduce cluster on your holiday hardware.
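As a starting point, here is a hedged Python sketch of calling the API with the requests library. The endpoint path below is the v1 clusters listing as I recall it; confirm it, along with the host, port and credentials, against the documentation linked above:

    # Hedged sketch: list clusters via the Cloudera Manager REST API.
    # Host, port, credentials and the exact endpoint path are assumptions;
    # check them against the CM API documentation.
    import requests

    CM_BASE = "http://cm-host.example.com:7180"   # hypothetical host and port
    AUTH = ("admin", "admin")                     # same credentials as the web UI

    def cm_get(path):
        # the API uses HTTP Basic auth and JSON, served on the web UI's port
        response = requests.get(CM_BASE + path, auth=AUTH)
        response.raise_for_status()
        return response.json()

    if __name__ == "__main__":
        clusters = cm_get("/api/v1/clusters")
        for cluster in clusters.get("items", []):
            print(cluster.get("name"), cluster.get("version"))

The same pattern with PUT/POST and a JSON body drives configuration changes and restarts, which is what makes scripting a whole 4-node setup practical.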

September 6, 2012

Meet the Committer, Part One: Alan Gates

Filed under: Hadoop,Hortonworks,MapReduce,Pig — Patrick Durusau @ 7:52 pm

Meet the Committer, Part One: Alan Gates by Kim Truong.

From the post:

Series Introduction

Hortonworks is on a mission to accelerate the development and adoption of Apache Hadoop. Through engineering open source Hadoop, our efforts with our distribution, Hortonworks Data Platform (HDP), a 100% open source data management platform, and partnerships with the likes of Microsoft, Teradata, Talend and others, we will accomplish this, one installation at a time.

What makes this mission possible is our all-star team of Hadoop committers. In this series, we’re going to profile those committers, to show you the face of Hadoop.

Alan Gates, Apache Pig and HCatalog Committer

Education is a key component of this mission. Helping companies gain a better understanding of the value of Hadoop through transparent communications of the work we’re doing is paramount. In addition to explaining core Hadoop projects (MapReduce and HDFS) we also highlight significant contributions to other ecosystem projects including Apache Ambari, Apache HCatalog, Apache Pig and Apache Zookeeper.

Alan Gates is a leader in our Hadoop education programs. That is why I’m incredibly excited to kick off the next phase of our “Future of Apache Hadoop” webinar series. We’re starting off this segment with a 4-webinar series on September 12 with “Pig out to Hadoop” with Alan Gates (twitter: @alanfgates). Alan is an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan is also a member of the Apache Software Foundation and a co-founder of Hortonworks.

My only complaint is that the interview is too short!

Looking forward to the Pig webinar!

MapReduce Makes Further Inroads in Academia

Filed under: MapReduce — Patrick Durusau @ 7:42 pm

MapReduce Makes Further Inroads in Academia by Ian Armas Foster.

From the post:

Most conversations about Hadoop and MapReduce tend to filter in from enterprise quarters, but if the recent uptick in scholarly articles extolling its benefit for scientific and technical computing applications is any indication, the research world might have found its next open source darling.

Of course, it’s not just about making use of the approach: for many researchers, it’s about expanding, refining and tweaking the tool to make it suitable for a new, heavy-hitting class of applications. As a result, research to improve MapReduce’s functionality and efficiency flourishes, which could eventually provide some great trickle-down technology for the business users as well.

As one case among an increasing number, researchers Marcelo Neves, Tiago Ferreto, and Cesar De Rose of PUCRS in Brazil are working to extend the capabilities of MapReduce. Their approach to MapReduce sought to tackle one of the more complex issues for MapReduce on high performance computing hardware. In this case, the mighty scheduling problem was the target.

The team recently proposed a new algorithm, called MapReduce Job Adaptor, that would enhance MapReduce’s work rate and job scheduling. Neves et al. presented their algorithm in a recent paper.

I am not sure how (or if) it will be documented, but users of MapReduce should watch for how their analysis of a problem changes, based on the anticipated use of MapReduce.

Some academic is going to write the history of MapReduce on one or more problems. Could be you.

September 3, 2012

Small Data (200 MB up to 10 GB) [MySQL, MapReduce and Hive by the Numbers]

Filed under: Hive,MapReduce,MySQL — Patrick Durusau @ 1:09 pm

Study Stacks MySQL, MapReduce and Hive

From the post:

Many small and medium sized businesses would like to get in on the big data game but do not have the resources to implement parallel database management systems. That being the case, which relational database management system would provide small businesses the highest performance?

This question was asked and answered by Marissa Hollingsworth of Boise State University in a graduate case study that compared the performance rates of MySQL, Hadoop MapReduce, and Hive at scales no larger than nine gigabytes.

Hollingsworth also used only relational data, such as payment information, which stands to reason since anything more would require a parallel system. “This experiment,” said Hollingsworth, “involved a payment history analysis which considers customer, account, and transaction data for predictive analytics.”

The case study, the full text of which can be found here, concluded that MapReduce would beat out MySQL and Hive for datasets larger than one gigabyte. As Hollingsworth wrote, “The results show that the single server MySQL solution performs best for trial sizes ranging from 200MB to 1GB, but does not scale well beyond that. MapReduce outperforms MySQL on data sets larger than 1GB and Hive outperforms MySQL on sets larger than 2GB.”

Although your friends may not admit it, some of them have small data. Or interact with clients with small data.

You print this post out and put it in their inbox. Anonymously. They will appreciate it even if they can’t acknowledge having seen it.

When thinking about data and data storage, you might want to keep the comparisons you will find at: How much is 1 byte, kilobyte, megabyte, gigabyte, etc.? in mind.

Roughly speaking, 1 GB is the equivalent of 4,473 books.

The 10 GB limit in this study is roughly 44,730 books.

Sometimes all you need is small data.
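For the curious, the back-of-the-envelope arithmetic behind those book figures (the bytes-per-book number is my own inference from the post's 4,473 figure):

    # Back-of-the-envelope check of the book figures above.
    BOOKS_PER_GB = 4473
    bytes_per_book = 2**30 / BOOKS_PER_GB
    print(round(bytes_per_book / 1024))   # ~234 kilobytes of plain text per book
    print(10 * BOOKS_PER_GB)              # 44730 books for the study's 10 GB limit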

August 25, 2012

Introduction to Recommendations with Map-Reduce and mrjob [Ode to Similarity, Music]

Filed under: MapReduce,Music,Music Retrieval,Similarity — Patrick Durusau @ 10:56 am

Introduction to Recommendations with Map-Reduce and mrjob by Marcel Caraciolo

From the post:

In this post I will present how we can use the map-reduce programming model for making recommendations. Recommender systems are quite popular among shopping sites and social networks these days. How do they do it? Generally, the user interaction data available from items and products in shopping sites and social networks is enough information to build a recommendation engine using classic techniques such as Collaborative Filtering.

Usual recommendation post except for the emphasis on multiple tests of similarity.

Useful because simply reporting that two (or more) items are “similar” isn’t all that helpful. At least unless or until you know the basis for the comparison.

And have the expectation that a similar notion of “similarity” works for your audience.

For example, I read an article this morning about a “new” invention that will change the face of sheet music publishing, in three to five years. Invention Will Strike a Chord With Musicians

Despite the lack of terms like “markup,” “HyTime,” “SGML,” “XML,” “Music Encoding Initiative (MEI),” or “MusicXML,” all of those seemed quite “similar” to me. That may not be the “typical” experience but it is mine.

If you don’t want to wait three to five years for the sheet music revolution, you can check out MusicXML. It has been reported that more than 150 applications support MusicXML. Oh, that would be today, not three to five years from now.

You might want to pass the word along in the music industry before the next “revolution” in sheet music starts up.
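Back to the earlier point about multiple tests of similarity, a small Python sketch with invented ratings shows how Jaccard and cosine can tell different stories about the same pair of listeners:

    # Toy illustration of why the *basis* of a similarity claim matters: the
    # same two rating profiles look different under Jaccard (which songs were
    # rated at all) versus cosine (how they were rated). Data is invented.
    import math

    a = {"song1": 5, "song2": 1, "song3": 1}
    b = {"song1": 1, "song2": 5, "song4": 5}

    def jaccard(x, y):
        # overlap of the *sets* of rated items
        return len(x.keys() & y.keys()) / len(x.keys() | y.keys())

    def cosine(x, y):
        # angle between the rating vectors
        dot = sum(x[k] * y[k] for k in x.keys() & y.keys())
        norm_x = math.sqrt(sum(v * v for v in x.values()))
        norm_y = math.sqrt(sum(v * v for v in y.values()))
        return dot / (norm_x * norm_y)

    print(round(jaccard(a, b), 2))  # 0.5  -> "fairly similar" listeners
    print(round(cosine(a, b), 2))   # 0.27 -> "not very similar" tastes

Report which measure produced a score, or the number by itself tells your audience very little.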

August 13, 2012

Using an In-Memory Data Grid for Near Real-Time Data Analysis

Filed under: Data Analysis,MapReduce — Patrick Durusau @ 3:19 pm

Using an In-Memory Data Grid for Near Real-Time Data Analysis by Dr. William Bain, ScaleOut Software, Inc.

Vendor content so usual disclaimers apply but this may signal an important but subtle shift in computing environments.

From the post:

Introduction

In today’s competitive world, businesses need to make fast decisions to respond to changing market conditions and to maintain a competitive edge. The explosion of data that must be analyzed to find trends or hidden insights intensifies this challenge. Both the private and public sectors are turning to parallel computing techniques, such as “map/reduce” to quickly sift through large data volumes.

In some cases, it is practical to analyze huge sets of historical, disk-based data over the course of minutes or hours using batch processing platforms such as Hadoop. For example, risk modeling to optimize the handling of insurance claims potentially needs to analyze billions of records and tens of terabytes of data. However, many applications need to continuously analyze relatively small but fast-changing data sets measured in the hundreds of gigabytes and reaching into terabytes. Examples include clickstream data to optimize online promotions, stock trading data to implement trading strategies, machine log data to tune manufacturing processes, smart grid data, and many more.

Over the last several years, in-memory data grids (IMDGs) have proven their value in storing fast-changing application data and scaling application performance. More recently, IMDGs have integrated map/reduce analytics into the grid to achieve powerful, easy-to-use analysis and enable near real-time decision making. For example, the following diagram illustrates an IMDG used to store and analyze incoming streams of market and news data to help generate alerts and strategies for optimizing financial operations. This article explains how using an IMDG with integrated map/reduce capabilities can simplify data analysis and provide important competitive advantages.

Lowering the complexity of map/reduce, increasing operation speed (no file i/o), and enabling easier parallelism are all good things.

But they are differences in degree, not in kind.

I find IMDGs interesting because of the potential to increase the complexity of relationships between data, including data that is the output of operations.

From the post:

For example, an e-commerce Web site may need to monitor online shopping carts to see which products are selling.

Yawn.

That is probably a serious technical/data issue for Walmart or Home Depot, but it is a difference in degree. You could do the same operations with a shoebox and paper receipts, although that would take a while.

Consider the beginning of something a bit more imaginative: What if sales at stores were treated differently than online shopping carts (due to delivery factors) and models built using weather forecasts three to five days out, time of year, local holidays and festivals? Multiple relationships between different data nodes.

That is just a back of an envelope sketch and I am sure successful retailers do even more than what I have suggested.

Complex relationships between data elements are almost at our fingertips.

Are you still counting shopping cart items?

August 9, 2012

Apache Hadoop YARN – Background and an Overview

Filed under: Hadoop,Hadoop YARN,MapReduce — Patrick Durusau @ 3:39 pm

Apache Hadoop YARN – Background and an Overview by Arun Murthy.

From the post:

MapReduce – The Paradigm

Essentially, the MapReduce model consists of a first, embarrassingly parallel, map phase where input data is split into discrete chunks to be processed. It is followed by the second and final reduce phase where the output of the map phase is aggregated to produce the desired result. The simple, and fairly restricted, nature of the programming model lends itself to very efficient and extremely large-scale implementations across thousands of cheap, commodity nodes.

Apache Hadoop MapReduce is the most popular open-source implementation of the MapReduce model.

In particular, when MapReduce is paired with a distributed file-system such as Apache Hadoop HDFS, which can provide very high aggregate I/O bandwidth across a large cluster, the economics of the system are extremely compelling – a key factor in the popularity of Hadoop.

One of the keys to this is the lack of data motion, i.e. move compute to data and do not move data to the compute node via the network. Specifically, the MapReduce tasks can be scheduled on the same physical nodes on which data is resident in HDFS, which exposes the underlying storage layout across the cluster. This significantly reduces the network I/O patterns and allows for the majority of the I/O on the local disk or within the same rack – a core advantage.

An introduction to the architecture of Apache Hadoop YARN that starts with its roots in MapReduce.

August 6, 2012

r3 redistribute reduce reuse

Filed under: MapReduce,Python,Redis — Patrick Durusau @ 10:30 am

r3 redistribute reduce reuse

From the project homepage:

r³ is a map-reduce engine written in python using redis as a backend

r³ is a map reduce engine written in python using a redis backend. Its purpose is to be simple.

r³ has only three concepts to grasp: input streams, mappers and reducers.

You need to visit this project. It is simple, efficient and effective.

I found this by following r³ – A quick demo of usage, which I saw at: Demoing the Python-Based Map-Reduce R3 Against GitHub Data, Alex Popescu’s myNoSQL.
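To sketch the general idea (and only the idea; see the r³ docs for its actual API), here is a toy redis-backed word count using the redis-py client against a local server:

    # Toy sketch of a redis-backed map-reduce word count. This is NOT the r3
    # API; it only illustrates mappers pushing intermediate pairs into redis
    # and a reducer summing them. Assumes redis-py and a local redis server.
    import redis

    r = redis.Redis(decode_responses=True)

    def mapper(line):
        # "map": append a count of 1 to a per-word list held in redis
        for word in line.split():
            r.rpush("words:" + word, 1)

    def reducer():
        # "reduce": for each word key, sum up its list of counts
        return {key.split(":", 1)[1]: sum(int(v) for v in r.lrange(key, 0, -1))
                for key in r.keys("words:*")}

    if __name__ == "__main__":
        for key in r.keys("words:*"):     # clear any previous run
            r.delete(key)
        for line in ["a map reduce engine", "a redis backend", "simple by design"]:
            mapper(line)                  # in r3 these lines would be input streams
        print(reducer())                  # {'a': 2, 'map': 1, 'reduce': 1, ...}

Keeping the intermediate state in redis is what lets multiple mapper processes on different machines feed the same reduce step.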

August 5, 2012

More Fun with Hadoop In Action Exercises (Pig and Hive)

Filed under: Hadoop,Hive,MapReduce,Pig — Patrick Durusau @ 3:50 pm

More Fun with Hadoop In Action Exercises (Pig and Hive) by Sujit Pal.

From the post:

In my last post, I described a few Java based Hadoop Map-Reduce solutions from the Hadoop in Action (HIA) book. According to the Hadoop Fundamentals I course from Big Data University, part of being a Hadoop practitioner also includes knowing about the many tools that are part of the Hadoop ecosystem. The course briefly touches on the following four tools – Pig, Hive, Jaql and Flume.

Of these, I decided to focus (at least for the time being) on Pig and Hive (for the somewhat stupid reason that the HIA book covers these too). Both of these are high-level DSLs that produce sequences of Map-Reduce jobs. Pig provides a data flow language called PigLatin, and Hive provides a SQL-like language called HiveQL. Both tools provide a REPL shell, and both can be extended with UDFs (User Defined Functions). The reason they coexist in spite of so much overlap is that they are aimed at different users – Pig appears to be aimed at the programmer types and Hive at the analyst types.

The appeal of both Pig and Hive lies in the productivity gains – writing Map-Reduce jobs by hand gives you control, but it takes time to write. Once you master Pig and/or Hive, it is much faster to generate sequences of Map-Reduce jobs. In this post, I will describe three use cases (the first of which comes from the HIA book, and the other two I dreamed up).

More useful Hadoop exercise examples.

August 4, 2012

Scalding

Filed under: MapReduce,Scala,Scalding — Patrick Durusau @ 3:07 pm

Scalding: Powerful & Concise MapReduce Programming

Description:

Scala is a functional programming language on the JVM. Hadoop uses a functional programming model to represent large-scale distributed computation. Scala is thus a very natural match for Hadoop.

In this presentation to the San Francisco Scala User Group, Dr. Oscar Boykin and Dr. Argyris Zymnis from Twitter give us some insight on Scalding DSL and provide some example jobs for common use cases.

Twitter uses Scalding for data analysis and machine learning, particularly in cases where we need more than sql-like queries on the logs, for instance fitting models and matrix processing. It scales beautifully from simple, grep-like jobs all the way up to jobs with hundreds of map-reduce pairs.

The Alice example failed (counted the different forms of Alice differently). I am reading a regex book so that may have made the problem more obvious.

Lesson: Test code/examples before presentation. 😉

See the Github repository: https://github.com/twitter/scalding.

Both Scalding and the presentation are worth your time.

August 3, 2012

Introducing Apache Hadoop YARN

Filed under: Hadoop,Hadoop YARN,HDFS,MapReduce — Patrick Durusau @ 3:03 pm

Introducing Apache Hadoop YARN by Arun Murthy.

From the post:

I’m thrilled to announce that the Apache Hadoop community has decided to promote the next-generation Hadoop data-processing framework, i.e. YARN, to be a sub-project of Apache Hadoop in the ASF!

Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as the sub-projects of the Apache Hadoop which, itself, is a Top Level Project in the Apache Software Foundation. Until this milestone, YARN was a part of the Hadoop MapReduce project and now is poised to stand up on its own as a sub-project of Hadoop.

In a nutshell, Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data-processing.

As folks are aware, Hadoop HDFS is the data storage layer for Hadoop and MapReduce was the data-processing layer. However, the MapReduce algorithm, by itself, isn’t sufficient for the very wide variety of use-cases we see Hadoop being employed to solve. With YARN, Hadoop now has a generic resource-management and distributed application framework, whereby one can implement multiple data processing applications customized for the task at hand. Hadoop MapReduce is now one such application for YARN and I see several others given my vantage point – in future you will see MPI, graph-processing, simple services etc.; all co-existing with MapReduce applications in a Hadoop YARN cluster.

Considering the explosive growth of Hadoop, what new data processing applications do you see emerging first in YARN?

August 2, 2012

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part III

Filed under: Bioinformatics,Biomedical,Hadoop,MapReduce — Patrick Durusau @ 9:23 pm

Processing Rat Brain Neuronal Signals Using a Hadoop Computing Cluster – Part III by Jadin C. Jackson, PhD & Bradley S. Rubin, PhD.

From the post:

Up to this point, we’ve described our reasons for using Hadoop and Hive on our neural recordings (Part I), the reasons why the analyses of these recordings are interesting from a scientific perspective, and detailed descriptions of our implementation of these analyses using Hadoop and Hive (Part II). The last part of this story cuts straight to the results and then discusses important lessons we learned along the way and future goals for improving the analysis framework we’ve built so far.

Biomedical researchers will be interested in the results but I am more interested in the observation that Hadoop makes it possible to retain results for ad hoc analysis.

