Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 8, 2013

Natural Language Processing and Big Data…

Filed under: BigData,Hadoop,MapReduce,Natural Language Processing — Patrick Durusau @ 9:47 am

Natural Language Processing and Big Data: Using NLTK and Hadoop – Talk Overview by Benjamin Bengfort.

From the post:

My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (e.g. reading level) of children’s literature. Our approach started as a natural language processing problem — designed to pull out language features to train our algorithms, and then quickly became a big data problem when we realized how much literature we had to go through in order to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk done at the April Data Science DC meetup.

Think of this post as the Cliff Notes of the talk and the upcoming series of posts so you don’t have to read every word … but trust me, it’s worth it.

If you can’t wait for the future posts, Benjamin’s presentation from April is here. Amusing but fairly sparse slides.

Looking forward to more posts in this series!


Big Data and Natural Language Processing – Part 1

The “Foo” of Big Data – Part 2

Python’s Natural Language Toolkit (NLTK) and Hadoop – Part 3

Hadoop for Preprocessing Language – Part 4

Beyond Preprocessing – Weakly Inferred Meanings – Part 5
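
Before those posts arrive, here is a rough sense of what the NLTK half of such a pipeline can look like: a minimal Hadoop Streaming mapper in Python. This is my sketch, not Benjamin’s code; it assumes NLTK and its tokenizer data are installed on every task node, and the features it emits (word count and average word length) are arbitrary stand-ins for real complexity features.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming mapper sketch: emit two crude language features
# (word count, average word length) per input line. Illustration only, not
# the architecture from the talk; assumes NLTK and its "punkt" tokenizer
# data are installed on every task node.
import sys

from nltk.tokenize import word_tokenize

for line in sys.stdin:
    text = line.strip()
    if not text:
        continue
    words = [t for t in word_tokenize(text) if t.isalpha()]
    if not words:
        continue
    avg_len = sum(len(w) for w in words) / float(len(words))
    # Hadoop Streaming expects tab-separated key/value output.
    print("%d\t%.2f" % (len(words), avg_len))
```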

May 7, 2013

Cloudera Development Kit (CDK)…

Filed under: Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:01 pm

Cloudera Development Kit (CDK): Hadoop Application Development Made Easier by Eric Sammer & Tom White.

From the post:

At Cloudera, we have the privilege of helping thousands of developers learn Apache Hadoop, as well as build and deploy systems and applications on top of Hadoop. While we (and many of you) believe that platform is fast becoming a staple system in the data center, we’re also acutely aware of its complexities. In fact, this is the entire motivation behind Cloudera Manager: to make the Hadoop platform easy for operations staff to deploy and manage.

So, we’ve made Hadoop much easier to “consume” for admins and other operators — but what about for developers, whether working for ISVs, SIs, or users? Until now, they’ve largely been on their own.

That’s why we’re really excited to announce the Cloudera Developer Kit (CDK), a new open source project designed to help developers get up and running to build applications on CDH, Cloudera’s open source distribution including Hadoop, faster and easier than before. The CDK is a collection of libraries, tools, examples, and documentation engineered to simplify the most common tasks when working with the platform. Just like CDH, the CDK is 100% free, open source, and licensed under the same permissive Apache License v2, so you can use the code any way you choose in your existing commercial code base or open source project.

The CDK lives on GitHub where users can freely browse, download, fork, and contribute back to the source. Community contributions are not only welcome but strongly encouraged. Since most Java developers use tools such as Maven (or tools that are compatible with Maven repositories), artifacts are also available from the Cloudera Maven Repository for easy project integration.


What’s In There Today

Our goal is to release a number of CDK modules over time. The first module that can be found in the current release is the CDK Data module; a set of APIs to drastically simplify working with datasets in Hadoop filesystems such as HDFS and the local filesystem. The Data module handles automatic serialization and deserialization of Java POJOs as well as Avro Records, automatic compression, file and directory layout and management, automatic partitioning based on configurable functions, and a metadata provider plugin interface to integrate with centralized metadata management systems (including HCatalog). All Data APIs are fully documented with javadoc. A reference guide is available to walk you through the important parts of the module, as well. Additionally, a set of examples is provided to help you see the APIs in action immediately.

Here’s hoping that vendor support, as shown for Hadoop, Lucene/Solr, R (who am I missing?), continues and spreads to other areas of software development.

Hadoop Webinars (WANdisco)

Filed under: Hadoop,MapReduce — Patrick Durusau @ 2:28 pm

Hadoop Webinars (WANdisco)

From May to July, webinars on Hadoop:

A Hadoop Overview

Wednesday, May 15
10:00 a.m. PT
1:00 p.m. ET

In this webinar, we'll provide an overview of Hadoop’s history and architecture.

This session will highlight: 

  • Major components such as HDFS, MapReduce, and HBase – the NoSQL database management system used with Hadoop for real-time applications
  • A summary of Hadoop’s ecosystem
  • A review of public and private cloud deployment options
  • Common business use cases
  • And more…

Hadoop: A Deep Dive

Wednesday, June 5
10:00 a.m. PT
1:00 p.m. ET

This session will present: 

  • Various Hadoop misconceptions (not all clusters are comprised of thousands of machines)
  • Information about real world Hadoop deployments
  • A detailed review of Hadoop’s ecosystem (Sqoop, Flume, Nutch, Oozie, etc.)
  • An in-depth look at HDFS
  • An explanation of MapReduce in relation to latency and dependence on other Hadoop activities
  • An introduction to concepts attendees will need as a prerequisite for subsequent training webinars covering MapReduce, HBase and other major components at a deeper technical level

Hadoop: A MapReduce Tutorial

Wednesday, June 19
10:00 a.m. PT
1:00 p.m. ET

This session will cover: 

  • MapReduce at a deep technical level
  • The history of MapReduce
  • How a MapReduce job works, its logical flow, and the rules and types of MapReduce jobs
  • Writing, debugging and testing MapReduce jobs
  • Various available workflow tools
  • And more…

Hadoop: HBase In-Depth

Wednesday, July 10
10:00 a.m. PT
1:00 p.m. ET

This session is a deep technical review covering:

  • Flexibility
  • Scalability
  • Components (cells, rows, columns, qualifiers)
  • Schema samples
  • Hardware requirements
  • And more…

Hard to say how “deep” the webinars will be able to get in only one (1) hour.

I have registered for all four (4) and will be reporting back on my experience.

May 4, 2013

Ambari for provisioning, managing and monitoring Hadoop

Filed under: Ambari,Hadoop — Patrick Durusau @ 2:51 pm

Ambari for provisioning, managing and monitoring Hadoop

From the post:

Ambari is 100% open source and included in HDP, greatly simplifying installation and initial configuration of Hadoop clusters. In this article we’ll be running through some installation steps to get started with Ambari. Most of the steps here are covered in the main HDP documentation here.

The first order of business is getting Ambari Server itself installed. There are different approaches to this, but for the purposes of this short tour, we’ll assume Ambari is already installed on its own dedicated node somewhere or on one of the nodes on the (future) cluster itself. Instructions can be found under the installation steps linked above. Once Ambari Server is running, the hard work is actually done. Ambari simplifies cluster install and initial configuration with a wizard interface, taking care of it with but a few clicks and decisions from the end user. Hit http://<ambari-server-host>:8080 and log in with admin/admin. Upon logging in, we are greeted with a user-friendly, wizard interface. Welcome to Apache Ambari! Name that cluster and let’s get going.

Even if you are working on the bleeding edge of big data, somebody has to mind the cluster.

This will help you discuss the process of building a cluster with confidence. (Even if you are chary of taking on the task alone.)

May 1, 2013

Impala 1.0

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 7:30 pm

Impala 1.0: Industry’s First Production-Ready SQL-on-Hadoop Solution

From the post:

Cloudera, the category leader that sets the standard for Apache Hadoop in the enterprise, today announced the general availability of Cloudera Impala, its open source, interactive SQL query engine for analyzing data stored in Hadoop clusters in real time. Cloudera was first-to-market with its SQL-on-Hadoop offering, releasing Impala to open source as a public beta offering in October 2012. Since that time, it has worked closely with customers and open source users, rigorously testing and refining the platform in real world applications to deliver today’s production-hardened and customer validated release, designed from the ground-up for enterprise workloads. The company noted that adoption of the platform has been strong: over 40 enterprise customers and open source users are using Impala today, including 37signals, Expedia, Six3 Systems, Stripe, and Trion Worlds. With its 1.0 release, Impala extends Cloudera’s unified Platform for Big Data, which is designed specifically to bring different computation frameworks and applications to a single pool of data, using a common set of system resources.

The bigger data pools get, the more opportunity there is for semantic confusion.

Or to put that more positively, the greater the market for tools to lessen or avoid semantic confusion.

😉

Have you used Lua for MapReduce?

Filed under: Hadoop,MapReduce,Semantic Diversity — Patrick Durusau @ 1:36 pm

Have you used Lua for MapReduce?

From the post:

Lua as a cross platform programming language has been popularly used in games and embedded systems. However, due to its excellent use for configuration, it has found wider acceptance in other use cases as well.

Lua was inspired by SOL (Simple Object Language) and DEL (Data-Entry Language) and created by Roberto Ierusalimschy, Waldemar Celes, and Luiz Henrique de Figueiredo at the Pontifical Catholic University of Rio de Janeiro, Brazil. Roughly translated to ‘Moon’ in Portuguese, it has found many big takers like Adobe, Nginx, Wikipedia.

Another scripting language to use with MapReduce and Hadoop.

Have you ever noticed the Tower of Babel seems to follow human activity around?

First, it was building a tower to heaven – confuse the workforce.

Then it was other community efforts.

And many, many thens later, it has arrived at MapReduce/Hadoop configuration languages.

Like a kaleidoscope, it just gets richer the more semantic diversity we add.

Do you wonder what the opposite of semantic diversity must look like?

Or if we are the cause, what would it mean to eliminate semantic diversity?

April 28, 2013

What’s New in Hue 2.3

Filed under: Hadoop,Hive,Hue,Impala — Patrick Durusau @ 3:43 pm

What’s New in Hue 2.3

From the post:

We’re very happy to announce the 2.3 release of Hue, the open source Web UI that makes Apache Hadoop easier to use.

Hue 2.3 comes only two months after 2.2 but contains more than 100 improvements and fixes. In particular, two new apps were added (including an Apache Pig editor) and the query editors are now easier to use.

Here’s the new features list:

  • Pig Editor: new application for editing and running Apache Pig scripts with UDFs and parameters
  • Table Browser: new application for managing Apache Hive databases, viewing table schemas and sample of content
  • Apache Oozie Bundles are now supported
  • SQL highlighting and auto-completion for Hive/Impala apps
  • Multi-query and highlight/run a portion of a query
  • Job Designer was totally restyled and now supports all Oozie actions
  • Oracle databases (11.2 and later) are now supported

Time to upgrade!

April 25, 2013

Hadoop Summit North America (June 26-27, 2013)

Filed under: Conferences,Hadoop,MapReduce — Patrick Durusau @ 1:44 pm

Hadoop Summit North America

From the webpage:

Hortonworks and Yahoo! are pleased to host the 6th Annual Hadoop Summit, the leading conference for the Apache Hadoop community. This two-day event will feature many of the Apache Hadoop thought leaders who will showcase successful Hadoop use cases, share development and administration tips and tricks, and educate organizations about how best to leverage Apache Hadoop as a key component in their enterprise data architecture. It will also be an excellent networking event for developers, architects, administrators, data analysts, data scientists and vendors interested in advancing, extending or implementing Apache Hadoop.

Community Choice Selectees:

  • Application and Data Science Track: Watching Pigs Fly with the Netflix Hadoop Toolkit (Netflix)
  • Deployment and Operations Track: Continuous Integration for the Applications on top of Hadoop (Yahoo!)
  • Enterprise Data Architecture Track: Next Generation Analytics: A Reference Architecture (Mu Sigma)
  • Future of Apache Hadoop Track: Jubatus: Real-time and Highly-scalable Machine Learning Platform (Preferred Infrastructure, Inc.)
  • Hadoop (Disruptive) Economics Track: Move to Hadoop, Go Fast and Save Millions: Mainframe Legacy Modernization (Sears Holding Corp.)
  • Hadoop-driven Business / BI Track: Big Data, Easy BI (Yahoo!)
  • Reference Architecture Track: Genie – Hadoop Platformed as a Service at Netflix (Netflix)

If you need another reason to attend, it’s located in San Jose, California.

2nd best US location for a conference. #1 being New Orleans.

Beginner Tips For Elastic MapReduce

Filed under: Cloud Computing,Elastic Map Reduce (EMR),Hadoop,MapReduce — Patrick Durusau @ 1:08 pm

Beginner Tips For Elastic MapReduce by John Berryman.

From the post:

By this point everyone is well acquainted with the power of Hadoop’s MapReduce. But what you’re also probably well acquainted with is the pain that must be suffered when setting up your own Hadoop cluster. Sure, there are some really good tutorials online if you know where to look:

However, I’m not much of a dev ops guy so I decided I’d take a look at Amazon’s Elastic MapReduce (EMR) and for the most part I’ve been very pleased. However, I did run into a couple of difficulties, and hopefully this short article will help you avoid my pitfalls.

I often dream of setting up a cluster that requires a newspaper hat because of the oil from cooling the coils. Wait, that was a replica of the early cyclotron. Sorry, wrong experiment. 😉

I mean a cluster of computers humming and driving up my cooling bills.

But there are alternatives.

Amazon’s Elastic Map Reduce (EMR) is one.

You can learn Hadoop with Hortonworks Sandbox and when you need production power, EMR awaits.

From a cost effectiveness standpoint, that sounds like a good deal to me.

You?
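
For a sense of how little ceremony EMR involves compared to running your own cluster, here is a minimal sketch using the boto 2.x EMR API. The S3 paths and instance types are placeholders, credentials are assumed to be in the usual AWS environment, and this is not code from John’s post.

```python
# Minimal Elastic MapReduce sketch using boto 2.x. All S3 paths and the
# instance types are placeholders; AWS credentials are assumed to be in
# the usual boto configuration or environment variables.
import boto.emr
from boto.emr.step import StreamingStep

conn = boto.emr.connect_to_region("us-east-1")

step = StreamingStep(
    name="Streaming word count",
    mapper="s3://my-bucket/scripts/mapper.py",
    reducer="s3://my-bucket/scripts/reducer.py",
    input="s3://my-bucket/input/",
    output="s3://my-bucket/output/")

jobflow_id = conn.run_jobflow(
    name="EMR beginner example",
    log_uri="s3://my-bucket/logs/",
    steps=[step],
    num_instances=3,
    master_instance_type="m1.small",
    slave_instance_type="m1.small")

print(jobflow_id)
```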

PS: Someone told me today that Amazon isn’t a reliable cloud because they have downtime. It is true that Amazon does have downtime but that isn’t a deciding factor.

You have to consider the relationship between Amazon’s aggressive pricing and how much reliability you need.

If you are running flight control for a moon launch, you probably should not use a public cloud.

Or for a heart surgery theater. And a few other places like that.

If you mean the web services for your < 4,000 member NGO, 100% guaranteed uptime is a recipe for someone making money off of you.

April 23, 2013

MRQL – a SQL on Hadoop Miracle

Filed under: Hadoop,MRQL — Patrick Durusau @ 6:43 pm

MRQL – a SQL on Hadoop Miracle by Edward J. Yoon.

From the post:

Recently, the Apache Incubator accepted a new query engine for Hadoop and Hama, called MRQL (pronounced miracle), which was initially developed in 2011 by Leonidas Fegaras.

MRQL (MapReduce Query Language) is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop and Hama. MRQL has some overlapping functionality with Hive, Impala and Drill, but one major difference is that it can capture many complex data analysis algorithms that can not be done easily in those systems in declarative form. So, complex data analysis tasks, such as PageRank, k-means clustering, and matrix multiplication and factorization, can be expressed as short SQL-like queries, while the MRQL system is able to evaluate these queries efficiently.

Another difference from these systems is that the MRQL system can run these queries in BSP (Bulk Synchronous Parallel) mode, in addition to the MapReduce mode. With BSP mode, it achieves lower latency and higher speed. According to MRQL team, “In near future, MRQL will also be able to process very large data effectively fast without memory limitation and significant performance degradation in the BSP mode”.

Maybe I should turn my back on the newsfeed more often. 😉

I suspect the announcement and my distraction were unrelated.

This looks very important.

I can feel another Apache list subscription in the offing.

April 19, 2013

Schema on Read? [The virtues of schema on write]

Filed under: BigData,Database,Hadoop,Schema,SQL — Patrick Durusau @ 3:48 pm

Apache Hadoop and Data Agility by Ofer Mendelevitch.

From the post:

In a recent blog post I mentioned the 4 reasons for using Hadoop for data science. In this blog post I would like to dive deeper into the last of these reasons: data agility.

In most existing data architectures, based on relational database systems, the data schema is of central importance, and needs to be designed and maintained carefully over the lifetime of the project. Furthermore, whatever data fits into the schema will be stored, and everything else typically gets ignored and lost. Changing the schema is a significant undertaking, one that most IT organizations don’t take lightly. In fact, it is not uncommon for a schema change in an operational RDBMS system to take 6-12 months if not more.

Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.

If a schema is supplied “on read,” how is data validation accomplished?

I don’t mean in terms of datatypes such as string, integer, double, etc. Those are trivial forms of data validation.

How do we validate the semantics of data when a schema is supplied on read?

Mistakes do happen in RDBMS systems but with a schema, which defines data semantics, applications can attempt to police those semantics.

I don’t doubt that schema “on read” supplies a lot of useful flexibility, but how do we limit the damage that flexibility can cause?

For example, many years ago, area codes (for telephones) in the USA were tied to geographic exchanges. That is no longer true in many cases, but data from that era still exists in the bowels of some data stores.

Let’s assume I have older data that has area codes tied to geographic areas and newer data that has area codes that are not. Without a schema to define the area code data in both cases, how would I know to treat the area code data differently?
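
To make the question concrete, here is a toy Python sketch of what “schema on read” amounts to in that case: the interpretation of the area code lives in the reading code rather than in the store. The field layout and the cutover date are made up for illustration.

```python
# Toy "schema on read" sketch: the same raw records are interpreted
# differently depending on the rule supplied at read time. The field
# layout and the cutover date are invented for illustration.
import csv
import io

RAW = ("555-867-5309,1988-06-01\n"
       "555-867-5309,2012-06-01\n")

def read_calls(raw, geographic_until="1995-01-01"):
    for number, recorded in csv.reader(io.StringIO(raw)):
        area_code = number.split("-")[0]
        # This conditional *is* the schema: older records treat the area
        # code as a geographic exchange, newer ones do not.
        yield {
            "area_code": area_code,
            "recorded": recorded,
            "area_code_is_geographic": recorded < geographic_until,
        }

for record in read_calls(RAW):
    print(record)
```

Nothing stops a second reader from applying a different rule, or no rule at all, to the same bytes, which is exactly the flexibility and the risk at issue.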

I concede that schema “on read” can be quite flexible.

On the other hand, let’s not discount the value of schema “on write” as well.

Analyzing Data with Hue and Hive

Filed under: Hadoop,Hive,Hue — Patrick Durusau @ 2:06 pm

Analyzing Data with Hue and Hive by Romain Rigaux.

From the post:

In the first installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how file operations are simplified via the File Browser application. In this installment, we’ll focus on analyzing data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications (based on Hue 2.3 and later).

The Yelp Dataset Challenge provides a good use case. This post explains, through a video and tutorial, how you can get started doing some analysis and exploration of Yelp data with Hue. The goal is to find the coolest restaurants in Phoenix!

I think the demo would be more effective if a city known for good food, New Orleans, for example, had been chosen for the challenge.

But given the complexity of the cuisine, that would be a stress test for human experts.

What chance would Apache Hadoop have? 😉

April 18, 2013

Hadoop: The Lay of the Land

Filed under: Hadoop,MapReduce — Patrick Durusau @ 10:47 am

Hadoop: The Lay of the Land by Tom White.

From the post:

The core map-reduce framework for big data consists of several interlocking technologies. This first installment of our tutorial explains what Hadoop does and how the pieces fit together.

Big Data is in the news these days, and Apache Hadoop is one of the most popular platforms for working with Big Data. Hadoop itself is undergoing tremendous growth as new features and components are added, and for this reason alone, it can be difficult to know how to start working with it. In this three-part series, I explain what Hadoop is and how to use it, presenting simple, hands-on examples that you can try yourself. First, though, let’s look at the problem that Hadoop was designed to solve.

Much later:

Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He is an engineer at Cloudera, a company set up to offer Hadoop tools, support, and training. He is the author of the best-selling O’Reilly book, Hadoop: The Definitive Guide.

If you are getting started with Hadoop or need a good explanation for others, start here.

I first saw this at: Learn How To Hadoop from Tom White in Dr. Dobb’s by Justin Kestelyn.

How Hadoop Works? HDFS case study

Filed under: Hadoop,HDFS — Patrick Durusau @ 4:14 am

How Hadoop Works? HDFS case study by Dane Dennis.

From the post:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. The Hadoop library contains two major components, HDFS and MapReduce; in this post we will go inside the HDFS part and discover how it works internally.

Knowing how to use Hadoop is one level of expertise.

Knowing how Hadoop works takes you to the next level.

One where you can better adapt Hadoop to your needs.

Understanding HDFS is a step in that direction.
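
If you want to poke at HDFS directly while you read, the WebHDFS REST interface needs nothing but HTTP. A minimal Python sketch, assuming WebHDFS is enabled, the NameNode answers on its default web port (50070), simple user.name authentication, and placeholder host and paths:

```python
# Minimal WebHDFS sketch: list a directory and read a file over plain HTTP.
# Assumes WebHDFS is enabled, the default NameNode web port (50070), and
# simple user.name authentication; host and paths are placeholders.
import requests

NAMENODE = "http://namenode.example.com:50070"
USER = "hdfs"

# List the contents of /user/hdfs
resp = requests.get(NAMENODE + "/webhdfs/v1/user/hdfs",
                    params={"op": "LISTSTATUS", "user.name": USER})
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["type"], entry["pathSuffix"], entry["length"])

# Read a file: the NameNode redirects OPEN to a DataNode holding the
# blocks, and requests follows the redirect automatically.
resp = requests.get(NAMENODE + "/webhdfs/v1/user/hdfs/example.txt",
                    params={"op": "OPEN", "user.name": USER})
print(resp.text[:200])
```

That redirect from NameNode to DataNode is HDFS’s metadata/data split showing through the API.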

April 16, 2013

Hadoop, The Perfect App for OpenStack

Filed under: Cloud Computing,Hadoop,Hortonworks,OpenStack — Patrick Durusau @ 6:03 pm

Hadoop, The Perfect App for OpenStack by Shaun Connolly.

From the post:

The convergence of big data and cloud is a disruptive market force that we at Hortonworks not only want to encourage but also accelerate. Our partnerships with Microsoft and Rackspace have been perfect examples of bringing Hadoop to the cloud in a way that enables choice and delivers meaningful value to enterprise customers. In January, Hortonworks joined the OpenStack Foundation in support of our efforts with Rackspace (i.e. OpenStack-based Hadoop solution for the public and private cloud).

Today, we announced our plans to work with engineers from Red Hat and Mirantis within the OpenStack community on open source Project Savanna to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds.

Why is this news important?

Because big data and cloud computing are two of the top priorities in enterprise IT today, and it’s our intention to work diligently within the Hadoop and OpenStack open source communities to deliver solutions in support of these market needs. By bringing our Hadoop expertise to the OpenStack community in concert with Red Hat (the leading contributor to OpenStack), Mirantis (the leading system integrator for OpenStack), and Rackspace (a founding member of OpenStack), we feel we can speed the delivery of operational agility and efficient sharing of infrastructure that deploying elastic Hadoop on OpenStack can provide.

Why is this news important for topic maps?

Have you noticed that none, read none, of the big data or cloud efforts say anything about data semantics?

As if when big data and the cloud arrive, all your data integration problems will magically melt away.

I don’t think so.

What I think is going to happen is discordant data sets are going to start rubbing and binding on each other. Perhaps not a lot at first but as data explorers get bolder, the squeaks are going to get louder.

So loud in fact the squeaks (now tearing metal sounds) are going to attract the attention of… (drum roll)… the CEO.

What’s your answer for discordant data?

  • Ear plugs?
  • Job with another company?
  • Job in another country?
  • Job under an assumed name?

I would say none of the above.

Iterative Map Reduce – Prior Art

Filed under: Hadoop,MapReduce — Patrick Durusau @ 11:59 am

Iterative Map Reduce – Prior Art

From the post:

There have been several attempts in the recent past at extending Hadoop to support efficient iterative data processing on clusters. To facilitate understanding this problem better, here is a collection of some prior art relating to this problem space.

Short summaries of:

Other proposals to add to this list?

April 9, 2013

Apache Hadoop Patterns of Use: Refine, Enrich and Explore

Filed under: Hadoop,Hortonworks — Patrick Durusau @ 9:27 am

Apache Hadoop Patterns of Use: Refine, Enrich and Explore by Jim Walter.

From the post:

“OK, Hadoop is pretty cool, but exactly where does it fit and how are other people using it?” Here at Hortonworks, this has got to be the most common question we get from the community… well that and “what is the airspeed velocity of an unladen swallow?”

We think about this (where Hadoop fits) a lot and have gathered a fair amount of expertise on the topic. The core team at Hortonworks includes the original architects, developers and operators of Apache Hadoop and its use at Yahoo, and through this experience and working within the larger community they have been privileged to see Hadoop emerge as the technological underpinning for so many big data projects. That has allowed us to observe certain patterns that we’ve found greatly simplify the concepts associated with Hadoop, and our aim is to share some of those patterns here.

As an organization laser focused on developing, distributing and supporting Apache Hadoop for enterprise customers, we have been fortunate to have a unique vantage point.

With that, we’re delighted to share with you our new whitepaper ‘Apache Hadoop Patterns of Use’. The patterns discussed in the whitepaper are:

  • Refine: Collect data and apply a known algorithm to it in a trusted operational process.
  • Enrich: Collect data, analyze and present salient results for online apps.
  • Explore: Collect data and perform iterative investigation for value.

You can download it here, and we hope you enjoy it.

If you are looking for detailed patterns of use, you will be disappointed.

Runs about nine (9) pages in very high level summary mode.

What remains to be written (to my knowledge) is a collection of use patterns with a realistic amount of detail from a cross-section of Hadoop users.

That would truly be a compelling resource for the community.

One Hour Hadoop Cluster

Filed under: Ambari,Hadoop,Virtual Machines — Patrick Durusau @ 5:02 am

How to setup a Hadoop cluster in one hour using Ambari?

A guide to setting up a 3-node Hadoop cluster using Oracle’s VirtualBox and Apache Ambari.

HPC may not be the key to semantics but it can still be useful. 😉

April 8, 2013

WTF: [footnote 1]

Filed under: Cassovary,Graphs,Hadoop,Recommendation — Patrick Durusau @ 1:52 pm

WTF: The Who to Follow Service at Twitter by Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh.

Abstract:

WTF (“Who to Follow”) is Twitter’s user recommendation service, which is responsible for creating millions of connections daily between users based on shared interests, common connections, and other related factors. This paper provides an architectural overview and shares lessons we learned in building and running the service over the past few years. Particularly noteworthy was our design decision to process the entire Twitter graph in memory on a single server, which significantly reduced architectural complexity and allowed us to develop and deploy the service in only a few months. At the core of our architecture is Cassovary, an open-source in-memory graph processing engine we built from scratch for WTF. Besides powering Twitter’s user recommendations, Cassovary is also used for search, discovery, promoted products, and other services as well. We describe and evaluate a few graph recommendation algorithms implemented in Cassovary, including a novel approach based on a combination of random walks and SALSA. Looking into the future, we revisit the design of our architecture and comment on its limitations, which are presently being addressed in a second-generation system under development.

You know it is going to be an amusing paper when footnote 1 reads:

The confusion with the more conventional expansion of the acronym is intentional and the butt of many internal jokes. Also, it has not escaped our attention that the name of the service is actually ungrammatical; the pronoun should properly be in the objective case, as in “whom to follow”.

😉

Algorithmic recommendations may miss the mark for an end user.

On the other hand, what about an authoring interface that supplies recommendations of associations and other subjects?

A paper definitely worth a slow read!
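
If you want a feel for the random-walk idea before tackling the paper, here is a toy random walk with restart over a tiny follow graph. It illustrates the general approach only; it is neither Cassovary nor the SALSA-based algorithm the paper evaluates.

```python
# Toy random walk with restart over a small "who follows whom" graph.
# General illustration only: not Cassovary and not the SALSA variant
# described in the paper.
import random
from collections import Counter

FOLLOWS = {
    "alice": ["bob", "carol"],
    "bob": ["carol", "dave"],
    "carol": ["dave"],
    "dave": ["alice", "erin"],
    "erin": ["dave"],
}

def recommend(user, steps=20000, restart=0.15, seed=1):
    rng = random.Random(seed)
    visits = Counter()
    current = user
    for _ in range(steps):
        if rng.random() < restart or not FOLLOWS.get(current):
            current = user                      # restart at the user
        else:
            current = rng.choice(FOLLOWS[current])
        visits[current] += 1
    visits.pop(user, None)
    for already_followed in FOLLOWS.get(user, []):
        visits.pop(already_followed, None)      # drop existing follows
    return visits.most_common()

print(recommend("alice"))   # e.g. dave and erin rank as suggestions
```

Swap “accounts to follow” for “candidate associations” and the same machinery could drive the authoring suggestions mentioned above.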

I first saw this at: WTF: The Who to Follow Service at Twitter (Pankaj Gupta, Ashish Goel, Jimmy Lin, Aneesh Sharma, Dong Wang, Reza Zadeh).

HDFS File Operations Made Easy with Hue (demo)

Filed under: Hadoop,HDFS,Hue — Patrick Durusau @ 1:33 pm

HDFS File Operations Made Easy with Hue by Romain Rigaux.

From the post:

Managing and viewing data in HDFS is an important part of Big Data analytics. Hue, the open source web-based interface that makes Apache Hadoop easier to use, helps you do that through a GUI in your browser — instead of logging into a Hadoop gateway host with a terminal program and using the command line.

The first episode in a new series of Hue demos, the video below demonstrates how to get up and running quickly with HDFS file operations via Hue’s File Browser application.

Very nice 2:18 video.

Brings the usual graphical file interface to Hadoop (no small feat) but reminds me of every other graphical file interface.

To step beyond the common graphical file interface, why not add features such as:

  • Links to scripts that call a file
  • File ownership – show all files owned by a user
  • Navigation of files by content type(s)
  • Grouping of files by common scripts
  • Navigation of files by content
  • Grouping of files by script owners calling the files

Those are just a few of the possibilities that come to mind.

I would make the roles in those relationships explicit but that is probably my topic map background showing through.

April 3, 2013

MapR and Ubuntu

Filed under: Hadoop,MapR,MapReduce — Patrick Durusau @ 5:06 am

MapR has posted all of its Hadoop ecosystem source code to Github: MapR Technologies.

MapR has also partnered with Canonical to release the entire Hadoop stack for 12.04 LTS and 12.10 releases of Ubuntu on www.ubuntu.com starting April 25, 2013.

For details see: MapR Teams with Canonical to Deliver Hadoop on Ubuntu.

I first saw this at: MapR Turns to Ubuntu in Bid to Increase Footprint by Isaac Lopez.

March 31, 2013

Data accounts for up to 75 percent of value in half of businesses

Filed under: BigData,Hadoop — Patrick Durusau @ 9:05 am

Data accounts for up to 75 percent of value in half of businesses

From the post:

As the volume of data stored in the enterprise continues to grow, organizations see this information as representing a substantial portion of their assets. With tools such as Hadoop for Windows, businesses are unlocking the value of this data, Anthony Saxby, Microsoft U.K.’s data platform product marketing manager, said in a recent talk at Computing’s Big Data Summit 2013. According to Microsoft’s research, half of all organizations think their data represents 50 to 75 percent of their total value.

The challenge in unlocking this value is technology, Saxby said, according to Computing. Much of this information is internally siloed or separated from the external data sources that it could be combined with to create more effective, monetized results. Today’s businesses want to bring together unstructured and structured data to create new insights. With tools such as Hadoop, this type of analysis is increasingly possible. For instance, record label EMI uses a variety of data types across 25 countries to determine how to market music artists in different geographies.

The headline reminded me of Bilbo Baggins:

I don’t know half of you half as well as I should like; and I like less than half of you half as well as you deserve.

As the narrator notes:

This was unexpected and rather difficult.

I don’t follow the WSJ as closely as some, but what of inventories, brick and mortar assets, accounts receivable, employees, IP, etc.?

Not that I doubt the value of data.

I do doubt the ability of businesses that manage by catch phrases like “big data,” “silos,” “unstructured and structured data,” and “Hadoop” to realize its value.

Hadoop will figure in successful projects to “unlock data,” but only where it is used as a tool and not a magic bullet.

A clear understanding of data and its sources, and of how to measure ROI from its use, are only two of the keys to successful use of any data tool.

Piling up data freed from internal silos upon data from external sources results in a big heap of data.

Impressive to the uninformed but it won’t increase your bottom line.

March 26, 2013

Analyzing Twitter Data with Apache Hadoop, Part 3:…

Filed under: Hadoop,Hive,Tweets — Patrick Durusau @ 12:52 pm

Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive by Jon Natkins.

From the post:

This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.

In the previous article in this series, we saw how Flume can be utilized to ingest data into Hadoop. However, that data is useless without some way to analyze the data. Personally, I come from the relational world, and SQL is a language that I speak fluently. Apache Hive provides an interface that allows users to easily access data in Hadoop via SQL. Hive compiles SQL statements into MapReduce jobs, and then executes them across a Hadoop cluster.

In this article, we’ll learn more about Hive, its strengths and weaknesses, and why Hive is the right choice for analyzing tweets in this application.

I didn’t realize I had missed this part of the Hive series until I saw it mentioned in the Hue post.

Good introduction to Hive.
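
If you would rather follow along from a shell than from Hue, a query like the ones in the series can be driven from Python by shelling out to the Hive CLI. A minimal sketch: it assumes the hive client is on PATH and a tweets table exists, and the table and column names are placeholders rather than the schema from the Cloudera tutorial.

```python
# Minimal sketch: run a HiveQL query from Python via the Hive CLI.
# Assumes the `hive` client is on PATH and a `tweets` table exists;
# table and column names are placeholders, not the tutorial's schema.
import subprocess

QUERY = """
SELECT user_screen_name, COUNT(*) AS tweet_count
FROM tweets
GROUP BY user_screen_name
ORDER BY tweet_count DESC
LIMIT 10;
"""

output = subprocess.check_output(["hive", "-e", QUERY])
print(output.decode("utf-8"))
```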

BTW, is Twitter data becoming the “hello world” of data mining?

How-to: Analyze Twitter Data with Hue

Filed under: Hadoop,Hive,Hue,Tweets — Patrick Durusau @ 12:46 pm

How-to: Analyze Twitter Data with Hue by Romain Rigaux.

From the post:

Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.

This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.

The Hadoop ecosystem continues to improve!

Question: Is anyone keeping a current listing/map of the various components in the Hadoop ecosystem?

March 25, 2013

5 Pitfalls To Avoid With Hadoop

Filed under: Data Integration,Hadoop,MapReduce — Patrick Durusau @ 3:41 pm

5 Pitfalls To Avoid With Hadoop by Syncsort, Inc.

From the registration page:

Hadoop is a great vehicle to extract value from Big Data. However, relying only on Hadoop and common scripting tools like Pig, Hive and Sqoop to achieve a complete ETL solution can hinder success.

Syncsort has worked with early adopter Hadoop customers to identify and solve the most common pitfalls organizations face when deploying ETL on Hadoop.

  1. Hadoop is not a data integration tool
  2. MapReduce programmers are hard to find
  3. Most data integration tools don’t run natively within Hadoop
  4. Hadoop may cost more than you think
  5. Elephants don’t thrive in isolation

Before you give up your email and phone number for the “free ebook,” be aware it is a promotional piece for Syncsort DMX-h.

Which isn’t a bad thing but if you are expecting something different, you will be disappointed.

The observations are trivially true and amount to Hadoop not having a user-facing interface, pre-written routines for data integration, and the tools that data integration users normally expect.

OK, a hammer doesn’t come with blueprints, nails, wood, etc., but those aren’t “pitfalls.”

It’s the nature of a hammer that those “extras” need to be supplied.

You can either do that piecemeal or you can use a single source (the equivalent of Syncsort DMX-h).

Syncsort should be on your short list of data integration options to consider but let’s avoid loose talk about Hadoop. There is enough of that in the uninformed main stream media.

March 22, 2013

Apache Crunch (Top-Level)

Filed under: Apache Crunch,Hadoop,MapReduce — Patrick Durusau @ 12:34 pm

Apache Crunch (Top Level)

While reading Josh Wills post, Cloudera ML: New Open Source Libraries and Tools for Data Scientists, I saw that Apache Crunch became a top-level project at the Apache Software Foundation last month.

Congratulations to Josh and all the members of the Crunch community!

From the Apache Crunch homepage:

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines, and is based on Google’s FlumeJava library. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Running on top of Hadoop MapReduce, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.

You may be interested in: Crunch-133 Add Aggregator support for combineValues ops on secondary keys via maps and collections. It is an “open” issue.

March 17, 2013

M3R: Increased Performance for In-Memory Hadoop Jobs

Filed under: Hadoop,Main Memory Map Reduce (M3R),MapReduce — Patrick Durusau @ 3:42 pm

M3R: Increased Performance for In-Memory Hadoop Jobs by Avraham Shinnar, David Cunningham, Benjamin Herta, Vijay Saraswat.

Abstract:

Main Memory Map Reduce (M3R) is a new implementation of the Hadoop Map Reduce (HMR) API targeted at online analytics on high mean-time-to-failure clusters. It does not support resilience, and supports only those workloads which can fit into cluster memory. In return, it can run HMR jobs unchanged – including jobs produced by compilers for higher-level languages such as Pig, Jaql, and SystemML and interactive front-ends like IBM BigSheets – while providing significantly better performance than the Hadoop engine on several workloads (e.g. 45x on some input sizes for sparse matrix vector multiply). M3R also supports extensions to the HMR API which can enable Map Reduce jobs to run faster on the M3R engine, while not affecting their performance under the Hadoop engine.

The authors start with the assumption of “clean” data that has already been reduced to terabytes in size and that can be stored in main memory for “scores” of nodes as opposed to thousands of nodes. (score = 20)

And they make the point that main memory is only going to increase in the coming years.

While phrased as “interactive analytics (e.g. interactive machine learning),” I wonder if the design point is avoiding non-random memory?

And what consequences will entirely random memory access have on algorithm design? Or on the assumptions that drive algorithmic design?

One way to test the impact of large memory on design would be to award access to a cluster with several terabytes of data on a competitive basis, for some time period, with all the code, data, runs, etc., being streamed to a public forum.

One qualification being that the user not already have access to that level of computing power at work. 😉

I first saw this at Alex Popescu’s Paper: M3R – Increased Performance for In-Memory Hadoop Jobs.

March 16, 2013

Non-Word Count Hello World

Filed under: Hadoop,MapReduce — Patrick Durusau @ 4:11 pm

Finally! A Hadoop Hello World That Isn’t A Lame Word Count! by John Berryman.

From the post:

So I got bored of the old WordCount Hello World, and being a fairly mathy person, I decided to make my own Hello World in which I coaxed Hadoop into transposing a matrix!

What? What’s that you say? You think that a matrix transpose MapReduce is way more lame than a word count? Well I didn’t say that we were going to be saving the world with this MapReduce job, just flexing our mental muscles a little more. Typically, when you run the WordCount example, you don’t even look at the java code. You just pat yourself on the back when the word “the” is invariably revealed to be the most popular word in the English language.

The goal of this exercise is to present a new challenge and a simple challenge so that we can practice thinking about solving BIG problems under the sometimes unintuitive constraints of MapReduce. Ultimately I intend to follow this post up with exceedingly more difficult MapReduce problems to challenge you and encourage you to tackle your own problems.

So, without further ado:

As John says, not much beyond the Word Count examples but it is a different problem.
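
John works through the Java API; for comparison, here is how the transpose can be sketched with Hadoop Streaming in Python. This is my version, not the code from the post, and it assumes the input is tab-separated row, column, value triples.

```python
#!/usr/bin/env python
# mapper.py -- matrix transpose as a Hadoop Streaming job, sketch only.
# Input lines are assumed to be "row<TAB>col<TAB>value"; swapping row and
# column makes the shuffle group the output by the new row key.
import sys

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        continue                    # skip malformed lines
    row, col, value = parts
    print("%s\t%s\t%s" % (col, row, value))
```

Run it under the streaming jar with an identity reducer (even `-reducer cat` will do) and the sorted output is the transpose in the same triple format.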

The promise of more difficult MapReduce problems sounds intriguing.

Need to watch for following posts.

March 15, 2013

HBaseCon 2013

Filed under: Conferences,Hadoop — Patrick Durusau @ 12:45 pm

HBaseCon 2013


Abstracts are due by midnight on April 1, 2013.

Conference: Thursday, June 13, 2013
San Francisco Marriott Marquis

From the webpage:

Early Bird registration is now open (until April 23), and we’re asking all members of the community to submit abstracts for sessions pertaining to:

  • HBase internals and futures
  • Best practices for running HBase in production
  • HBase use cases and applications
  • How to contribute to HBase

Abstracts are due by midnight on April 1, 2013. You will be notified by the Program Committee about your proposal’s status by April 15, 2013.

Waiting for all the components in the Hadoop ecosystem to have separate but co-located conferences. That would be radically cool!

March 14, 2013

Introducing Parquet: Efficient Columnar Storage for Apache Hadoop

Filed under: Data Structures,Hadoop,Parquet — Patrick Durusau @ 9:35 am

Introducing Parquet: Efficient Columnar Storage for Apache Hadoop by Justin Kestelyn.

From the post:

We’d like to introduce a new columnar storage format for Hadoop called Parquet, which started as a joint project between Twitter and Cloudera engineers.

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language.

Parquet is built from the ground up with complex nested data structures in mind. We adopted the repetition/definition level approach to encoding such data structures, as described in Google’s Dremel paper; we have found this to be a very efficient method of encoding data in non-trivial object schemas.

Parquet is built to support very efficient compression and encoding schemes. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented. We separate the concepts of encoding and compression, allowing Parquet consumers to implement operators that work directly on encoded data without paying decompression and decoding penalty when possible.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult to set up dependencies.

Under heavy development so watch closely!
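
If the columnar idea itself is new to you, here is a toy Python contrast between row and column layout. It illustrates the general idea only; Parquet’s actual format, with repetition/definition levels and per-column encodings and compression, is far richer.

```python
# Toy contrast between row-oriented and column-oriented layouts.
# General illustration only; not Parquet's encoding or file format.
rows = [
    {"user": "alice", "age": 34, "city": "Austin"},
    {"user": "bob",   "age": 29, "city": "Boston"},
    {"user": "carol", "age": 41, "city": "Austin"},
]

# Row layout: whole records stored together; reading one column still
# touches every record.
row_store = [(r["user"], r["age"], r["city"]) for r in rows]
print(row_store[0])

# Column layout: each column stored contiguously; a query that only needs
# `age` scans just that list, and repeated values ("Austin") sit next to
# each other, which is what makes per-column compression effective.
columns = ("user", "age", "city")
column_store = {name: [r[name] for r in rows] for name in columns}
print(sum(column_store["age"]) / float(len(column_store["age"])))
```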

