Applying Parallel Prediction to Big Data

Applying Parallel Prediction to Big Data by Dan McClary (Principal Product Manager for Big Data and Hadoop at Oracle).

From the post:

One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them. Moreover, how can we leverage the power of Apache Hadoop to effectively operationalize the model-building process? In this post we’re going to take a look at a simple approach for applying well-known machine learning approaches to our big datasets. We’ll use Pig and Hadoop to quickly parallelize a standalone machine-learning program written in Jython.

Playing Weatherman

I’d like to predict the weather. Heck, we all would – there’s personal and business value in knowing the likelihood of sun, rain, or snow. Do I need an umbrella? Can I sell more umbrellas? Better yet, groups like the National Climatic Data Center offer public access to weather data stretching back to the 1930s. I’ve got a question I want to answer and some big data with which to do it. On first reaction, because I want to do machine learning on data stored in HDFS, I might be tempted to reach for a massively scalable machine learning library like Mahout.

For the problem at hand, that may be overkill and we can get it solved in an easier way, without understanding Mahout. Something becomes apparent on thinking about the problem: I don’t want my climate model for San Francisco to include the weather data from Providence, RI. Weather is a local problem and we want to model it locally. Therefore what we need is many models across different subsets of data. For the purpose of example, I’d like to model the weather on a state-by-state basis. But if I have to build 50 models sequentially, tomorrow’s weather will have happened before I’ve got a national forecast. Fortunately, this is an area where Pig shines.
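To make the "many models across different subsets of data" idea concrete, here is a minimal Python sketch. It is not Dan's code, which uses Pig and Jython on Hadoop. The records, the function names (`group_by_state`, `fit_line`, `fit_all`), and the choice of a simple least-squares trend line are all assumptions made for illustration. The grouping step plays roughly the role a Pig GROUP BY would, and the thread pool stands in for the cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical records: (state, day_index, temperature).
# The values are made up for illustration; they are not NCDC data.
RECORDS = [
    ("CA", 0, 60.0), ("CA", 1, 62.0), ("CA", 2, 64.0),
    ("RI", 0, 40.0), ("RI", 1, 39.0), ("RI", 2, 38.0),
]


def group_by_state(records):
    """Split the records into one list of (day, temp) points per state."""
    groups = defaultdict(list)
    for state, day, temp in records:
        groups[state].append((day, temp))
    return dict(groups)


def fit_line(points):
    """Least-squares fit of temp = slope * day + intercept for one state."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_all(records, max_workers=4):
    """Fit one independent model per state, submitting the fits to a pool."""
    groups = group_by_state(records)
    states = sorted(groups)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        fits = list(pool.map(fit_line, (groups[s] for s in states)))
    return dict(zip(states, fits))


MODELS = fit_all(RECORDS)
for state, (slope, intercept) in sorted(MODELS.items()):
    print(state, slope, intercept)
```

Each state's model depends only on that state's rows, which is what makes the work easy to spread across machines. A thread pool in a single Python process will not speed up CPU-bound fitting; it is here only to show the shape of the computation.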

Two quick observations:

First, Dan makes my point about your needing the “right” data, which may or may not be the same thing as “big data.” Decide what you want to do before you reach for big iron and big data.

Second, I never hear references to the “weatherman” without remembering: “You don’t need a weatherman to know which way the wind blows.” (link to the manifesto) If you prefer a softer version, there is “Subterranean Homesick Blues” by Bob Dylan.
