How-to: Use MADlib Pre-built Analytic Functions with Impala by Victor Bittorf.
From the post:
Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared-nothing parallelization for analytics and the new user-defined aggregations (UDA) framework in Impala 1.2 in order to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.
Having recently completed my master’s degree while working in the database systems group at the University of Wisconsin-Madison, I’m excited to work with the Impala team on this project while I continue my research as a visiting student at Stanford. I’m going to go through some details about what we’ve implemented and how to use it.
As interest in data analytics increases, there is growing demand for deploying analytic algorithms in enterprise systems. One approach that has received much attention from researchers, engineers and data scientists is the integration of statistical data analysis into databases. One example of this is MADlib, which leverages the data-processing capabilities of an RDBMS to analyze data.
Victor walks through several examples of data analytics, but for those of you who want to cut to the chase:
This package uses UDAs and UDFs when training and evaluating analytic models. While all of these tasks can be done in pure SQL using the Impala shell, we’ve put together some front-end scripts to streamline the process. The source code for the UDAs, UDFs, and scripts is all on GitHub.
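For readers new to the idea, the split between a UDA that trains a model and a UDF that evaluates it can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the package's code: real Impala UDAs are written in C++, every name here (`LinRegState`, `init`, `update`, `merge`, `finalize`, `predict`, `fit_partitioned`) is hypothetical, and the technique shown (summing sufficient statistics for one-variable least squares) was chosen for brevity and may not be what the package does internally. What it does show is why an aggregate with a merge step suits shared-nothing execution: each partition builds its own state and only the small states are combined.

```python
from dataclasses import dataclass


@dataclass
class LinRegState:
    """Running sums needed for one-variable least squares (hypothetical)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def init() -> LinRegState:
    # Each partition starts from an empty state.
    return LinRegState()


def update(state: LinRegState, x: float, y: float) -> LinRegState:
    # Fold one row into the partition-local state.
    state.n += 1
    state.sx += x
    state.sy += y
    state.sxx += x * x
    state.sxy += x * y
    return state


def merge(a: LinRegState, b: LinRegState) -> LinRegState:
    # Combine two partial states; no raw rows are exchanged.
    return LinRegState(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
                       a.sxx + b.sxx, a.sxy + b.sxy)


def finalize(state: LinRegState) -> tuple[float, float]:
    # Turn the merged sums into (slope, intercept).
    denom = state.n * state.sxx - state.sx ** 2
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (state.n * state.sxy - state.sx * state.sy) / denom
    intercept = (state.sy - slope * state.sx) / state.n
    return slope, intercept


def predict(model: tuple[float, float], x: float) -> float:
    # The per-row "UDF" side: apply a trained model to one input.
    slope, intercept = model
    return slope * x + intercept


def fit_partitioned(partitions) -> tuple[float, float]:
    # Simulate shared-nothing execution: aggregate each partition
    # independently, then merge the partial states and finalize.
    partials = []
    for rows in partitions:
        state = init()
        for x, y in rows:
            update(state, x, y)
        partials.append(state)
    total = init()
    for state in partials:
        total = merge(total, state)
    return finalize(total)


if __name__ == "__main__":
    data = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
    model = fit_partitioned(data)
    print(model, predict(model, 4.0))
```

Because `merge` only adds sums, the result does not depend on how the rows are split across partitions, which is the property that lets this kind of aggregate scale out.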
Usual cautions apply: The results of your script or model may or may not have any resemblance to “facts” as experienced by others.