
Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

October 30, 2013

MADlib

Filed under: Analytics,Machine Learning,MADlib,Mathematics,Statistics — Patrick Durusau @ 6:58 pm

From the webpage:

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.

The MADlib mission: to foster widespread development of scalable analytic skills, by harnessing efforts from commercial practice, academic research, and open-source development.

Until the Impala post called my attention to it, I didn’t realize that MADlib had been upgraded to version 1.3 earlier in October!

Congratulations to MADlib!


Use MADlib Pre-built Analytic Functions….

Filed under: Analytics,Cloudera,Impala,Machine Learning,MADlib — Patrick Durusau @ 6:53 pm

How-to: Use MADlib Pre-built Analytic Functions with Impala by Victor Bittorf.

From the post:

Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared nothing parallelization for analytics and the new user-defined aggregations (UDA) framework in Impala 1.2 in order to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.

Having recently completed my masters degree while working in the database systems group at University of Madison Wisconsin, I’m excited to work with the Impala team on this project while I continue my research as a visiting student at Stanford. I’m going to go through some details about what we’ve implemented and how to use it.

As interest in data analytics increases, there is growing demand for deploying analytic algorithms in enterprise systems. One approach that has received much attention from researchers, engineers and data scientists is the integration of statistical data analysis into databases. One example of this is MADlib, which leverages the data-processing capabilities of an RDBMS to analyze data.

Victor walks through several examples of data analytics, but for those of you who want to cut to the chase:

This package uses UDAs and UDFs when training and evaluating analytic models. While all of these tasks can be done in pure SQL using the Impala shell, we’ve put together some front-end scripts to streamline the process. The source code for the UDAs, UDFs, and scripts are all on GitHub.
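
For readers who have not met user-defined aggregates before, here is a minimal sketch of the general pattern in plain Python. It is not the package's code and not Impala's UDA API: the names init, update, merge, finalize and fit_partitions are my own illustrative choices, and the model is a one-variable least-squares fit rather than anything from Victor's post. The point is only that each node can summarize its own rows into a small state, and the states can then be merged without any node seeing another node's data.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Running sums needed for a one-variable least-squares fit."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def init():
    """Create an empty aggregation state (hypothetical name)."""
    return State()


def update(state, x, y):
    """Fold one (x, y) row into the state held by a single node."""
    state.n += 1
    state.sx += x
    state.sy += y
    state.sxx += x * x
    state.sxy += x * y
    return state


def merge(a, b):
    """Combine two partial states; addition makes the order irrelevant."""
    return State(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
                 a.sxx + b.sxx, a.sxy + b.sxy)


def finalize(state):
    """Turn the merged sums into (intercept, slope), or None if undefined."""
    denom = state.n * state.sxx - state.sx ** 2
    if state.n < 2 or denom == 0:
        return None
    slope = (state.n * state.sxy - state.sx * state.sy) / denom
    intercept = (state.sy - slope * state.sx) / state.n
    return intercept, slope


def fit_partitions(partitions):
    """Aggregate each partition independently, then merge and finalize."""
    merged = init()
    for rows in partitions:
        local = init()
        for x, y in rows:
            update(local, x, y)
        merged = merge(merged, local)
    return finalize(merged)


if __name__ == "__main__":
    # Two "nodes", each holding part of the points on the line y = 1 + 2x.
    parts = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
    print(fit_partitions(parts))
```

Because the state is just a handful of sums, splitting the rows differently across partitions gives the same fit. Iterative methods such as logistic regression and SVMs need more machinery than this, which is what the post itself covers.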

Usual cautions apply: The results of your script or model may or may not bear any resemblance to “facts” as experienced by others.
