Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 20, 2011

MAD Skills: New Analysis Practices for Big Data

Filed under: Analytics,BigData,Data Integration,SQL — Patrick Durusau @ 3:33 pm

MAD Skills: New Analysis Practices for Big Data by Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton.

Abstract:

As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world’s largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.

I found this passage very telling:

These desires for speed and breadth of data raise tensions with Data Warehousing orthodoxy. Inmon describes the traditional view:

There is no point in bringing data … into the data warehouse environment without integrating it. If the data arrives at the data warehouse in an unintegrated state, it cannot be used to support a corporate view of data. And a corporate view of data is one of the essences of the architected environment [13]

Unfortunately, the challenge of perfectly integrating a new data source into an “architected” warehouse is often substantial, and can hold up access to data for months – or in many cases, forever. The architectural view introduces friction into analytics, repels data sources from the warehouse, and as a result produces shallow incomplete warehouses. It is the opposite of the MAD ideal.

Marketing question for topic maps: Do you want a shallow, incomplete data warehouse?

Admittedly, there is more to it: topic maps enable the integration of both data structures and the data itself. Both are subjects in the view of topic maps. Topic maps can also capture the reasons why certain structures or data were mapped to other structures or data. I think the name for that is an audit trail.

Perhaps we should ask: Does your data integration methodology offer an audit trail?
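To make the audit trail idea concrete, here is a minimal sketch in Python. It is not the topic maps data model and not anything from the MAD Skills paper; the `Mapping` and `MappingLog` classes, the column names, and the reasons are all hypothetical, chosen only to illustrate recording why one structure was mapped to another.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Mapping:
    """One recorded decision that two identifiers denote the same subject (hypothetical)."""
    source: str   # e.g. a column in a newly loaded table
    target: str   # e.g. a column in an existing table
    reason: str   # why the two were judged to be the same subject
    author: str   # who made the decision
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class MappingLog:
    """Append-only list of mappings; the list itself serves as the audit trail."""

    def __init__(self):
        self._entries = []

    def record(self, source, target, reason, author):
        """Store a mapping together with the reason it was made."""
        entry = Mapping(source, target, reason, author)
        self._entries.append(entry)
        return entry

    def why(self, identifier):
        """Return every recorded mapping that involves the given identifier."""
        return [m for m in self._entries if identifier in (m.source, m.target)]


# Hypothetical column names and reasons, for illustration only.
log = MappingLog()
log.record("clicks_raw.uid", "users.user_id",
           "both columns hold the same visitor identifier", "analyst_a")
log.record("crm.customer_no", "users.user_id",
           "matched on e-mail address during import", "analyst_b")

for m in log.why("users.user_id"):
    print(f"{m.source} -> {m.target}: {m.reason} ({m.author})")
```

The point of the sketch is that the mapping and its justification are stored together, so a later reader can ask why two structures were treated as the same subject, and by whom, rather than only seeing the merged result.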

(See MADlib for the source code growing out of this effort.)
