Lab Report: The Final Grade, by Dr. Geoffrey Malafsky.
From the post:
We have completed our TechLab series with Cloudera. Its objective was to explore the ability of Hadoop in general, and Cloudera’s distribution in particular, to meet the growing need for rapid, secure, adaptive merging and correction of core corporate data. I call this Corporate Small Data, which is:
“Structured data that is the fuel of an organization’s main activities, and whose problems with accuracy and trustworthiness are past the stage of being alleged. This includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence. This is Small Data relative to the much ballyhooed Big Data of the Terabyte range.”1
Corporate Small Data does not include the predominant Big Data examples which are almost all stochastic use cases. These can succeed even if there is error in the source data and uncertainty in the results since the business objective is getting trends or making general associations. In stark contrast are deterministic use cases, where the ramifications for wrong results are severely negative, such as for executive decision making, accounting, risk management, regulatory compliance, and security.
Dr. Malafsky gives Cloudera high marks (A-) for use in enterprises and for what he describes as “data normalization”: not normalization in the relational database sense, but in the data cleaning sense.
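To make that distinction concrete, here is a minimal sketch of “normalization” in the data cleaning sense: reconciling inconsistent representations of the same entity rather than decomposing tables into normal forms. The company names, lookup table, and cleaning rules below are hypothetical illustrations, not anything from the TechLab series.

```python
import re

# Hypothetical canonical forms for a handful of known vendors.
CANONICAL = {
    "acme corp": "ACME Corporation",
    "acme corporation": "ACME Corporation",
    "globex": "Globex Inc.",
    "globex inc": "Globex Inc.",
}

def normalize_name(raw: str) -> str:
    """Map a messy vendor string to its canonical form when known."""
    key = re.sub(r"[.,]", "", raw).strip().lower()  # drop punctuation
    key = re.sub(r"\s+", " ", key)                  # collapse whitespace
    return CANONICAL.get(key, raw.strip())          # fall back to the cleaned input

records = ["ACME Corp.", "acme   corporation", "Globex, Inc.", "Initech"]
cleaned = [normalize_name(r) for r in records]
# cleaned == ["ACME Corporation", "ACME Corporation", "Globex Inc.", "Initech"]
```

In a real deterministic use case the lookup table would itself be governed master data, which is exactly the kind of core corporate asset the question below is driving at.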
While testing a Cloudera distribution at your next data cleaning exercise, ask yourself this question: OK, the processing worked great, but how do I avoid having to collect all the information I needed for this project again in the future?