Hadoop and Data Quality, Data Integration, Data Analysis by David Loshin.
From the post:
If you have been following my recent thread, you will of course be anticipating this note, in which we examine the degree to which our favorite data-oriented activities are suited to the elastic yet scalable massive parallelism promised by Hadoop. Let me first summarize the characteristics of problems or tasks that are amenable to the programming model:
- Two-Phased (2-φ) – one or more iterations of “computation” followed by “reduction.”
- Big data – massive data volumes preclude using traditional platforms
- Data parallel (Data-||) – little or no data dependence
- Task parallel (Task-||) – task dependence collapsible within phase-switch from Map to Reduce
- Unstructured data – No limit on requiring data to be structured
- Communication “light” – requires limited or no inter-process communication except what is required for phase-switch from Map to Reduce
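To make those characteristics concrete, here is a minimal single-process sketch of the two-phase shape, written in plain Python rather than against the Hadoop API. It is not taken from David's post. The task (counting empty fields per column, a simple data-profiling check), the sample records, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `run` are all assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit (field, 1) for each empty field in one record.

    Each record is handled on its own, which is the data-parallel property.
    """
    return [(field, 1) for field, value in record.items() if value in ("", None)]


def shuffle(pairs):
    """Phase switch: group emitted values by key.

    In this sketch it is the only point where data from different
    records is brought together.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run(records):
    """Run map over every record, then shuffle, then reduce per key."""
    emitted = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(emitted).items())


# Hypothetical sample data, invented for this sketch.
records = [
    {"name": "Ann", "email": "", "zip": "12345"},
    {"name": "", "email": "", "zip": None},
    {"name": "Bob", "email": "bob@example.com", "zip": "54321"},
]

missing_counts = run(records)
print(missing_counts)
```

Nothing in `map_phase` looks at more than one record, and the grouping step is the only communication, so a task shaped like this one would seem to tick the data-parallel and communication-light boxes in the list above.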
OK, so I happen to agree with David’s conclusions (see his post for the table). That isn’t the only reason I posted this note.
Rather, I think this sort of careful analysis lends itself to test cases, which we can post and share along with a specification of the tasks performed.
Much cleaner and more enjoyable than the debates measured by who can sink the lowest fastest.
Test cases to suggest, anyone?