Of Algebirds, Monoids, Monads, and Other Bestiary for Large-Scale Data Analytics by Michael G. Noll.
From the post:
Have you ever asked yourself what monoids and monads are, and particularly why they seem to be so attractive in the field of large-scale data processing? Twitter recently open-sourced Algebird, which provides you with a JVM library to work with such algebraic data structures. Algebird is already being used in Big Data tools such as Scalding and SummingBird, which means you can use Algebird as a mechanism to plug your own data structures – e.g. Bloom filters, HyperLogLog – directly into large-scale data processing platforms such as Hadoop and Storm. In this post I will show you how to get started with Algebird, introduce you to monoids and monads, and address the question why you get interested in those in the first place.
Goal of this article
The main goal of this is article is to spark your curiosity and motivation for Algebird and the concepts of monoid, monads, and category theory in general. In other words, I want to address the questions “What’s the big deal? Why should I care? And how can these theoretical concepts help me in my daily work?”
You can call this a “blog post” but I rarely see blog posts with a table of contents! 😉
The post should come with a warning: May require substantial time to read, digest, understand.
Just so you know, I was hooked by this paragraph early on:
So let me use a different example because adding
Int
values is indeed trivial. Imagine that you are working on large-scale data analytics that make heavy use of Bloom filters. Your applications are based on highly-parallel tools such as Hadoop or Storm, and they create and work with many such Bloom filters in parallel. Now the money question is: How do you combine or add two Bloom filters in an easy way?
Are you motivated?
I first saw this in a tweet by CompSciFact.