Applying the Big Data Lambda Architecture by Michael Hausenblas.
From the article:
Based on his experience working on distributed data processing systems at Twitter, Nathan Marz recently designed a generic architecture addressing common requirements, which he called the Lambda Architecture. Marz is well-known in Big Data: He’s the driving force behind Storm and at Twitter he led the streaming compute team, which provides and develops shared infrastructure to support critical real-time applications.
Marz and his team described the underlying motivation for building systems with the lambda architecture as:
- The need for a robust system that is fault-tolerant, both against hardware failures and human mistakes.
- To serve a wide range of workloads and use cases, in which low-latency reads and updates are required. Related to this point, the system should support ad-hoc queries.
- The system should be linearly scalable, and it should scale out rather than up, meaning that throwing more machines at the problem will do the job.
- The system should be extensible so that features can be added easily, and it should be easily debuggable and require minimal maintenance.
From a bird’s eye view the lambda architecture has three major components that interact with new data coming in and respond to queries, which in this article are driven from the command line.
The goal of the article:
In this article, I employ the lambda architecture to implement what I call UberSocialNet (USN). This open-source project enables users to store and query acquaintanceship data. That is, I want to be able to capture whether I happen to know someone from multiple social networks, such as Twitter or LinkedIn, or from real-life circumstances. The aim is to scale out to several billions of users while providing low-latency access to the stored information. To keep the system simple and comprehensible, I limit myself to bulk import of the data (no capabilities to live-stream data from social networks) and provide only a very simple command-line user interface. The guts, however, use the lambda architecture.
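To make the idea concrete, here is a minimal in-memory sketch of how a lambda-style system might store and query acquaintanceship data. It is not the USN implementation. The names `Acquaintance` and `LambdaSketch` are hypothetical, and the split into an append-only master dataset, a precomputed batch view, and a buffer of recent facts follows the general lambda architecture pattern rather than anything spelled out in the excerpt above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Acquaintance:
    """One immutable fact: `person` knows `other` via `network` (hypothetical schema)."""
    person: str
    other: str
    network: str


class LambdaSketch:
    """Toy stand-in for the three components; everything lives in memory."""

    def __init__(self):
        self.master = []                    # append-only master dataset
        self.batch_view = defaultdict(set)  # view precomputed from the master dataset
        self.recent = []                    # facts that arrived since the last batch run

    def ingest(self, fact):
        # New data is never updated in place; it is only appended.
        self.master.append(fact)
        self.recent.append(fact)

    def run_batch(self):
        # Recompute the whole view from scratch, then drop the recent buffer,
        # because every fact in it is now covered by the batch view.
        view = defaultdict(set)
        for fact in self.master:
            view[fact.person].add((fact.other, fact.network))
        self.batch_view = view
        self.recent.clear()

    def query(self, person):
        # Merge the precomputed view with facts the batch run has not seen yet.
        result = set(self.batch_view.get(person, set()))
        for fact in self.recent:
            if fact.person == person:
                result.add((fact.other, fact.network))
        return result


usn = LambdaSketch()
usn.ingest(Acquaintance("alice", "bob", "Twitter"))
usn.run_batch()
usn.ingest(Acquaintance("alice", "carol", "LinkedIn"))
print(sorted(usn.query("alice")))
```

Recomputing the view from the full master dataset is what gives this pattern its tolerance for human mistakes: a buggy view can be thrown away and rebuilt, since the raw facts are never modified. A real system would replace each in-memory structure with distributed storage and processing.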
Something a bit challenging for the start of the week. 😉