Architectural Patterns for Near Real-Time Data Processing with Apache Hadoop by Ted Malaska.
From the post:
Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.
The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together, but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited for different problems.
In this post, I will outline the four major streaming patterns that we have encountered with customers running enterprise data hubs in production, and explain how to implement those patterns architecturally on Hadoop.
Streaming Patterns
The four basic streaming patterns (often used in tandem) are:
- Stream ingestion: Involves low-latency persisting of events to HDFS, Apache HBase, and Apache Solr.
- Near Real-Time (NRT) Event Processing with External Context: Takes actions like alerting, flagging, transforming, and filtering of events as they arrive. Actions might be taken based on sophisticated criteria, such as anomaly detection models. Common use cases, such as NRT fraud detection and recommendation, often demand low latencies under 100 milliseconds.
- NRT Event Partitioned Processing: Similar to NRT event processing, but deriving benefits from partitioning the data—like storing more relevant external information in memory. This pattern also requires processing latencies under 100 milliseconds.
- Complex Topology for Aggregations or ML: The holy grail of stream processing: gets real-time answers from data with a complex and flexible set of operations. Here, because results often depend on windowed computations and require more active data, the focus shifts from ultra-low latency to functionality and accuracy.
In the following sections, we’ll get into recommended ways for implementing such patterns in a tested, proven, and maintainable way.
…
Great post on patterns for near real-time data processing.
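To make the second pattern (“NRT Event Processing with External Context”) a bit more concrete, here is a minimal sketch of my own, not from the post: a Spark Streaming job that reads events from Kafka and flags them against external context. The topic name, the comma-separated field layout, and the broadcast blacklist standing in for an HBase lookup are all assumptions for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils  // spark-streaming-kafka artifact (Spark 1.x)

object NrtAlertingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("nrt-event-processing-sketch")
    // One-second micro-batches; hosts, group id, and topic name are placeholders.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Receiver-based Kafka stream; keep only the message value.
    val events = KafkaUtils
      .createStream(ssc, "zkhost:2181", "nrt-consumer-group", Map("events" -> 1))
      .map(_._2)

    // External context: a broadcast blacklist stands in for a lookup against
    // HBase or another low-latency store.
    val blacklist = ssc.sparkContext.broadcast(Set("badactor1", "badactor2"))

    // Flag events whose first comma-separated field (assumed to be an account id)
    // appears in the blacklist.
    val alerts = events
      .map(line => line.split(",", 2))
      .filter(fields => blacklist.value.contains(fields(0)))

    alerts.foreachRDD { rdd =>
      // In a real deployment this would write to HBase, Kafka, or an alerting system.
      rdd.foreach(fields => println(s"ALERT: suspicious event for account ${fields(0)}"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that Spark’s micro-batching puts a floor on latency; for the strict sub-100-millisecond cases Malaska mentions, a per-event engine like Storm would be the more natural fit, but the shape of the enrichment logic is the same.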
What I have always wondered is how much of a use case there really is for “near real-time processing” of data. If human decision makers are in the loop (that is, outside of e-commerce and algorithmic trading), what is the value-add of “near real-time processing” of data?
For example, Kai Wähner, in Real-Time Stream Processing as Game Changer in a Big Data World with Hadoop and Data Warehouse, gives the following as common use cases for “near real-time processing” of data:
- Network monitoring
- Intelligence and surveillance
- Risk management
- E-commerce
- Fraud detection
- Smart order routing
- Transaction cost analysis
- Pricing and analytics
- Market data management
- Algorithmic trading
- Data warehouse augmentation
E-commerce, smart order routing, and algorithmic trading all fall into the category of no human in the loop, so those may genuinely need real-time processing.
But take network monitoring, for example. From news reports, I understand that hackers had free run of the Sony network for months. I suppose you must have network monitoring in place before real-time network monitoring can be useful at all.
I would probe to make sure that “real-time” is actually necessary for the use cases at hand before simply assuming it. In smaller organizations, demands for access to data and “real-time” results are more often a symptom of control issues than of any actual use case for the data.