Archive for the ‘Samza’ Category


Sunday, December 14th, 2014

GearPump (GitHub)

From the wiki homepage:

GearPump is a lightweight, real-time, big data streaming engine. It is inspired by recent advances in the Akka framework and a desire to improve on existing streaming frameworks. GearPump draws from a number of existing frameworks including MillWheel, Apache Storm, Spark Streaming, Apache Samza, Apache Tez, and Hadoop YARN while leveraging Akka actors throughout its architecture.

What originally caught my attention was this passage on the GitHub page:

Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

Think about that for a second.

Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

The GitHub page features a word count example and pointers to the wiki with more examples.

What if every topic “knew” the index value of every topic that should merge with it on display to a user?

When added to a topic map it broadcasts its merging property values and any topic with those values responds by transmitting its index value.

When you retrieve a topic, it has all the IDs necessary to create a merged view of the topic on the fly and on the client side.

There would be redundancy in the map but de-duplication for storage space went out with preferences for 7-bit character values to save memory space. So long as every topic returns the same result, who cares?

Well, it might make a difference when the CIA want to give every contractor full access to its datastores 24×7 via their cellphones. But, until that is an actual requirement, I would not worry about the storage space overmuch.

I first saw this in a tweet from Suneel Marthi.


Sunday, September 22nd, 2013


From the webpage:

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

  • Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple call-back based "process message" API that should be familiar to anyone who's used Map/Reduce.
  • Managed state: Samza manages snapshotting and restoration of a stream processor's state. Samza will restore a stream processor's state to a snapshot consistent with the processor's last read messages when the processor is restarted.
  • Fault tolerance: Samza will work with YARN to restart your stream processor if there is a machine or processor failure.
  • Durability: Samza uses Kafka to guarantee that messages will be processed in the order they were written to a partition, and that no messages will ever be lost.
  • Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, re-playable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
  • Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
  • Processor isolation: Samza works with Apache YARN, which supports processor security through Hadoop's security model, and resource isolation through Linux CGroups.

Check out Hello Samza to try Samza. Read the Background page to learn more about Samza.

Ironic that I should find Samza when I was searching for the new incubator page for Storm. ­čśë

In fact, I found a comparison of Samza and Storm.

You can learn a great deal about Storm (and Samza) reading the comparison. It’s quite good.