Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

March 17, 2015

Can Spark Streaming survive Chaos Monkey?

Filed under: Software,Software Engineering,Spark — Patrick Durusau @ 12:57 pm

Can Spark Streaming survive Chaos Monkey? by Bharat Venkat, Prasanna Padmanabhan, Antony Arokiasamy, Raju Uppalap.

From the post:

Netflix is a data-driven organization that places emphasis on the quality of data collected and processed. In our previous blog post, we highlighted our use cases for real-time stream processing in the context of online recommendations and data monitoring. With Spark Streaming as our choice of stream processor, we set out to evaluate and share the resiliency story for Spark Streaming in the AWS cloud environment. A Chaos Monkey based approach, which randomly terminated instances or processes, was employed to simulate failures.

Spark on Amazon Web Services (AWS) is relevant to us as Netflix delivers its service primarily out of the AWS cloud. Stream processing systems need to be operational 24/7 and be tolerant to failures. Instances on AWS are ephemeral, which makes it imperative to ensure Spark’s resiliency.

If Spark were a commercial product, this is where you would see, in bold: "not a vendor report, from a customer."

You need to see the post for the details, but so you know what to expect:

| Component | Type | Behaviour on Component Failure | Resilient |
|---|---|---|---|
| Driver | Process | Client Mode: The entire application is killed | |
| | | Cluster Mode with supervise: The Driver is restarted on a different Worker node | |
| Master | Process | Single Master: The entire application is killed | |
| | | Multi Master: A STANDBY master is elected ACTIVE | |
| Worker Process | Process | All child processes (executor or driver) are also terminated and a new worker process is launched | |
| Executor | Process | A new executor is launched by the Worker process | |
| Receiver | Thread(s) | Same as Executor as they are long running tasks inside the Executor | |
| Worker Node | Node | Worker, Executor and Driver processes run on Worker nodes and the behavior is same as killing them individually | |
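
To make the failure behaviours above concrete, here is a toy sketch in Python. It does not touch Spark or the real Chaos Monkey. It copies the behaviour text into a lookup, "kills" one randomly chosen component per round, and reports how often the application would survive. The names `FAILURE_BEHAVIOUR`, `application_survives`, `chaos_round`, `survival_rate`, and the two sample deployments are mine, made up for illustration. The survive/fail reading is an assumption inferred from the behaviour text ("the entire application is killed" versus everything else), not something taken from the Netflix post.

```python
import random

# Behaviour on failure, transcribed from the list above.
# Keys are (component, mode); mode is None where only one behaviour is listed.
FAILURE_BEHAVIOUR = {
    ("Driver", "client"): "The entire application is killed",
    ("Driver", "cluster+supervise"): "The Driver is restarted on a different Worker node",
    ("Master", "single"): "The entire application is killed",
    ("Master", "multi"): "A STANDBY master is elected ACTIVE",
    ("Worker Process", None): "All child processes (executor or driver) are also "
                              "terminated and a new worker process is launched",
    ("Executor", None): "A new executor is launched by the Worker process",
    ("Receiver", None): "Same as Executor as they are long running tasks inside the Executor",
    ("Worker Node", None): "Worker, Executor and Driver processes run on Worker nodes "
                           "and the behavior is same as killing them individually",
}


def application_survives(component, mode=None):
    """Hypothetical helper: True unless the listed behaviour says the whole application dies.

    This is a simplification: it reads the behaviour text literally and does not
    model cascading failures (e.g. a Worker Node that happens to host the Driver).
    """
    return "application is killed" not in FAILURE_BEHAVIOUR[(component, mode)]


def chaos_round(deployment, rng):
    """Pick one component at random, 'kill' it, and return what the lookup predicts.

    deployment maps each component name to the mode it runs in (or None).
    """
    component = rng.choice(sorted(deployment))
    mode = deployment[component]
    behaviour = FAILURE_BEHAVIOUR[(component, mode)]
    return component, behaviour, application_survives(component, mode)


# Two illustrative deployments (assumed for this example, not from the post).
FRAGILE = {"Driver": "client", "Master": "single", "Worker Process": None,
           "Executor": None, "Receiver": None, "Worker Node": None}
HARDENED = {**FRAGILE, "Driver": "cluster+supervise", "Master": "multi"}


def survival_rate(deployment, rounds=1000, seed=0):
    """Fraction of simulated single-component kills the application survives."""
    rng = random.Random(seed)
    survived = sum(chaos_round(deployment, rng)[2] for _ in range(rounds))
    return survived / rounds


if __name__ == "__main__":
    print("fragile :", survival_rate(FRAGILE))
    print("hardened:", survival_rate(HARDENED))
```

Under this toy model the only kills that take the application down are the Driver in client mode and a single Master, which is the practical takeaway: run the Driver in cluster mode with supervise and keep a standby Master.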

I can think of few things more annoying than software that works, sometimes. If you want users to rely upon you, then your service will have to be reliable.

A performance post by Netflix is rumored to be in the offing!

Enjoy!
