Chaos Monkey released into the wild by Cory Bennett and Ariel Tseitlin
From the post:
We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. We are excited to make a long-awaited announcement today that will help others who embrace this approach.
We have written about our Simian Army in the past and we are now proud to announce that the source code for the founding member of the Simian Army, Chaos Monkey, is available to the community.
Do you think your applications can handle a troop of mischievous monkeys loose in your infrastructure? Now you can find out.
What is Chaos Monkey?
Chaos Monkey is a service which runs in the Amazon Web Services (AWS) that seeks out Auto Scaling Groups (ASGs) and terminates instances (virtual machines) per group. The software design is flexible enough to work with other cloud providers or instance groupings and can be enhanced to add that support. The service has a configurable schedule that, by default, runs on non-holiday weekdays between 9am and 3pm. In most cases, we have designed our applications to continue working when an instance goes offline, but in those special cases that they don’t, we want to make sure there are people around to resolve and learn from any problems. With this in mind, Chaos Monkey only runs within a limited set of hours with the intent that engineers will be alert and able to respond.
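To make the schedule described above concrete, here is a rough Python sketch of my own. It is not code from the Chaos Monkey source, and it makes no AWS calls. The names monkey_may_run, pick_victims and run_once are hypothetical, the groups are plain dictionaries standing in for ASGs, and choosing exactly one instance per group and treating 3pm as an exclusive cutoff are my assumptions, not details stated in the post.

```python
import random
from datetime import date, datetime, time

# Assumed defaults mirroring the schedule quoted above: 9am to 3pm.
OPEN = time(9, 0)
CLOSE = time(15, 0)  # treated as exclusive here (an assumption)


def monkey_may_run(now, holidays=frozenset()):
    """Return True if `now` is a non-holiday weekday inside the window."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    if now.date() in holidays:
        return False
    return OPEN <= now.time() < CLOSE


def pick_victims(groups, rng=random):
    """Pick one random instance from each non-empty group.

    `groups` maps a group name to a list of instance ids. One victim
    per group is an illustrative choice, not the tool's documented rule.
    """
    return {
        name: rng.choice(instances)
        for name, instances in groups.items()
        if instances
    }


def run_once(groups, now, holidays=frozenset(), rng=random):
    """Return the instances that would be terminated, or {} if off-hours."""
    if not monkey_may_run(now, holidays):
        return {}
    return pick_victims(groups, rng)
```

A real deployment would replace the returned dictionary with actual termination calls against the cloud provider. Keeping the selection separate from the schedule check makes it easy to dry-run the policy before letting it loose.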
At first I was unsure whether Netflix hopes its competitors will run Chaos Monkey or whether it really runs it internally. 😉
It certainly is a way to test your infrastructure, and quite possibly a selling point to clients who want more than projected or historical robustness.
Which makes me curious: allowing for different infrastructures, how would you stress test a topic map installation?
And do so on a regular basis?
I first saw this at Alex Popescu’s myNoSQL.