Image credit: Charles Daoud
Ever heard of Netflix's Simian Army?
It is the collection of tools Netflix has been arming itself with since 2011 to test its AWS infrastructure. These unusual tools are open source, widely adaptable, and a wonderfully chaotic way to test software for operational vulnerabilities.
In this post, we'll be focusing on Netflix's very first Simian soldier: Chaos Monkey.
What is Chaos Monkey?
In a post on the Netflix Tech Blog, the company explains that Chaos Monkey was built on the philosophy of preventing failure by purposely inciting it. Their definition of Chaos Monkey, in layman's terms, is this:
"Chaos Monkey [is] a tool that randomly disables our production instances to make sure we can survive this common type of failure without any customer impact."
Essentially, the idea is to unleash a "wild monkey" in your data center to wreak havoc and randomly shoot down your instances, all while you code. Typically, this nerve-racking exercise takes place on a business day, in a carefully monitored environment with experienced engineers on standby.
How does Chaos Monkey benefit developers?
If you're still wondering why Chaos Monkey is even necessary when there are already plenty of (less stressful) software testing techniques out there, here are a few succinct reasons:
- It offers a unique learning opportunity to test software, learn its weaknesses, and build automatic recovery systems to address them.
- Developers will be incentivized to design services that are highly resilient against back-end outages from the start.
- The real-time chaos trains developers to react to failures more efficiently and increases their confidence about their ability to deal with them in the future.
The end game of Chaos Monkey is this: if developers build automatic recovery systems to deal with a failure that occurred at 2 pm on a Wednesday, then they won't have to worry when the same failure recurs at 3 am on a Sunday.
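To make "automatic recovery" concrete, here is a toy simulation in Python. It is only a sketch under stated assumptions: the function names, the instance IDs, and the fixed desired capacity are all hypothetical, and nothing here calls a real cloud API. In practice, this job belongs to whatever replacement mechanism your platform provides.

```python
import random


def terminate_random(instances, rng):
    """Simulate the monkey: remove one running instance at random."""
    victim = rng.choice(sorted(instances))
    instances.remove(victim)
    return victim


def reconcile(instances, desired_count, next_id):
    """Simulate automatic recovery: launch replacements until capacity is restored."""
    launched = []
    while len(instances) < desired_count:
        new_id = f"i-{next_id + len(launched)}"  # hypothetical ID scheme
        instances.add(new_id)
        launched.append(new_id)
    return launched


rng = random.Random()
pool = {"i-1", "i-2", "i-3"}  # hypothetical running instances
killed = terminate_random(pool, rng)
replacements = reconcile(pool, desired_count=3, next_id=4)
print(f"terminated {killed}, launched {replacements}")
```

The point of the sketch is that the recovery step does not care when the failure happened or who was awake to see it. The same loop restores capacity on a Wednesday afternoon or a Sunday night.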
Chaos Monkey using Spinnaker
Spinnaker, a pioneering platform for safe and reliable software deployments to the cloud, has first-class integration with Netflix’s Chaos Monkey to ensure your apps are fool-proof.
Chaos Monkey widget in Spinnaker.
The Chaos Monkey behaviors you can control are the following:
- Termination frequency: By default, Chaos Monkey has a mean time between terminations of two days. This means that, on average, it shuts down one instance in each group every two days, though the actual gap between terminations varies. You can also set a minimum time between terminations, which defaults to one day, meaning the monkey won't kill more than once a day for each group.
- Grouping: Chaos Monkey operates on groups of instances. You can configure whether the monkey defines a 'group' as an app, stack, or cluster. It will then terminate a random instance within one of those groups.
- Exceptions: You can keep the monkey away from certain instances. In the image shown above, Chaos Monkey is set to not touch instances in the prod account in the us-west-2 region with a stack of "staging" and a blank detail.
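The three settings above can be sketched as a small Python model. This is an illustration, not Chaos Monkey's actual implementation or configuration schema. Every class, field, and function name below is hypothetical, and the "daily chance of 1/mean" rule is an assumption chosen because it yields the configured mean over time.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Instance:
    instance_id: str
    account: str
    region: str
    stack: str
    detail: str


@dataclass
class MonkeyConfig:
    # Field names are illustrative, not Chaos Monkey's real schema.
    mean_days_between_kills: int = 2
    min_days_between_kills: int = 1
    # Each exception is an (account, region, stack, detail) tuple.
    exceptions: list = field(default_factory=list)


def is_exempt(inst, config):
    """Return True if the instance matches any configured exception."""
    key = (inst.account, inst.region, inst.stack, inst.detail)
    return key in config.exceptions


def pick_victim(group, config, days_since_last_kill, rng):
    """Return one instance from the group to terminate today, or None."""
    # Respect the minimum time between terminations for this group.
    if days_since_last_kill < config.min_days_between_kills:
        return None
    # Assumption: a daily chance of 1/mean averages one kill per `mean` days.
    if rng.random() >= 1.0 / config.mean_days_between_kills:
        return None
    eligible = [inst for inst in group if not is_exempt(inst, config)]
    return rng.choice(eligible) if eligible else None
```

With the exception from the widget expressed as `("prod", "us-west-2", "staging", "")`, an instance matching all four values is never returned by `pick_victim`, while the other instances in the same group remain fair game.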
For information on how to build and deploy Chaos Monkey using Spinnaker, check this guide on GitHub. You can also tap into the brains of experienced developers on the Spinnaker Slack Channel for personalized assistance.
Get hands-on instruction on configuring Chaos Monkey in Spinnaker
As someone who "sleeps better as more apps prove their resiliency in production" thanks to Chaos Monkey, Basgall is more than ready to show you exactly how to configure and run Chaos Monkey to integrate with a running Spinnaker installation.
If you want to get the how-to from Basgall himself, join us and the open source community at the annual Spinnaker Summit. You'll also have the chance to put your questions to industry leaders from companies like Google, Netflix, and Armory.
Tickets to the Summit are being snapped up by the minute. Grab your ticket here to explore Spinnaker and network with the best and brightest in software delivery.