UPDATED 06:00 EDT / AUGUST 01 2012

As Promised, Netflix Open Sources Chaos Monkey

Content streaming and movie rental service Netflix uses cloud computing to power its core operations. In fact, the bulk of Netflix’s infrastructure is cloud-based, and it is one of Amazon Web Services’ (AWS) largest customers. Netflix has developed an entire arsenal of tools that help it manage its massive cloud environment and handle outages and technical issues more efficiently.

Netflix refers to these tools as the Simian Army. The software includes colorfully named items like Latency Monkey, Chaos Gorilla and Chaos Monkey. If you couldn’t tell by the name, Chaos Monkey is a scaled-down version of Chaos Gorilla. (Who says developers don’t have a sense of humor?)

Chaos Monkey is a service that runs on AWS and improves application resiliency by helping ensure an application can remain running if an instance unexpectedly shuts down – a universally helpful capability for any cloud-based application. Chaos Monkey works by randomly killing instances. If an application is well designed, the outage of a single node shouldn’t impact it. Developers can use the service to identify unnecessary dependencies and weed out architectural problems. Chaos Monkey was developed for AWS, but according to Netflix it is flexible enough to work with other cloud providers.
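The idea of randomly killing instances can be illustrated with a small simulation. This is only a minimal sketch in Python, not Netflix's code: the `Instance`, `ServiceGroup` and `unleash_monkey` names are hypothetical, and nothing here talks to AWS. It assumes a service stays available as long as at least one of its instances is still running.

```python
import random


class Instance:
    """Hypothetical stand-in for a single cloud instance."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.running = True


class ServiceGroup:
    """Hypothetical group of interchangeable instances behind one service."""

    def __init__(self, name, size):
        self.name = name
        self.instances = [Instance(f"{name}-{i}") for i in range(size)]

    def running_instances(self):
        return [inst for inst in self.instances if inst.running]

    def is_available(self):
        # Assumption: the service is up while any instance is still running.
        return len(self.running_instances()) > 0


def unleash_monkey(group, rng):
    """Terminate one randomly chosen running instance; return it, or None."""
    candidates = group.running_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


if __name__ == "__main__":
    rng = random.Random(2012)  # seeded so repeated runs pick the same victims
    redundant = ServiceGroup("api", size=3)
    single = ServiceGroup("legacy", size=1)
    for group in (redundant, single):
        victim = unleash_monkey(group, rng)
        print(f"killed {victim.instance_id}; "
              f"{group.name} available: {group.is_available()}")
```

In this toy model the three-instance group survives the loss of one node, while the single-instance group does not, which is the kind of hidden single point of failure the tool is meant to expose.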

As promised in April, Netflix has made the code publicly available as open source. The company announced Chaos Monkey’s open source launch in an official blog post. According to the post, developers who use the service can be confident the tool has already been field-tested. The announcement explained,

 “Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don’t happen again.”

The code for Chaos Monkey is available on GitHub. Janitor Monkey, a tool similar to Cloudability that tracks down unused resources, might be the next candidate for open sourcing.

Incidents like the recent Amazon outage and Azure’s Western European blackout show the importance of such solutions. In spite of Netflix’s preparation, the AWS failure still took the service down, although Netflix’s availability architecture did reduce the impact of the damage.

