LinkedIn’s newest open-source tool will crash your application to test its resilience
Practically every important application in the enterprise today comes with safeguards designed to protect against common technical problems like server outages. But there’s often a big difference between how a service is expected to handle an issue and how it does so in practice, which requires organizations to painstakingly test for any weak points that may have slipped through their quality controls. LinkedIn Inc. moved to ease the chore this week by open-sourcing the homegrown system that its engineers use internally to assess the resilience of its infrastructure.
The company developed Simoorg, as the software is called, after finding older failure induction technologies like Chaos Monkey (the brainchild of fellow web giant Netflix Inc.) to be inadequate for its purposes. LinkedIn needed a tool that could not only check how well a workload deals with technical trouble in general, but also simulate the specific operational conditions under which its internal processes are likely to run into trouble. That includes details as small as the amount of traffic an application handles and how much latency it’s experiencing.
Simoorg also provides the ability to customize the way a test is carried out to ensure that it reflects what a real-life outage would look like. An engineer could point the system at a certain group of servers, set how long each machine will be taken offline and then specify the precise sequence in which the process should be executed. LinkedIn even included the option to have hardware components disabled in a random order, an addition that makes it possible to check how an application performs in situations that the IT department can’t necessarily anticipate.
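To make the idea concrete, here is a minimal Python sketch of such a test plan: a group of servers, a fixed downtime per machine, and either a precise or a random order. It is an illustration only and does not use Simoorg's actual API or configuration format. The names `FailureStep`, `build_plan` and `run_plan`, and the `take_offline`, `bring_online` and `sleep` callbacks, are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class FailureStep:
    """One step of a hypothetical failure plan (not a Simoorg type)."""
    host: str
    downtime_seconds: int


def build_plan(hosts, downtime_seconds, randomize=False, seed=None):
    """Return the ordered list of steps for a group of servers.

    With randomize=False the hosts are hit in the exact order given.
    With randomize=True the order is shuffled; passing a seed makes
    the shuffled order reproducible between runs.
    """
    order = list(hosts)
    if randomize:
        random.Random(seed).shuffle(order)
    return [FailureStep(host=h, downtime_seconds=downtime_seconds) for h in order]


def run_plan(plan, take_offline, bring_online, sleep):
    """Execute the plan one host at a time.

    take_offline, bring_online and sleep are supplied by the caller,
    so this sketch makes no assumption about how a machine is
    actually disabled or restored.
    """
    for step in plan:
        take_offline(step.host)
        sleep(step.downtime_seconds)
        bring_online(step.host)
```

A caller would pass `time.sleep` and real infrastructure hooks for an actual run, or simple recording functions for a dry run that only logs what would happen.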
The versatility of Simoorg enables organizations to simulate everything from the effects of a bad patch to severe hardware failures spread across an entire data center. Its customizability also allows for tests to be tweaked with relative ease, which gives users the ability to explore more nuanced issues like whether a service’s susceptibility to hardware outages increases above a certain traffic threshold. The knowledge gleaned using the system is useful both for developers looking to improve the resilience of their applications and for operations professionals charged with troubleshooting problems with their organizations’ infrastructure.