Facebook’s Worst Outage Makes Users Yearn for the Fail Whale
Facebook was perhaps the biggest topic in the cloud yesterday, as the site went down for 2.5 hours in its worst outage in four years. The problem was not limited to the website: its widgets, applications and the Like button, which seems to be everywhere on the internet, were down as well. The outage drew innumerable complaints from users.
The problem was caused by Facebook’s automated system that checks for invalid configuration values in its cache. Instead of repairing the problem, the system made it worse, forcing Facebook to make the tough decision to shut the entire site down in order to prevent data loss.
“Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second,” says Robert Johnson, Director of Software Engineering at Facebook.
“To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover,” he added.
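The feedback loop Johnson describes can be illustrated with a small sketch. This is not Facebook’s code: the function names, the `DatabaseError` exception and the `feature_flag` key are all hypothetical, and the cache is modeled as a plain dictionary. It only contrasts a client that deletes a cache key whenever a query fails with an assumed alternative that leaves the cached entry in place on errors.

```python
class DatabaseError(Exception):
    """Hypothetical error raised when the database cluster cannot answer."""


def naive_refresh(cache, key, query_db):
    # Mirrors the behavior in the quote: a query error is treated like an
    # invalid value, so the corresponding cache key is deleted.
    try:
        cache[key] = query_db(key)
    except DatabaseError:
        cache.pop(key, None)


def careful_refresh(cache, key, query_db):
    # Assumed alternative: a query error leaves the cached entry untouched.
    try:
        cache[key] = query_db(key)
    except DatabaseError:
        pass


def read(cache, key, query_db, refresh):
    # A cache miss sends the client back to the database.
    if key not in cache:
        refresh(cache, key, query_db)
    return cache.get(key)


def count_queries(refresh, reads=5):
    """Count database queries issued while the cluster is failing."""
    calls = 0

    def overloaded_db(key):
        nonlocal calls
        calls += 1
        raise DatabaseError("cluster overloaded")

    cache = {"feature_flag": "on"}
    refresh(cache, "feature_flag", overloaded_db)  # one failed repair attempt
    for _ in range(reads):
        read(cache, "feature_flag", overloaded_db, refresh)
    return calls
```

In this toy model, `count_queries(naive_refresh)` returns 6, because every read after the first failure misses the cache and hits the database again, while `count_queries(careful_refresh)` returns 1, because the cached entry survives the failed query.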
So did Facebook handle its outage well? The company blog was updated, indicating Facebook’s dedication to conveying ongoing issues to its massive user base. But some users complain that human testing should have prevented such debilitating errors in the first place, and that writing a blog post about the issues after the fact does very little to address user concerns.
Several users also suggested that Facebook create an error message page, similar to the Twitter fail whale, to keep them informed while a problem is ongoing. Perhaps that’s why Twitter users don’t seem too put off by site inaccessibility, as long as that fail whale is there to keep them in the know.