Azure Outage Caused by Networking Glitch, GM Says

UPDATED 17:00 EDT / JULY 31 2012

Microsoft is one of the latest companies trying to clean up reputation damage inflicted by an outage of its cloud services. Mike Neil, the general manager in charge of Microsoft’s Windows Azure, wrote a blog post explaining the reason for the service interruption that hit customers in Western Europe last week. The blackout happened on July 26 and made Azure’s Compute Service unavailable for about two and a half hours.

Neil explained Microsoft managed to trace the issue to a networking glitch that snowballed into a massive service disruption. According to Neil,

“The service interruption was triggered by a misconfigured network device that disrupted traffic to one cluster in our West Europe sub-region. Once a set device limit for external connections was reached, it triggered previously unknown issues in another network device within that cluster, which further complicated network management and recovery.”

Although Microsoft restored the service and knows the networking problem was the trigger, it has not yet determined the root cause. Neil said the company has assigned considerable manpower to analyze the outage and find its source, and that more details would be posted on the official Azure blog later this week.

The cloud has turned into a huge phenomenon in the last couple of years, and services like AWS and Azure are spearheading it. The two cloud behemoths are dominating the IaaS space with user bases that span individual developers and large enterprises. This popularity is obviously good for business, but it’s also a burden. Every outage or misstep is magnified in the public eye.

The 2.5-hour Azure outage affected only a portion of its customers, but that’s still a large number of users – including companies that rely on the service to run their business. Microsoft isn’t the only cloud service with recent problems. AWS also had an outage earlier this month. An electrical issue at the company’s Northern Virginia availability zone took down a number of major services including Netflix and Instagram. Not only did the outage impact AWS’ customers, it impacted their customers’ customers and led to a number of other issues that companies like Netflix had to resolve after the service was restored.

The cloud has enormous potential, but users have to realize that cloud services can go down just like any other IT component. The cloud doesn’t eliminate the need for SLAs, redundancy or sound architectural practices. The cloud may be great, but it’s not magical.
