In a cloud outage, no one can hear you scream

The most recent Azure outage shows Microsoft still needs to improve its responsiveness to customers

Nothing is certain but death and taxes … and cloud outages. Microsoft's own chief reliability strategist, David Bills, has written that cloud service failure is "inevitable," and events this week proved him right -- again -- as Microsoft Azure suffered a widespread outage that disrupted services across the United States, Europe, and parts of Asia. Given that inevitability, why are cloud providers so often poor at keeping customers informed during an outage?

Amazon, Microsoft, and Google are fighting tooth and nail to win cloud business, and their responsiveness to customers during an outage can be a key differentiator. Both Amazon Web Services and Microsoft Azure maintain status pages to keep cloud customers apprised of problems, but those pages are not always kept up to date. Google may be coming on strong in the cloud, but it's still in the testing stage with a status dashboard for Google Cloud Platform.

Customers experiencing service problems on Tuesday turned first to Azure's status page for answers -- then to Twitter to express their outrage that it took 30 minutes for Microsoft to even acknowledge the disruption. Still, that's an improvement over the company's performance two years ago during an Azure Storage outage, when customer dashboards didn't indicate a problem for about 1.5 hours.

After recognizing the Azure outage, Microsoft promised its next update ... in 60 minutes. When experiencing the failure of a critical service, an hour spent without a status update can seem like an eternity. Even worse, as Infosys reported a day before the Azure fail, "over 80 percent of large organizations use or plan to use mission-critical applications on cloud."

An inaccurate status page can be more infuriating than a dearth of information. After an "all clear" message from the Azure team reassured users, "All good! Everything is running great," Microsoft was still investigating problems that continued to affect Azure VMs in Europe. Or as Dirk Zastrow commented on the Azure blog, "the dashboard is useless when it's not showing authentic data."

Commenter Ray Suelzer added, "The fact that I have YET to get an e-mail apologizing for the outage ... and the complete lack of interaction with customers on Twitter is shameful. We understand that [outages] happen, but the human side of this could have been handled much better."

Of course, No. 1 cloud provider Amazon is no stranger to outages -- particularly around the holidays. A power outage at an AWS data center in California brought down services for customers on Memorial Day this year. On Christmas Eve two years ago, Netflix customers snuggling up for a favorite holiday movie had their hopes dashed when the streaming service was brought down by an AWS employee error affecting its U.S.-East region.

While customers can take steps to weather a cloud outage, including configuring systems for resiliency by using multiple availability zones, cloud providers have ample room to improve their communications when handling the "inevitable" outages.

Caroline Craig is East Coast site editor for InfoWorld.