Learning from the cloud's failings

Last week's Windows Azure Compute outage underlines customers' blind faith in infallible cloud solutions

This past week saw another high-profile cloud infrastructure failure as Microsoft's Azure Compute service experienced an extended service outage. As in previous cases, such as last April's Amazon EC2 outage, this event caused another round of hand-wringing about the future of the cloud and the wisdom of using it.

Each time something like this happens, however, I'm nonplussed that anyone is actually surprised.

Yes, megascale cloud implementations such as those fielded by Amazon, Microsoft, and Rackspace are designed with many layers of overlapping redundancy to prevent this sort of problem, but why is anyone shocked that some combination of unforeseen circumstances could lead to widespread failure? Far be it from me to defend Microsoft here (I mean, leap year -- really?), but you cannot design a perfect, infallible system. It simply doesn't exist.
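For readers wondering how a leap day can trip up otherwise careful code, here is a generic sketch of the sort of date arithmetic that breaks on February 29. It is an illustration only, not a reconstruction of Microsoft's code; the function names are hypothetical, and the fallback to February 28 is just one possible policy.

```python
from datetime import date

def naive_one_year_later(d: date) -> date:
    # Naive approach: bump the year and keep the same month and day.
    # Raises ValueError when d is Feb 29 and the next year is not a leap year.
    return d.replace(year=d.year + 1)

def safe_one_year_later(d: date) -> date:
    # Hypothetical fix: fall back to Feb 28 when the target date does not exist.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)

leap_day = date(2012, 2, 29)

try:
    naive_one_year_later(leap_day)
except ValueError as err:
    print("naive version failed:", err)

print("safe version:", safe_one_year_later(leap_day))
```

The naive version works on 1,460 days out of every 1,461, which is exactly why this class of bug tends to survive testing.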

Much of life in IT revolves around planning for the things that we build to fail -- and how we'll cope when they do. Working with the cloud is no different. Just because someone else is investing millions of dollars in creating an ultraredundant and hyperscalable infrastructure doesn't free you from having to construct your own game plan for keeping your services available when -- not if -- there's a failure.

What's missing, though, is a reasonable and objective measure of cloud computing provider availability. All we can do now is search through the news archives to see when major failures have occurred. We don't really have a comprehensive, third-party benchmarking and grading scheme to help prospective customers evaluate which cloud service platforms have been the most reliable. The meaningless SLAs offered by cloud providers aren't going to help you either.

But guess what? We have a third-party benchmark for cloud storage providers. Not long after the Azure failure started, Andres Rodriguez, CEO of cloud storage appliance vendor Nasuni, pointed out that the Microsoft failure had not affected Azure's cloud storage platform -- it remained up and untouched throughout the compute service outage.

As a company that sells a product that integrates with many CSPs (cloud storage providers), Nasuni is a large customer of just about every major CSP in existence. Since 2009, Nasuni has been performing near-constant monitoring of all the CSPs it integrates with or has considered integrating with, which has culminated in the publication of a report of its findings (no registration required).

Although Nasuni shares only the results of its testing, not the methodology or the raw data, the information in the benchmarking report is helpful to anyone considering a cloud storage platform. The most important takeaway: Even the most reliable CSP that Nasuni monitored, Amazon's S3 platform, registered an average of 1.4 outages per month. Granted, they were quite short, short enough that the overall statistical uptime still rounds to 100 percent -- but they were outages nonetheless.
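A bit of back-of-the-envelope arithmetic shows how regular outages and a 100 percent uptime figure can coexist. The 1.4 outages per month comes from the report as cited above; the two-minute duration per outage and the 30-day month are my own assumptions for illustration, since Nasuni doesn't publish the raw data.

```python
def uptime_percent(outages_per_month: float,
                   minutes_per_outage: float,
                   days_in_month: int = 30) -> float:
    """Return monthly uptime as a percentage, given average outage count and length."""
    total_minutes = days_in_month * 24 * 60
    downtime_minutes = outages_per_month * minutes_per_outage
    return 100.0 * (1 - downtime_minutes / total_minutes)

# 1.4 outages/month is the reported figure; 2.0 minutes each is an assumed value.
pct = uptime_percent(outages_per_month=1.4, minutes_per_outage=2.0)
print(f"{pct:.4f}% exact, {pct:.0f}% rounded")
```

Under those assumptions the service is down for less than three minutes a month, which disappears entirely once the figure is rounded to a whole percentage. The rounded number tells you nothing about whether one of those minutes landed in the middle of your nightly backup.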

I'd love to see this kind of benchmarking done for the performance and availability of all major cloud service providers, both storage and compute. But the bottom line is this: Don't be surprised when cloud outages occur -- and have your own plan to deal with them, just as you would for an on-premises solution.

This article, "Learning from the cloud's failings," originally appeared at InfoWorld.com in Matt Prigge's Information Overload blog.

Copyright © 2012 IDG Communications, Inc.