Amazon outage one year later: Are we safer?

Experts warn that many businesses still have not taken measures to insulate themselves from big service provider outages

Amazon Web Services last April suffered what many consider the worst cloud service outage to date -- an event that knocked big-name customers such as Reddit, Foursquare, HootSuite, Quora and others offline, some for as long as four days.

So, a year after AWS's major outage, has the leading Infrastructure-as-a-Service (IaaS) cloud provider made the changes necessary to prevent another meltdown? And if there is a repeat on that scale, are enterprises prepared to cope? The answers are not cut and dried, experts say.

In part, it's difficult to answer these questions because AWS is notoriously tight-lipped about the inner workings of its massive cloud operations, which not only suffered the major outage last April but also a shorter-lived disruption in August. What's more, it's hard to get a read on individual cloud customers' private plans, although industry watchers such as IDC analyst Stephen Hendrick say many enterprises have a long way to go before they are fully insulated from provider shortfalls.

"Some folks had their bases covered, for others, it hit them pretty hard," says IDC analyst Stephen Hendrick, recalling last year's AWS outage. "There are certainly lessons to be learned, the question is whether customers want to do what it really takes to protect themselves."

First, a recap of what happened last year: A few weeks after the outage, AWS released a post-mortem report detailing what caused the disruption and the steps the company took immediately afterward. Basically, human error started the chain reaction of events. In the wee hours of the morning of April 21, 2011, during a network upgrade in the East Coast region of the company's Elastic Block Store (EBS) service -- a storage feature that ties in with the company's Elastic Compute Cloud (EC2) offering -- part of the EBS network traffic was shifted onto a lower-capacity backup network that wasn't prepared to handle the EBS system's load. The EBS nodes attempted to rectify the problem themselves, causing a network traffic jam that soon spilled over into another AWS feature, the Relational Database Service (RDS), a managed database offering. In all, about 13 percent of the EBS volumes in the affected availability zone were impacted by the outage, and after the four-day event, 0.07 percent of those volumes could not be restored, meaning that data was permanently lost.

Experts say AWS has made improvements to its systems since then, but it's unclear just how substantial they are. For example, in the post-mortem report the company says it audited its change process and increased its use of automation tools when making updates to avoid human error. Drue Reeves, a Gartner analyst who tracks the cloud industry and AWS, says the company has bolstered its primary and secondary EBS networks so they can handle higher traffic loads. "It's made EBS more resilient," he says. "They've taken some steps to rectify the situation to make sure this instance doesn't happen again, but that doesn't mean we won't have other outages."

The company says it has taken steps to ensure that problems in one area don't spill over and bring down other services, and Amazon says it has made it easier for customers to build fault-tolerant systems using AWS products. The company's secretive nature regarding the architecture of its cloud operations makes it difficult to assess security vulnerabilities, though, Reeves says.

AWS has promised to be more forthright about future outages. During last April's downtime, the company for hours said only that a "networking event" had occurred, frustrating many customers who wanted information on what had happened and estimates of when services would be back up. A spokesperson for AWS wrote in an email that the post-mortem report details a number of changes the company has made: "This included software fixes as well as new features including EC2 Instance Status Monitoring and EBS Volume Status, which provide customers the information they need in order to understand the full health of their resources running in AWS."

The company has also stressed what customers can do themselves to protect against outages. At the crux of this is the idea of availability zones (AZs). AWS has eight global regions where customers can store data, including the U.S. East Coast region, where this outage was centered. Within each region are availability zones: physically separated, independent infrastructures meant to allow for high availability of data. In what amounted to a polite 'I told you so,' AWS makes clear in its recap of the outage that customers who spread their data across multiple AZs were less affected. Among customers using the Amazon Relational Database Service, 45 percent of those running in a single AZ were impacted by the outage, compared with only 2.5 percent of those using multiple AZs, according to AWS. Since the outage, the company has sponsored a series of whitepapers and webinars advising customers on how to architect multi-AZ systems.
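
In practical terms, opting into multiple AZs is largely a provisioning decision. As a rough illustration only, here is a minimal sketch of enabling Multi-AZ for an RDS database using the boto3 Python SDK (a tool chosen for this sketch, not one named by AWS in its report); the instance identifier, credentials and sizing below are placeholders.

import boto3

# Provision an RDS database with a synchronous standby in a second availability
# zone; RDS fails over to the standby automatically if the primary AZ goes down.
rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="example-db",      # placeholder name
    DBInstanceClass="db.m5.large",          # sizing is illustrative only
    Engine="mysql",
    MasterUsername="admin",
    MasterUserPassword="change-me",         # placeholder; use a secrets manager in practice
    AllocatedStorage=100,                   # in GiB
    MultiAZ=True,                           # the key flag: keep a standby in another AZ
)

The trade-off is that the standby capacity is billed alongside the primary, which is the cost question taken up below.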

Each customer has to do its own risk assessment to determine how much should be invested to ensure high availability of services, Hendrick says. Keeping copies of data across multiple availability zones can increase the cost of using AWS's cloud, he says, perhaps by as much as 50 percent. "I don't know how many enterprises really appreciate what high availability means," he says. A rule of thumb: The more mission-critical the applications and data in the cloud are, the more attention customers should pay to architecting for high availability. For some customers, even a multi-AZ approach isn't enough. "We always store data in multiple zones to avoid [outages]," Jeremy Edberg, senior product developer at Reddit, was quoted as saying last year. "The reason it went down is that it failed in multiple zones."

A strategy that goes beyond the multi-AZ approach within the AWS ecosystem involves storing data across regions, says Reeves, the Gartner analyst. This can be somewhat more complicated: While AWS exposes common APIs across the AZs within a single region, each region has its own endpoints, so separate API calls are needed to work across multiple regions. This is the approach Quora has taken since last year's outage. On the question-and-answer social media site, a Quora engineer notes that the company is using a "cross region database replication" strategy to distribute its content.
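
Whatever Quora's exact setup, the region boundary Reeves describes is easy to see in code. Below is a minimal sketch, again assuming the boto3 Python SDK, of one simple form of cross-region redundancy: copying an EBS snapshot from one region to another through region-specific clients. The region names and snapshot ID are placeholders.

import boto3

SOURCE_REGION = "us-east-1"
TARGET_REGION = "us-west-2"

# Each region has its own API endpoint, so cross-region work means one client per region.
ec2_target = boto3.client("ec2", region_name=TARGET_REGION)

# Copy an EBS snapshot taken in the source region into the target region,
# so the data survives even a failure that spans every AZ in one region.
response = ec2_target.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId="snap-0123456789abcdef0",   # placeholder snapshot ID
    Description="Cross-region copy for disaster recovery",
)

print("Copy created in %s: %s" % (TARGET_REGION, response["SnapshotId"]))

Database-level replication of the kind Quora describes works on the same principle: a replica in a second region talks to that region's endpoints and can keep serving if the first region fails.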

There are nontechnical steps customers can take as well, Reeves says. Customers can negotiate with providers to guarantee availability and to stipulate, in service-level agreements (SLAs), the consequences of missing those guarantees. In the case of last April's outage, any customer using the impacted availability zone received a 10-day credit, regardless of whether it experienced downtime.

Still, there will always be risks and outages, and in a way, Reeves says, that's a good thing: It keeps companies and end users on their toes. "I really believe that outages can propel the cloud further rather than hinder it," he says. "If we learn from these mistakes, both customers and providers, and make our systems safer and more secure, that can be a good thing for the industry as a whole."

Network World staff writer Brandon Butler covers cloud computing and social media. He can be reached at BButler@nww.com and found on Twitter at @BButlerNWW.

This story, "Amazon outage one year later: Are we safer?" was originally published by Network World.

Copyright © 2012 IDG Communications, Inc.