Enterprise Software

What the recent Amazon Web Services outage means for our own cloud journey

What does the recent Amazon Web Services outage mean to the average administrator considering cloud technology? IT pro Rick Vanover thinks it means a lot and shares his thoughts in this blog.

To say that the recent issues with certain Amazon Web Services (AWS) cloud-based services were difficult to miss would be an understatement. Starting April 21, a number of AWS services experienced a series of service interruptions and performance issues. The most obvious indicator was that a number of social media services, which leverage cloud technology, were interrupted.

Before we jump to any conclusions, let’s zero in on what happened. Early on April 21, status reports started showing up on the AWS status console indicating issues with a number of services. The affected services were Amazon CloudWatch (N. Virginia), Amazon Elastic Compute Cloud (EC2) (N. Virginia), Amazon Elastic MapReduce (EMR) (N. Virginia), Amazon Relational Database Service (RDS) (N. Virginia), AWS CloudFormation (N. Virginia), and AWS Elastic Beanstalk. As of the time I am writing this blog (late Sunday night, April 24), all services are back online except for a "limited number of customers," and each service is back to a green status with only a few notes.

The hidden issue, however, lies with Elastic Block Store (EBS) volumes. Many of the status updates for a number of the services mention EBS volumes, yet EBS itself doesn’t have its own entry on the status page. The primary use case for EBS volumes is to be provisioned directly to EC2 instances as block disk resources; the EC2 instances are effectively virtual machines hosted on the Internet by AWS.
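
To make that coupling concrete, here is a minimal sketch using boto3, the current Python SDK for AWS (which post-dates this article); the region, volume size, instance ID, and device name are illustrative assumptions. It simply provisions an EBS volume in one availability zone and attaches it to an EC2 instance as a block device, which is exactly the dependency that let the EBS trouble ripple into EC2-backed services.

import boto3

# EBS volumes live in a single availability zone and attach to EC2 instances
# in that zone as block devices.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Provision a 100 GiB volume in one availability zone (illustrative values).
volume = ec2.create_volume(Size=100, AvailabilityZone="us-east-1a")

# Wait until the volume is available, then attach it to a running instance
# (hypothetical instance ID) so the guest sees it as /dev/sdf.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)

When the EBS control plane in a zone misbehaves, every instance holding volumes attached this way feels it; that is what the status updates were hinting at.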

This compound series of service interruptions tells us a few things. First of all, failures happen to both big and small datacenters, because in the end the AWS services run in datacenters. Chances are, they are nothing like the datacenters you and I have worked in, but they are datacenters nonetheless. The second thing this tells us is that if we design a service for the cloud, we need to be ready to accommodate an outage. That should sound eerily familiar; it is what we have always done in the datacenter: architect around domains of failure.

Is the lack of a true cloud standard the issue?

Not necessarily. Federated clouds sound good, but in practice they are distinct point solutions, and I rarely see real-world use cases that leverage two public clouds for one solution. A more realistic approach would be to leverage the same public cloud, such as AWS, for multiple independent cloud infrastructures using regions (discussed in a bit). The fact is that AWS is still the most refined offering of public cloud services, and it is successful in spite of this incident. Further, I think it will continue to be the most refined offering and will recover from this incident.

The good news for Amazon is that all of the affected services, with the exception of Elastic Beanstalk, are available in other regions such as Northern California, Ireland, Singapore and Tokyo. Further, within each region there are specific availability zones; the Northern Virginia AWS cloud, for example, has four availability zones for EC2. The fact that AWS is distributed is probably the best thing it has going for it. For this specific incident, a number of availability zones were impacted in the Northern Virginia region, also referred to as US-EAST-1 in the status reports. This means that if a cloud solution was split across regions and did not require Elastic Beanstalk, it might not have been impacted. Cluster a la cloud, if you will.

If we are to architect cloud solutions around multiple domains of failure, then the best approach would be to leverage two AWS regions. This sounds easy in theory, but in reality it may be quite complicated. First of all, pricing differs for each region. Secondly, any transfer to another region incurs a bandwidth cost, while transfers within a region are free. So transferring data from US-EAST-1A to US-EAST-1D costs nothing, but transferring that same data from Northern Virginia to Northern California would incur a transfer charge. Keeping in mind that the data and systems in the cloud are ultimately our own, we need to take it upon ourselves to plan for these types of failures if we don’t want to endure an outage.
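
As a rough illustration of that cross-region approach, the sketch below (again boto3, with placeholder AMI IDs, since machine images are region-specific) launches the same workload in two regions. The launches are the easy part; keeping the data in sync across regions, and paying the cross-region bandwidth cost described above, remains the architect's problem.

import boto3

# Placeholder AMI IDs; every region holds its own copy of a machine image.
deployments = {
    "us-east-1": "ami-11111111",  # Northern Virginia
    "us-west-1": "ami-22222222",  # Northern California
}

# Launch one instance of the workload in each region so that an outage in one
# region leaves the other copy running.
for region, ami_id in deployments.items():
    ec2 = boto3.client("ec2", region_name=region)
    ec2.run_instances(
        ImageId=ami_id,
        InstanceType="t2.micro",  # illustrative instance type
        MinCount=1,
        MaxCount=1,
    )
    # Data replication between the two regions is not automatic; it has to be
    # built into the application, and cross-region transfers are billed while
    # intra-region transfers are free.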

For the naysayers: You told us so, right?

Surely there are bloggers and opinionated individuals enjoying the incident and shouting, “I told you so!” The fact is, if we don’t architect for domains of failure properly in our own datacenter, how are we going to do it in the public cloud?

What we have learned from this incident is that failures happen; how we change our behavior is a measure of how well we learn from our mistakes, even if we weren’t impacted by this incident. What do you make of the AWS incident? Share your comments below.

About

Rick Vanover is a software strategy specialist for Veeam Software, based in Columbus, Ohio. Rick has years of IT experience and focuses on virtualization, Windows-based server administration, and system hardware.

31 comments
tgp5

There is no question that Amazon's downtime and slow response to questions has given the entire cloud industry a bit of a black eye. However, the movement to private/hybrid/public cloud computing environments is not being driven by IT; it's being driven by CFOs. The economics are just too attractive. Enterprises will want to cost optimize their cloud infrastructures with next generation chargeback, organizational mapping down to the end user, granular views of workloads and business analytics of their entire infrastructure. It's for those reasons that startups like Cloud Cruiser have been funded by the investment community. There's no going back.

gechurch

I've just read through Amazon's post-mortem analysis of the issue (http://aws.amazon.com/message/65648/). It makes for a very interesting read. Basically human error caused some network problems, and code to recover from this tried aggressively to replicate data elsewhere for redundancy, and this caused massive contention on the already troubled network. Then there were more code problems found. The code was fine under normal operating parameters, but when the network and replication problems happened the code was making dumb decisions. I feel rather bad for the Amazon techs and developers. What they are dealing with is necessarily very complicated beta code. It's got to be impossible to test these sorts of conditions preventatively. The options are to duplicate their entire data centre and test with that (far too costly), or to do planned testing on their production equipment (which would have failed in the same way). The fact is automating massively redundant data centers like this is a fairly new endeavour. No human can possibly understand all the implications of this type of code, or imagine all the ways in which equipment and code might interact to cause these sorts of problems. All they can do is try to be incredibly cautious and redundant in their code, but in this case that cautiousness and redundancy added to the problems in unforeseen ways. There doesn't yet exist a human that can write complex code perfectly the first time. Until such a human (or robot) exists, the best we can do is continue to be cautious, continue to test, and fix bugs as they are found. The most pertinent quote to my mind: "We now understand the amount of capacity needed for large recovery events...". The reality is they're still learning. This is the reason why, as an IT worker in a small company, the cloud doesn't make sense. The cloud is an intricate, complex environment. Why would we risk exposing ourselves to downtime due to this complexity, when it has nothing at all to do with our business needs?

SaintGeorge

I haven't been paying special attention to clouds. At the moment, they won't add much value to our operation, while they would add costs. "Architecting" around - I liked that verb - implies maintaining an infrastructure able to quickly take over after a cloud service crash. So, it means mirroring in real time the services and information we'd have in (on?) the cloud, maintaining fast access lanes, and retaining the people with the know-how to operate them - all of this, just in case. Plus the procedures to switch over, which I seriously doubt means just pressing a button. What am I really saving, then? I'm just adding a layer of knowledge/difficulty/cost to our current setup. Clouds, as you point out, are just pumped-up data centers. There is neither new technology nor new procedures involved. It's just more memory, processing power, broadband and power lines. The pitfalls may be farther apart, but they are immense. Of course one of my servers could crash. But if my cloud provider does, or I lose access to it, it's as if my WHOLE datacenter crashed. And what is different from, say, one or two years ago? Nothing. Just a business solution turned viral. So far, it's only marketing and a lot of promises (OK, I know, those two are the same). And of course, I've been around computers for 30 years now, and I have seen lots of shiny pretty things come and go, or come and stay, somewhat duller and plainer but far more trustworthy, after refinement. I have no doubt clouds are the future. There will come a time when clouds will be as ubiquitous and dependable as electricity or running water. But right now they are just growing up and I don't think I should be the one financing their education and rite of passage. Give me a cloud provider who will pick up the tab for my downtime if or when he fails, and I'll join the happy crowd. Till then, I'll lean back and wait.

By the way, do you remember outsourcing? I have a full shelf of books about how the final solution had arrived. I've consulted for several companies who transitioned blindly (against my advice, which was the usual: baby steps) and then... Then I have one book whose title says it all: Insourcing after Outsourcing. There is nothing new under the sun. People will always rush to the latest fad. We must have a common gene with lemmings. Caution! is the operative word. Audacity might pay once in a while, but more often than not it is the entrance to the medium and small business cemetery. Pause. Look around. Wait and see. Then adopt. Till then, you will more likely just be paying bills and mortgages for vendors and gung-ho consultants while your business sinks to the notes of Taps.

On a more personal note, I have also had my share of commentators looking down upon "naysayers" and "opinionated individuals", which are just general terms to include anyone who disagrees with their usually narrow-minded view of the world. Rick, I don't share your opinion. That doesn't mean you are stupid or dumb or just probably trying to use this blog to coerce your clients into buying your solutions. And even if I thought any of this was the case, I wouldn't try to use it to beat down your argument. Because when you try to get the upper hand in an argument without arguments, but through insult, well, you become an asshole yourself regardless of the merits of your idea.

To wrap it up, I see clouds in my future. But for now, it's just foggy weather. Jorge. Buenos Aires, Argentina.

HAL 9000

Seeing what the outcome of the PS3 crack has caused, with an already admitted 70M users affected by their data being accessed. Got to love Sony here advising their users to change their passwords. :D Col

b4real

I'm still waiting for their official response. I'm really curious about the story!

jhinkle

I'm still a little lost on what cloud computing is, as far as I can tell it's no different than the years I spent managing 3rd party servers in a colo rack except you're not actually paying for the server. You're just paying someone else to have a virtual machine and/or services to access over the internet on their equipment. One of the biggest problems we ran into with this kind of work is that the customer is focused on their server and services, while you as the group housing the equipment are only concerned with their equipment having power and good up time. When a problem actually occurs it creates a lot of overhead and billable hours to understand what the customer is actually doing and what needs to be done to fix it. In a lot of cases even your more tech savvy customers (and I'm assuming that cloud computing will draw more IT people) don't really know what's going on with their data or how it's actually managed and won't look at it until something goes wrong. Under no circumstance should you expect anyone other than yourself and your IT staff to care about what your data is and what it does, the focus of these groups will be solely on managing virtual machines and services, everything else is user responsibility. There are probably some instances where having your data colo'd somewhere else will save money and time but I believe that ultimately you as the Admin of your network and equipment are responsible for your data. You should manage your own equipment w/service contracts, make regular backups and verify they're working on a scheduled basis, and regularly review your equipment and cycle it out when it reaches a pre-determined end of life period (where possible, budgets are always tight). No matter how much money and rack space you may save by colocating your data and equipment you are handing over control of your responsibilities to someone else and creating more overhead for problem resolution and systems integration.

Derteufel

leverage, leverage, leverage?

VBJackson

I hear a lot of people pointing the blame at one group or another, but in my (not so) humble opinion this should not be a surprise to anyone. The problem isn't just that cloud services have several "single points of failure". Even if you discount or have contingency plans to cover problems like loss of access (ISP issues, internet connectivity failure), datacenter issues (like this), and loss of control over both your applications AND your data, you will still have other problems to contend with. The IDEA of loosely-coupled, resilient service-oriented architectures is great, but like any other architecture, it has its own set of determinant constraints. To be resilient, EVERY service in the solution has to be redundant. This means that both the solution architect and the provider have to be aware of all the interactions between components, storage and data providers, and the physical/virtual machines and architecture that are running them. Not only are many providers unable to spend the time and resources needed to provide this level of comprehension, they certainly don't have the people needed to do this for all of their customers. In fact, it is my experience that a large number of architects that are trying to design larger scale SOA infrastructures are running into problems in this area themselves. So my conclusion is that in turning from an internal datacenter to a cloud provider, what you are actually doing is trading a dedicated staff that can be trained on the details of supporting your infrastructure and application suite, and a set of servers that can be designed and architected for the levels of service and of redundancy required, for a staff that only knows about the infrastructure of the datacenter as a whole, and that you HOPE is trained on the details of the virtualization components that compose their cloud offering.

PalKerekfy

In theory, the larger a system is, the more reliable it is. In theory??? They can afford more redundancy, better equipment, and more skilled people. But then, why is it not the case? Is it price pressure? Are they too detached from their customers' business, and don't feel the pain? Is it too much structure and too long reporting lines in the large service provider organisations? Separation of duties and responsibilities, too many internal interfaces, and then "It is not my job, it is their responsibility"?

kentravis

Mark my words, the so-called "cloud" will have many more failures. I am not going to trust "them" any more than banks, oil companies, politicians, or anyone else that stands to make a buck without oversight and transparency, no matter what "they" tell me.

Trs16b

Nice idea in an imperfect world. The cloud will be adopted by many, but not by CIOs who understand that reliability is the #1 goal of any IT department. When the data stops flowing, companies grind to a halt. That costs money. BIG MONEY. I've personally seen a CIO have to try and explain to a CEO that he has no idea why something has crashed and no one can work. Nor did he have any idea when things would be fixed, because they couldn't get to anyone at the vendor who had a clue. The people with a clue were scrambling to figure out what was going on. Cloud services will save you money in the short term. Is it worth it?

oldbaritone

That's one of "the cloud's" big selling points - let the professionals in high-reliability, high-availability systems provide you the same high-reliability data center that you couldn't otherwise afford. Oops, AWS. BIG OOPS! Any business that was knocked down by this outage should hear the wake-up call. Hello? And it doesn't need to be the source provider, either. An ISP outage will be just as devastating to a business as the AWS outage was. If you can't get to your data, ruh-roh. You're out of business. Any system with the potential for a single-point failure giving such catastrophic results should be built with fallback and backup plans. There should be "what if" scenarios in place to make conscious management decisions about what the enterprise will do during an outage. Critical data centers have backup power generation. They also have enough UPS to run them on batteries until the generators come online. There is also enough surplus UPS power to run them for a while if the generator doesn't start the first time. There are multiple routes to the same location. Two of my customers have a private ring network that can be traversed in either direction, with internet/VPN backup access. They stay up, running a little slower each time, through three consecutive failures. Segments are taken down routinely for PM, and the affected sites switch to "Long route" instead of "Short route" (which way around the ring) automatically. Once a month they test the VPN for two hours. Without reliability, they're out of business. Reliability takes planning, and costs money. But being out of business costs a LOT of money. Customers don't pay for downtime, and they expect reimbursement for their excess costs. In short, as usual: Don't believe all of the sales hype. When you get down to brass tacks, your business is your business. Nobody cares more about it than you do. Make contingency plans for smooth transitions in case of critical failure. That's just good business.

russellsutherland

No way! They will never be as stable or as secure as housing the data yourself. When your data leaves your network you no longer have control of it. Google has been hit. The credit companies have been hit. Why put your data on such a large target?

russellsutherland

Cloud computing is still in its infancy. Until the SaaS providers can design, build and sell truly redundant data centers that fail over automatically, we will continue to have these issues.

TheShawnThomas

How much of the Amazon cloud is used for other purposes as well that are no longer available when they go down? Like the PlayStation Network, which has gone down for the last several days. Besides no online games, you also can't access other content that requires the PSN connection, like Netflix, Hulu Plus, Sony Music Unlimited, etc. Seems like too many eggs in one basket.

pgit

If you're going to offer those types of services you'd think they'd have at least two other physical plants, separated by a couple thousand miles. When the equation is cost versus the potential for inaccessible data, perhaps permanently so, which way would you go? I'd make absolutely sure I had overkill hardware redundancy and a failover setup with one of the backup clouds constantly synced and another semi-isolated cloud that syncs on a schedule. (not constant parallelism) It would monitor the status of the main system (a la nagios) and not sync if there's anything suspicious. (and alert admins, too) With this setup a glitch in any back end isn't going to take down all your mirrors with it.

tr

How could a major cloud breakdown affect legal issues? In Sweden, some types of information (like personal information) need to comply with the Personal Data Act and are not allowed to be managed outside Europe. Could a major breakdown in, say, Ireland, cause an emergency failover to the US or Asia with no regard to legal demands? If "all" hosted services are on the line, could that be a realistic scenario?

PalKerekfy

Yes, Jorge is right about caution. Small steps in outsourcing, this is what we have been doing for a few years, and this is our approach to the cloud as well. By the way, "cloud" is not very different from the large internal systems - just the scale and the ownership are different. If we start using "cloud" for any kind of critical systems, it should be at least as secure, safe and redundant as our current systems. Can they make it cheaper, safer, etc.? If they can, I seriously consider it.

b4real

As a big overhead factor to truly enable cloud computing.

b4real

The cloud that is. We should leverage public clouds, and design for failure.

gechurch

These are all good questions, and they hit at the heart of my problem with the cloud. In practice all the factors you have mentioned could be at play, and could stop the reliability and uptime that should be seen from being reached. And the real problem - as an end-user, how do you know which cloud providers really have their act together, and which ones have in-fighting and red tape? The answer - you don't. You've also hit on my biggest concern with the cloud - what happens when it does go down? I've got no doubt that Google and Amazon and the like can maintain servers far better than I can, and can achieve higher up-time. But if we have an outage, I know I will get it fixed as soon as possible. I will work through the night if necessary because, as you say, I feel the pain. When there's an outage at a cloud provider, everyone wants their services restored immediately. Why would Amazon prioritise getting my stuff working again over any of their other thousands of customers? They wouldn't. To my mind, if it's my responsibility to keep data available at my company, I won't consider outsourcing the job to a cloud provider. I get the blame when things go down, and if I can't do something about it it's my ass on the line. Saying "it's not my fault - it's Amazon's" won't mean much.

b4real

Would you still be a hater?

billj

Everything needs multiple backups, because of possible failures. Our local servers located physically in our home office were attacked. All local data was lost, but we were able to continue business from a cloud server. Hearing people talk about keeping servers locally reminds me of keeping cash in the house. It's less safe to keep valuables locally. Besides, we are forced to grow and use this technology to stay competitive. Still, I will verify the backup of my cloud computing.

b4real

It makes sense for delivery - but we are the owners of our data, code and other intellectual property. We have to be stewards of the data.

b4real

If we need to depend on it to that level, we should incorporate that into the design of what we put into the cloud.

b4real

Good luck getting the answer to that. I know HootSuite and FourSquare were impacted as well. I'd love to see the story of any services that have accommodated for a region failure and survived the issue seamlessly.

b4real

But, I think we need to plan on a region failure.

b4real

As the US is the only country where two zones exist. That would be an issue for some GRC scenarios.

gechurch

Can't we do that already? We can put multiple Internet links in our offices, presumably we can have local cache servers, and as mentioned in the article we can purchase redundant cloud services. To my mind the pertinent question isn't "Can we design around cloud failures?", it's "By the time we've built around these potential failures, is moving to the cloud still an attractive prospect?".

gechurch

When having these discussions I always think of "the cloud" as "a public cloud". The concept of a private cloud is a bit vague in my mind... what qualifies? A couple of VMs with vMotion set up? I'm not sure what the consensus is on "what is a cloud", but I'd think rapid provisioning, automatic failover, a large number of servers, and redundancy to protect against ALL single points of failure would be in many people's minds. The rapid provisioning and automatic failover are the trickier aspects (redundancy isn't necessarily hard, it just tends to be costly). These trickier aspects are where the complexity comes in, and that's what caused the massive problems at Amazon. Is this really better than the mainframe days? In most aspects it is, except for the severity of the impact when something does go wrong. I'm sure the larger companies like Amazon will learn from these problems and create a more mature product. The secret is not to be a beta-tester for them.

HAL 9000

On who owns the CLOUD I suppose. After all, if it's an Internal Company Cloud, other than remote offices there isn't much of a problem. However, when Business starts to rely on Public Clouds everything changes for the worse. ;) Under those conditions, other than the name, I don't see what is different between the Old Main Frames of yesteryear and the Cloud of today, other than the terminals are not quite as dumb as they used to be. Col