I have to test one of my operational principles: “My enterprise service is resilient”. Do I have any SPOFs (Single Points Of Failure) with my new Drupal service? Absolutely. It’s a single machine. It’s one big fat SPOF. I have a definite lack of RAID, BGP, ECC and other obscure acronyms that help protect against bad customer service.

I need to add another EC2 machine, to make a pair. First I explain why, then I add a second VM and load balance the pair.

What is the problem with using one machine?

One EC2 machine is fine for many Internet services. If the machine is used lightly there is a good chance no-one would even notice if it disappeared for a few minutes. However, some services are mission critical and all should be monitored. I cannot offer a competitive uptime SLA if a service, running on a single VM, keeps clocking up the downtime.

When people think of what may make an application fail, the first thing they think of is something breaking. It’s actually more common to switch it off on purpose for maintenance. A computer may be switched off before it is moved or its hardware is replaced. An OS may need an upgrade, then a reboot to clear out the memory (I’m looking at you, Windows). A database server may be stopped so the data is consistent before it is backed up, disabling all the services that rely on it.

I got this message from Amazon:

From: Amazon Web Services <no-reply-aws@amazon.com>

Subject: Amazon EC2 Maintenance – Reboot Required [AWS Account: 123456789012]

Dear Amazon EC2 Customer,

One or more of your Amazon EC2 instances have been scheduled for a reboot in order to receive some patch updates. Most reboots complete within minutes, depending on your instance configuration. The instances that will be rebooted are located in the region(s) listed below.

Region:

========

EU West (Ireland)

Sincerely,

Amazon Web Services

This message was produced and distributed by Amazon Web Services LLC, 410 Terry Avenue North, Seattle, Washington 98109-5210

My service will disappear for an unknown number of minutes while AWS do their maintenance.

I need another VM to take the strain. By creating a second VM to serve customers and placing it on a different physical machine, I double up on network and hardware, and remove the most important SPOFs.
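How much difference does a pair make? Here is a back-of-the-envelope sketch in Python. It assumes the two machines fail independently and that either one can carry the whole service on its own. The 99% figure is invented for illustration and is not a measurement of my service or of EC2.

```python
def pair_availability(single: float) -> float:
    """Availability of two independent machines when either one can serve.

    The pair is only down when both machines are down at the same time.
    """
    return 1 - (1 - single) ** 2


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1 - availability) * 365 * 24


# Assumed, illustrative availability of one VM.
single = 0.99
pair = pair_availability(single)

print(f"one VM: {downtime_hours_per_year(single):.1f} hours down per year")
print(f"pair:   {downtime_hours_per_year(pair):.1f} hours down per year")
```

The independence assumption is the weak point. Two VMs that share a physical host, a network link or a maintenance window do not fail independently, which is why the second machine has to live somewhere else.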

Cloning AMIs

To create my second VM, I could build a new one from scratch: I could launch a new Amazon basic AMI and repeat the Drupal install procedure. That would take a couple of hours. Or I could copy my original VM, which would take a couple of minutes. It’s a no-brainer.

I can’t copy my machine while it is running. A running virtual machine is a complex beast. A virtual machine like a database server is always changing. Information flies back and forth – being added to memory, standing in a disk write queue, being moved from one buffer to another and so on. Copying a running VM can lead to broken data and a useless copy. The safe thing to do is copy the AMI instead. The AWS AMI – the template copied to make VMs – is a collection of files which are trivial to copy.

  1. Turn the VM off,
  2. Copy the AMI files, then
  3. Turn the VM back on again.

My Internet service is not available to clients for a couple of minutes while this happens.
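The three steps above can be sketched as a short Python function. This is a sketch under assumptions, not a record of what I actually ran. It assumes the `ec2` argument behaves like a boto3 EC2 client and that the copy is made with that client's `create_image` call. The function name, the `DryRunEC2` stand-in, and the instance, image and AMI identifiers are all made up for illustration. The stand-in only records calls, so the example runs without AWS credentials.

```python
def clone_stopped_instance(ec2, instance_id: str, image_name: str) -> str:
    """Stop the VM, make an image of it, then start the VM again."""
    # 1. Turn the VM off and wait until it has really stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    try:
        # 2. Ask for the image and wait until it is usable.
        image_id = ec2.create_image(InstanceId=instance_id, Name=image_name)["ImageId"]
        ec2.get_waiter("image_available").wait(ImageIds=[image_id])
    finally:
        # 3. Turn the VM back on, even if the copy failed.
        ec2.start_instances(InstanceIds=[instance_id])
    return image_id


class DryRunEC2:
    """Hypothetical stand-in for a real client: it records calls and does nothing."""

    def __init__(self):
        self.calls = []

    def stop_instances(self, InstanceIds):
        self.calls.append("stop")

    def start_instances(self, InstanceIds):
        self.calls.append("start")

    def create_image(self, InstanceId, Name):
        self.calls.append("create_image")
        return {"ImageId": "ami-00000000"}  # made-up identifier

    def get_waiter(self, name):
        self.calls.append("wait:" + name)
        return self  # the stand-in doubles as its own waiter

    def wait(self, **kwargs):
        pass  # nothing to wait for in a dry run


fake = DryRunEC2()
new_ami = clone_stopped_instance(fake, "i-00000000", "drupal-copy")
print(new_ami, fake.calls)
```

Against a real account the client would come from something like `boto3.client("ec2", region_name="eu-west-1")`. The `finally` clause is there because a failed copy should never leave the service switched off.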

I use the AWS management console to request a copy of my original AMI. The console passes my request to a web API and the API talks to an AWS application behind the scenes, which takes care of the heavy lifting.

It’s easy to use the console for a few manual tweaks here and there. If I were dealing with industrial quantities of AWS services this would become an impossible task, and I would instead use helper programs to talk straight to the API. Facebook, Wikipedia and iCloud technical managers do not employ armies of people to control their vast quantities of VMs.
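A helper program of that kind can be very small. The sketch below counts VMs by state, something that gets tedious in the console once there are more than a handful. It assumes `ec2` behaves like a boto3 EC2 client with its `describe_instances` paginator. The function name and the `CannedEC2` stand-in, with its invented data, are hypothetical and exist only so the example runs without an AWS account.

```python
from collections import Counter


def count_instances_by_state(ec2) -> Counter:
    """Walk every page of describe_instances and tally the instance states."""
    counts = Counter()
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                counts[instance["State"]["Name"]] += 1
    return counts


class CannedEC2:
    """Hypothetical stand-in that returns two pages of made-up results."""

    def get_paginator(self, name):
        return self  # the stand-in doubles as its own paginator

    def paginate(self):
        yield {"Reservations": [{"Instances": [
            {"State": {"Name": "running"}},
            {"State": {"Name": "stopped"}},
        ]}]}
        yield {"Reservations": [{"Instances": [
            {"State": {"Name": "running"}},
        ]}]}


counts = count_instances_by_state(CannedEC2())
for state, number in sorted(counts.items()):
    print(state, number)
```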