The bigger the project, the better the capacity test has to be. In 2008 a new flight terminal opened at Heathrow airport in the UK. It was named Terminal 5 and cost £4.3bn – projects don’t really get bigger than this. The first days after opening were not perfect. Problems with staff screening and technical glitches led to the cancellation of 34 flights, which made all the national news channels very happy. They reported the inadequate capacity of the baggage handling system, the misery of passengers, and pretty much anything else they could find to complain about.
Can the vast resources of cloud computing help us to more accurately model the capacity required for projects the size of Heathrow Terminal 5? Disruption to 34 flights and a few thousand passengers is a lot of misery, but just a drop in Heathrow’s ocean. You could call that a success. Will enterprises demand better capacity management tools in the cloud?
Checking capacity
I want to get an idea of the capacity required to run my application before it goes live. I certainly don’t want the wings to fall off my in-flight application. I want to blast it in a wind tunnel to see if anything gets blown off. I carry out these three tasks for my capacity test (there’s a rough sketch of the workflow after the list).
- Profile the system. List all the resources available and find out how much load the system is carrying before my application services customers.
- Generate a synthetic load. A generator imitates client usage. I want to imitate an average day, not see how far my system goes before it breaks.
- Monitor the system. Measure the increase in load when my application services the pretend-customers.
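To make that concrete, here is a rough sketch of the three tasks in Python. It assumes the third-party psutil library for the measurements and a made-up URL standing in for my application; the client pacing and thread counts are illustrative, not a recipe.

```python
# A rough sketch of the capacity test workflow: profile, generate load, monitor.
# Assumes the third-party psutil library is installed and that
# http://my-app.example.com/ is a stand-in for the application under test.
import threading
import time
import urllib.request

import psutil


def profile(seconds=10):
    """Tasks 1 and 3: sample CPU, memory and network over a short window."""
    net_before = psutil.net_io_counters().bytes_sent
    cpu = psutil.cpu_percent(interval=seconds)   # average CPU % over the window
    mem = psutil.virtual_memory().percent        # memory in use right now
    net = psutil.net_io_counters().bytes_sent - net_before
    return {"cpu_percent": cpu, "memory_percent": mem, "bytes_sent": net}


def pretend_customer(url, requests_per_customer=20):
    """Task 2: one synthetic client ambling through an average day, not a stress test."""
    for _ in range(requests_per_customer):
        try:
            urllib.request.urlopen(url, timeout=10).read()
        except OSError:
            pass                                 # a failed request is still load
        time.sleep(0.5)                          # pace the client like a real user


if __name__ == "__main__":
    url = "http://my-app.example.com/"           # hypothetical application URL
    baseline = profile()                         # load before the pretend customers arrive

    customers = [threading.Thread(target=pretend_customer, args=(url,)) for _ in range(25)]
    for c in customers:
        c.start()
    under_load = profile()                       # load while they are being serviced
    for c in customers:
        c.join()

    print("baseline   :", baseline)
    print("under load :", under_load)
```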
The tools used to check the capacity are the same ones used for many operational tasks, such as stress testing, performance monitoring, and ongoing capacity management. Some companies create their own test environment with shiny new open source tools like multi-mechanize, Selenium and JMeter. Others rent services from cloud testing providers such as Soasta, Loadstorm and Cloudsleuth. I sometimes use the venerable sysadmin command line tools of lshw, top, tcpdump, df and ab.
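As a taste of the command-line approach, ab can be driven from a short script, something like the sketch below. The URL is hypothetical and the parsing is only as sturdy as ab’s plain-text report, so treat it as a starting point rather than a tool.

```python
# A minimal sketch of wrapping the Apache Bench (ab) command-line tool.
# Assumes ab is installed and http://my-app.example.com/ stands in for the real URL.
import subprocess


def run_ab(url, requests=500, concurrency=10):
    """Fire a fixed number of requests at the URL and pull headline figures from ab's report."""
    report = subprocess.run(
        ["ab", "-n", str(requests), "-c", str(concurrency), url],
        capture_output=True, text=True, check=True,
    ).stdout
    results = {}
    for line in report.splitlines():
        if line.startswith(("Requests per second", "Failed requests")):
            key, _, value = line.partition(":")
            results[key.strip()] = value.strip()
    return results


if __name__ == "__main__":
    print(run_ab("http://my-app.example.com/"))
```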
Checking the whole technology stack
Any cloud application is built on top of layers of technology.
- On top are my tailored business application and supporting off-the-shelf applications.
- The virtual machine and OS support the applications.
- The cloud provider’s OS and hypervisor run on the physical machine.
- The cloud provider’s hardware and network form the bedrock.
The cloud provider’s layers – the hardware, network and hypervisor – are hidden and cannot be measured by the customer. A chain is only as strong as its weakest link, and there are a lot of links in cloud infrastructure: fibre, PDUs, routers, switches, load balancers, proxies and many more.
The AWS console throws in a few graphs of CPU usage, disk reads and writes, and network traffic for free. I would not want to base my ongoing capacity management reports on this information, but it’s fine for starters. AWS provides the CloudWatch service for more sophisticated monitoring.
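For illustration, the same numbers can be pulled out programmatically. The sketch below uses the boto3 library, an invented instance ID and region, and assumes AWS credentials are already configured on the machine running it.

```python
# A sketch of pulling an EC2 instance's CPU figures from CloudWatch with boto3.
# The instance ID and region are invented; AWS credentials are assumed to be set up already.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical instance
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,                      # five-minute buckets
    Statistics=["Average", "Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "% average CPU")
```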
AWS monitoring
The virtual machine layer of the technology stack includes the disk, CPU and memory (that’s “hard disk, processor and RAM” for the old timers). These can be measured by my OS, and the OS will make the measurements available to my system monitoring tools. The same goes for the top layer – the applications.
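On a Linux virtual machine those measurements are sitting there for the taking. A minimal sketch, using nothing beyond /proc and the standard library:

```python
# A small sketch of reading the OS's own measurements on a Linux virtual machine:
# load average and memory from /proc, disk space from the filesystem. No extra tools needed.
import shutil

with open("/proc/loadavg") as f:
    load_1m = float(f.read().split()[0])          # 1-minute load average

meminfo = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, value = line.split(":", 1)
        meminfo[key] = int(value.split()[0])      # values are reported in kB

disk = shutil.disk_usage("/")

print(f"load average (1m): {load_1m}")
print(f"memory free      : {meminfo['MemFree'] / 1024:.0f} MiB of {meminfo['MemTotal'] / 1024:.0f} MiB")
print(f"disk free        : {disk.free / 2**30:.1f} GiB of {disk.total / 2**30:.1f} GiB")
```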
Problems measuring cloud capacity
I have to believe the provider will be able to deliver massive elastic capacity. There are no measurements I can look at because no provider shouts about the details of its infrastructure. My belief will be tested over the next two years: as more cloud providers appear and the rush to the cloud picks up speed, prices will be forced down, the ratio of virtual machines to physical machines will be forced up, and service levels will suffer. I can’t check these lower physical layers. I can’t check most of the components I rely on.
My virtual machine measurements will be less accurate than the cloud provider’s physical machine measurements because of the actions of the hypervisor behind the scenes. The hypervisor chops and changes resources to manage demand. It is the air traffic controller landing many planes on a few runways, the check-in desk queuing up hundreds of passengers and the vehicle carrying inedible meals to the long haul flight.
If my application is idle and someone else’s application is servicing a busy call centre, the physical CPU and network bandwidth are not being used by me. If my disk has empty space and someone else is storing gigabytes of data, the physical disk fills up with their work. The system clock works in virtual time, not real time. The work of the hypervisor can cause latency, inaccurate readings, pauses and other little glitches in my measurements.