I built a Drupal installation on Amazon EC2, and now I want to find out how scalable my new Drupal service is. To do this, I am going to run my first set of load tests (AKA stress tests), to push my service to the limit.
When more requests arrive at my website, how does it behave? Does it handle the increased load without affecting the performance? Does it remain reliable?
Right now, I don't know. I cannot offer a fully operational cloud service without knowing how it scales, so I need to kick off a few tests.
I use a few free tools to torture the homepage of my new customer service, rather than using a commercial service like Loadstorm. I put my service under increasingly heavy loads to see what happens and measure the results.
Later, I will need to fix the problems I find.
This service is born to lose
One small Amazon EC2 machine is weak. A small physical machine may handle a reasonable web workload, but a small EC2 machine won't. These tests illustrate why you need to be able to scale up (use a bigger machine) and scale out (use many machines).
I know my service will perform poorly. It's so poor it can't even afford to pay attention. This is how I ensure awful service:
- I use a small EC2 machine type. Amazon VMs come in different sizes, from micro to massive. A small VM has few resources, which makes it a tight fit for a web service.
- I only use one EC2 machine. I have two identical VMs to share the work, but I run the tests on one machine only.
- MySQL has not been tuned for Drupal. I've made no changes to buffer sizes, hunted for no slow queries, and run no engine checks.
- I don't use a cache. Caching with an application like Memcached or Varnish is a popular way of increasing speed. Just turning on the Drupal cache makes responses many times faster.
My load testing toolkit
I generate and monitor the extra load using a few applications. These are all command-line tools, so they are not very intuitive.
- top (top process statistics). This is a process monitor that shows me what is happening to the system.
- vmstat (virtual memory statistics). This is like top, but instead of listing processes it reports system-wide memory, swap, I/O and CPU activity, printing one line of figures per sampling interval.
- ab (Apache HTTP server benchmarking tool). This is a web site load generator. I use ab to give my service an increasingly hard time.
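The idea of giving the service an increasingly hard time with ab can be sketched as a small Python script that builds one ab command per load level. This is only a sketch under assumptions: the target URL, the concurrency levels and the 20 requests per client are hypothetical values, not figures from my tests. The only ab options used are -n (total requests) and -c (concurrent requests). The script prints the commands rather than running them.

```python
import shlex

# Hypothetical target; replace with the homepage under test.
TARGET_URL = "http://example.com/"


def ab_command(url, total_requests, concurrency):
    """Build an ab command line: -n is total requests, -c is concurrent requests."""
    if concurrency > total_requests:
        # ab itself refuses a concurrency level above the request count.
        raise ValueError("concurrency cannot exceed the total number of requests")
    return ["ab", "-n", str(total_requests), "-c", str(concurrency), url]


def ramp(url, levels, requests_per_client=20):
    """Return one ab command per concurrency level, lightest load first."""
    return [
        ab_command(url, level * requests_per_client, level)
        for level in sorted(levels)
    ]


# Dry run: print the commands instead of executing them.
for command in ramp(TARGET_URL, [1, 5, 10, 20]):
    print(shlex.join(command))
```

To run the tests for real, each command list could be passed to subprocess.run(command, check=True) on a machine that has ab installed, ideally not the machine being tested.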
top (top process statistics)
The top command is one of the ten most useful Linux commands. Top displays information about processes and what they are doing to the system.
I want to get a baseline by using top while the system is idle, before I run the load tests.
top - 17:12:40 up 15 days,  3:33,  2 users,  load average: 0.46, 0.69, 0.34
Tasks:  71 total,   1 running,  70 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1737564k total,  1117352k used,   620212k free,   168248k buffers
Swap:  3020212k total,        0k used,  3020212k free,   518672k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      20   0  8356  800  672 S  0.0  0.0   0:12.68 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:01.08 ksoftirqd/0
Understanding the numbers is difficult. The top command fills up my command line interface with a lot of data packed into two halves.
- The top half - about six rows - is a dense display of information about the state of the system, such as how busy it's been, memory used and uptime.
- The bottom half is a list of the top processes. They are ordered by how much CPU they use, with the biggest CPU hog first.
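To make the dense top half a little less cryptic, here is a minimal Python sketch that pulls the CPU line from the sample display apart and labels each figure. It assumes the older "0.0%us" layout shown in this article. Newer versions of top print the line differently, so the pattern would need adjusting. The function name parse_cpu_line is my own invention for illustration.

```python
import re

# Summary line copied from the sample top display in this article.
CPU_LINE = "Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st"

# Meaning of each two-letter field in top's CPU summary.
FIELD_NAMES = {
    "us": "user processes",
    "sy": "kernel (system)",
    "ni": "niced user processes",
    "id": "idle",
    "wa": "waiting for I/O",
    "hi": "hardware interrupts",
    "si": "software interrupts",
    "st": "stolen by the hypervisor",
}


def parse_cpu_line(line):
    """Return {field: percent} from a 'Cpu(s):' line in the older top format."""
    pairs = re.findall(r"(\d+(?:\.\d+)?)%(\w+)", line)
    return {name: float(value) for value, name in pairs}


cpu = parse_cpu_line(CPU_LINE)
for field, percent in cpu.items():
    print(f"{percent:6.1f}%  {FIELD_NAMES.get(field, field)}")
```

The id and wa figures are the ones I expect to move most during a load test: idle time falls as the CPU gets busy, and I/O wait climbs if the disk becomes the bottleneck.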
The procedure for using top is pretty straightforward.
- Open a CLI.
- Run the top command. A display like the one above appears.
- Watch the numbers. Every few seconds a few of the numbers change.
- When you've had enough, type the letter q to quit. The command prompt appears.
- Close the CLI.
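The interactive procedure above is fine for watching, but for keeping a record it helps to capture a snapshot without a human at the keyboard. The sketch below assumes the procps version of top, where -b selects batch mode and -n 1 stops after one screen. The helper names top_snapshot and split_halves are hypothetical, and the short sample text is a trimmed copy of the display in this article so the script runs even where top is missing.

```python
import subprocess


def top_snapshot():
    """Run top once in batch mode (-b) for one iteration (-n 1) and return its text."""
    result = subprocess.run(
        ["top", "-b", "-n", "1"], capture_output=True, text=True, check=True
    )
    return result.stdout


def split_halves(output):
    """Split top output into (summary lines, process lines) at the first blank line."""
    lines = output.strip("\n").splitlines()
    for index, line in enumerate(lines):
        if not line.strip():
            return lines[:index], lines[index + 1:]
    return lines, []


# Trimmed sample so the split can be tried without running top.
SAMPLE = """top - 17:12:40 up 15 days, 3:33, 2 users, load average: 0.46, 0.69, 0.34
Tasks: 71 total, 1 running, 70 sleeping, 0 stopped, 0 zombie

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    1 root 20 0 8356 800 672 S 0.0 0.0 0:12.68 init
"""

summary, processes = split_halves(SAMPLE)
print(len(summary), "summary lines,", len(processes), "process lines")
```

On the server itself, split_halves(top_snapshot()) would give the same two halves for the live system, ready to be written to a log before and during each load test.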
There is a lot of information here: it is compressed to pack a lot into a small space. The more you use top the more numbers you can understand. It's a bit like staring at a stereogram until a 3D picture appears.
My first measurements
Even before I run my first load test, I can make some useful observations about my EC2 machine.
In the example above I see an idle system. The CPU is 100% idle and no swap space is being used. It's pretty obvious to a system administrator that this EC2 machine is doing nothing.
The load average is 0.46 over the last minute (the three figures are the 1, 5 and 15 minute averages). The load average is an estimate of how much the box is doing compared to what it can handle: 1 is roughly 1 CPU working flat out, but keeping up with its workload.
Strangely, this idle box is putting in about half a CPU's worth of effort. Shouldn't a box doing nothing have a load average of zero? One possible explanation is that the hypervisor is stealing my VM's capacity and giving it to other, busier (and maybe higher-paying) customers, although the st (steal) figure in this snapshot reads 0.0%, so activity in the last few minutes may also account for it. Overselling capacity follows the same theory that an airline uses when it over-sells seats on a plane, relying on some passengers to not show up.
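The load average arithmetic can be sketched in a few lines of Python. The sketch reads the three figures from the first line of the sample display and scales the one-minute figure by the number of CPUs. The single CPU is my assumption about a small EC2 machine, not something the display states, and the function names are hypothetical. On a live Unix system the same three figures are available from os.getloadavg().

```python
import os
import re

# First line of the sample top display in this article.
TOP_LINE = "top - 17:12:40 up 15 days, 3:33, 2 users, load average: 0.46, 0.69, 0.34"


def parse_load_average(line):
    """Return the 1, 5 and 15 minute load averages from top's first line."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", line)
    if match is None:
        raise ValueError("no load average found")
    return tuple(float(value) for value in match.groups())


def load_per_cpu(load, cpu_count):
    """Scale a load average by CPU count; about 1.0 means every CPU is fully busy."""
    return load / cpu_count


one, five, fifteen = parse_load_average(TOP_LINE)
# Assumption: one CPU, so the per-CPU figure equals the raw load average.
print(f"sample 1-minute load per CPU: {load_per_cpu(one, 1):.2f}")

# The same figures for the machine running this script, where supported.
try:
    print("live load averages:", os.getloadavg(), "CPUs:", os.cpu_count())
except (AttributeError, OSError):
    print("load average not available on this platform")
```

A per-CPU figure that stays well above 1 during a load test means requests are queuing faster than the machine can serve them.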
Nick Hardiman builds and maintains the infrastructure required to run Internet services. Nick deals with the lower layers of the Internet - the machines, networks, operating systems, and applications. Nick's job stops there, and he hands over to the designers and developers who build the top layer that customers use.