Load-testing a web service: How to interpret what you see

Nick Hardiman performs load-testing on his AWS EC2 machine to see how his web service performs under increasing loads.

I'm load-testing my service from the CLI (command line interface), not the AWS console. My load-testing toolkit is three programs: top, vmstat and ab.

In my last post on the basics of stress-testing, I described why I want to find out how scalable my Drupal installation on Amazon EC2 is, and gave a run-down of the top command. Now I am going to torture my new service with five increasingly unpleasant tests.

Don't try this on a production web server. Some tests temporarily cripple customer service.

You won't increase your bill by testing your EC2 machine the way I do here. There is no extra charge for CPU usage or for using these command line tools, and requests that stay on the machine's localhost interface incur no data transfer charge.

vmstat (virtual memory statistics)

The vmstat command prints a record of what's happening with memory, swap, CPU and so on. I use the command vmstat 1 to print a line every second. The array of numbers that is displayed takes some getting used to. At first glance, it's just as confusing as top's display.

root@ip-1-2-3-4:~# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0  47880 1299004   4496  84336    4    2     4     4    5    9  0  0 100  0
 0  0  47880 1298996   4496  84336    0    0     0     0   21   20  0  0 100  0
 0  0  47880 1298996   4496  84336    0    0     0     0   27   38  0  0 100  0

vmstat makes it easy to see what the swap situation is. Swapping is something an OS does to make room in memory when it's full. The OS temporarily stores some memory contents on the disk. And when a system starts swapping, everything slows to a crawl.
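To make swapping easier to spot, a small filter can print only the samples where it is actually happening. This is a minimal sketch, assuming the 16-column layout shown above, where si and so are the seventh and eighth fields. The function name swap_lines is made up for illustration. The first sample is skipped because vmstat's first report gives averages since the last reboot rather than current activity.

```shell
# swap_lines: read vmstat output on stdin and print only the sample lines
# where memory was swapped in (si, field 7) or out (so, field 8).
# Header lines are ignored because their first field is not a number.
# The first sample is skipped: it holds averages since boot.
# Assumes the column layout shown above; the name is hypothetical.
swap_lines() {
    awk '$1 ~ /^[0-9]+$/ {
        samples++
        if (samples > 1 && ($7 > 0 || $8 > 0)) print
    }'
}

# Typical use, while a test is running:
#   vmstat 1 | swap_lines
```

If nothing is printed during a test, the si and so columns stayed at zero for the whole run.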

Some of these numbers, such as the CPU percentages, overlap with what top tells me.

The procedure is similar to the one for top.

  1. Open a CLI.
  2. Run the vmstat 1 command. A display like the one above appears.
  3. Watch the numbers. Every second a new line appears.
  4. When you've had enough, press Ctrl+C (shown on screen as ^C) to quit. The command prompt reappears.
  5. Close the CLI.

The ab load generator

The ab tool is a website load generator. I use ab to give my service an increasingly hard time. It's a little like siege.

I generate a light load using the ab command. This runs for a minute or two and prints another humongous set of numbers. The details below show what ab displays when I request the homepage at http://localhost/drupal7/.

I make sure I do not use the network by running commands on the EC2 machine, and using the localhost interface of my web server. Using the domain meant for customers, www.internetmachines.co.uk, goes via the load balancer, and I don't want that. Using the machine's public domain name, ec2-1-2-3-4.eu-west-1.compute.amazonaws.com, may get me a data transfer charge. I could use the machine's private domain name, ip-10-2-3-4.eu-west-1.compute.internal, but I'm too lazy to type it.

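The first test uses the command ab -n 100 -c 1 http://localhost/drupal7/. The -n option sets the total number of requests and the -c option sets how many requests are made at a time.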

root@ip-1-2-3-4:~# ab -n 100 -c 1 http://localhost/drupal7/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient).....done
Server Software:        Apache/2.2.16
Server Hostname:        localhost
Server Port:            80
Document Path:          /drupal7/
Document Length:        7616 bytes
Concurrency Level:      1
Time taken for tests:   33.207 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      808600 bytes
HTML transferred:       761600 bytes
Requests per second:    3.01 [#/sec] (mean)
Time per request:       332.074 [ms] (mean)
Time per request:       332.074 [ms] (mean, across all concurrent requests)
Transfer rate:          23.78 [Kbytes/sec] received
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   187  332  61.7    338     477
Waiting:      186  331  60.5    337     476
Total:        187  332  61.7    338     477
Percentage of the requests served within a certain time (ms)
  50%    338
  66%    367
  75%    371
  80%    396
  90%    400
  95%    431
  98%    458
  99%    477
 100%    477 (longest request)

If you have used the wget command, then you can probably figure out what ab does. It's a bit like running the command wget -O - http://localhost/drupal7/ many times, from many different CLIs, and timing the results.
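The headline rates in the ab report follow from three numbers: the request count, the concurrency level, and the elapsed time. Here is a minimal sketch of that arithmetic, using the figures from the run above. The function name ab_rates is hypothetical, and the results can differ from ab's in the last digit because ab works from an unrounded elapsed time.

```shell
# ab_rates REQUESTS CONCURRENCY SECONDS
# Recompute the headline figures of an ab report from its summary numbers.
ab_rates() {
    awk -v n="$1" -v c="$2" -v t="$3" 'BEGIN {
        # Throughput: completed requests divided by elapsed time.
        printf "Requests per second: %.2f\n", n / t
        # Mean time one client waits for one request.
        printf "Time per request (ms, mean): %.2f\n", c * t * 1000 / n
        # Mean time per request as seen by the server as a whole.
        printf "Time per request (ms, across all concurrent requests): %.2f\n", t * 1000 / n
    }'
}

ab_rates 100 1 33.207
# Requests per second: 3.01
# Time per request (ms, mean): 332.07
# Time per request (ms, across all concurrent requests): 332.07
```

With one client the two "time per request" figures are identical. They only diverge once the concurrency level is raised.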

The first test - 100 requests from only one client

The first test makes 100 requests, and is polite enough to make only one request at a time. The ab example above shows the full list of results.

This is the procedure. I repeat this several times, with bigger loads each time.

  1. Open three CLIs to one EC2 machine.
  2. Run the top command in one window.
  3. Run the vmstat command in another window.
  4. Watch the top numbers before ab starts. Nothing much should change.
  5. Run the ab command in the other window. It prints out the first few lines, then seems to hang for a minute or two, then prints out all the summary lines.
  6. Watch the top numbers while ab is running. The CPU changes from 100% idle to 0% idle pretty quickly. Apache and mysql go to the top of the list of processes.
  7. Watch the top numbers after ab is finished. The numbers return to normal.
  8. Quit top and vmstat.
  9. Close the three CLIs.
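The vmstat window in the procedure above can also be handled by a small script that records its output to a file while ab runs, which is handy for comparing runs afterwards. This is only a sketch: the function name run_monitored and the log file names are made up, and top still needs watching by hand in its own window.

```shell
# run_monitored MONITOR_CMD LOAD_CMD LOG_PREFIX
# Start MONITOR_CMD in the background, run LOAD_CMD to completion, then
# stop the monitor. Output is written to LOG_PREFIX.monitor.log and
# LOG_PREFIX.load.log. Names here are hypothetical.
run_monitored() {
    # "exec" makes the background PID belong to the monitor itself,
    # so the kill below reaches it directly.
    sh -c "exec $1" > "$3.monitor.log" 2>&1 &
    monitor_pid=$!

    # Run the load generator and remember whether it succeeded.
    status=0
    sh -c "$2" > "$3.load.log" 2>&1 || status=$?

    # Stop the monitor and collect it so no stray process is left behind.
    kill "$monitor_pid" 2>/dev/null || true
    wait "$monitor_pid" 2>/dev/null || true
    return "$status"
}

# Example for the first test:
#   run_monitored "vmstat 1" "ab -n 100 -c 1 http://localhost/drupal7/" test1
```

After the example call, test1.load.log would hold the ab report and test1.monitor.log the vmstat lines recorded while it ran.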

My results are good. The ab report tells me each request was answered quickly: 332 ms on average and 477 ms at worst (see above).
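Most of the ab report can be ignored when comparing one run with the next. A rough sketch of a filter that keeps only the lines I look at first, assuming the report format shown earlier (the function name ab_key_figures is hypothetical):

```shell
# ab_key_figures: read an ab report on stdin and keep the lines that answer
# "did anything fail, how fast was it, and how slow was the slowest request?"
ab_key_figures() {
    grep -E '^(Failed requests|Requests per second|Time per request|Total):|^ +100%'
}

# Typical use:
#   ab -n 100 -c 1 http://localhost/drupal7/ | ab_key_figures
```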

Top tells me practically no extra load was generated. The apache processes were the CPU hoggers.

15461 www-data  20   0  204m  28m 3624 R 21.4  1.7   0:37.14 apache2
15706 www-data  20   0  203m  28m 3624 S 21.1  1.7   0:17.17 apache2
15537 www-data  20   0  203m  28m 3624 S 14.8  1.7   0:33.85 apache2

I can see from vmstat that no swap was used (the zeroes in the middle, under si and so) and that no CPU time was left idle (the zeroes on the right, under id and wa).

 2  0  47880 1298120   4504  84336    0    0     0     0  107  414 44  5  0  0
 2  0  47880 1297996   4504  84336    0    0     0     0  121  382 38  5  0  0
 2  0  47880 1298120   4504  84336    0    0     0     0  121  444 35  2  0  0
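One detail in those lines deserves a second look: us and sy add up to well under 100, yet id and wa are both 0. The sketch below totals the four CPU columns and reports what is left over. On a virtual machine the remainder is plausibly CPU time taken by the hypervisor for other guests, which this vmstat output has no column for, but that reading is my assumption. The function name cpu_unaccounted is hypothetical.

```shell
# cpu_unaccounted: read vmstat sample lines on stdin and print, per line,
# the sum of us+sy+id+wa (fields 13 to 16) and the percentage left over.
# Assumes the 16-column layout shown above; the name is hypothetical.
cpu_unaccounted() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 16 {
        total = $13 + $14 + $15 + $16
        printf "accounted=%d%% unaccounted=%d%%\n", total, 100 - total
    }'
}

cpu_unaccounted <<'EOF'
 2  0  47880 1298120   4504  84336    0    0     0     0  107  414 44  5  0  0
 2  0  47880 1297996   4504  84336    0    0     0     0  121  382 38  5  0  0
 2  0  47880 1298120   4504  84336    0    0     0     0  121  444 35  2  0  0
EOF
# accounted=49% unaccounted=51%
# accounted=43% unaccounted=57%
# accounted=37% unaccounted=63%
```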

I will have to increase the workload.

The second test - 100 requests, 2 concurrent clients

I run this ab command: ab -n 100 -c 2 http://localhost/drupal7/.

This is where things get a little closer to the real world. Internet users do not wait patiently for others to finish. They use the website any time and get upset if it takes ten seconds to respond.

It takes much longer to answer requests, but all the requests are finished in good time (less than three seconds).

The load, however, is already too heavy. A load average of 1.81 tells me my service is "CPU bound" - the CPU is the first part of the system that struggles to cope.
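A load average is easier to judge against the number of CPUs available. The sketch below assumes a Linux machine where /proc/loadavg exists and getconf _NPROCESSORS_ONLN reports the CPU count. The article does not say how many CPUs this instance has, so the single CPU in the example call is an assumption, and the function names are hypothetical.

```shell
# load_verdict LOAD NCPU
# Print "busy" when the load average exceeds the number of CPUs
# (more work queued than there are CPUs to run it), otherwise "ok".
load_verdict() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        if (load + 0 > ncpu + 0) print "busy"; else print "ok"
    }'
}

# current_load_verdict: apply load_verdict to the 1-minute load average
# of the machine this runs on (Linux only).
current_load_verdict() {
    read -r one_min _rest < /proc/loadavg
    load_verdict "$one_min" "$(getconf _NPROCESSORS_ONLN)"
}

load_verdict 1.81 1   # prints "busy", assuming a single CPU
```

On Linux the load average also counts processes waiting on disk, so it is worth checking vmstat's wa column before blaming the CPU alone.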

The third test - 100 requests, 10 concurrent

ab -n 100 -c 10 http://localhost/drupal7/

Things are going downhill now. Every request is answered within about eight seconds, so it's not a disaster, but I have now stepped over the line.

The fourth test - 100 requests, 50 concurrent

ab -n 100 -c 50 http://localhost/drupal7/

It now takes at least 18 seconds to service a request. Oh dear, that's not acceptable. The system is using swap space because it has run out of memory.

The final test - 100 requests, 100 concurrent

ab -n 100 -c 100 http://localhost/drupal7/

I had time to read War and Peace while this test ran. The machine is in big trouble. The CPU was doing almost no useful work - it spent nearly all its time waiting for disk activity to finish.
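The five tests differ only in the concurrency level, so the whole series can be written as a loop. This is a minimal sketch that prints the commands rather than running them, so the plan can be checked before the machine is put under load. The function name ab_test_plan is made up.

```shell
# ab_test_plan URL
# Print one ab command per test, in the order used in this article:
# always 100 requests, with 1, 2, 10, 50 and then 100 concurrent clients.
ab_test_plan() {
    for concurrency in 1 2 10 50 100; do
        echo "ab -n 100 -c $concurrency $1"
    done
}

# Review the plan first. To actually run it, pipe it to sh:
#   ab_test_plan http://localhost/drupal7/ | sh
ab_test_plan http://localhost/drupal7/
```

Printing first and piping to sh second keeps a typo in the URL from turning into five wasted test runs.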

What does this mean?

I know that the load average suggests my EC2 machine is working even when it is idle. The spare CPU capacity is being taken by other EC2 machines on the same hardware. This is a quirk of virtualization: sharing your house with strangers means making compromises.

I also know that, no matter what test I run, the CPU usage shoots up to 100%. If you are used to a bare metal server, you will find this a little shocking: it's another quirk of virtualization. However, 100% CPU does not mean the service behaves poorly.

Most important for my customers, I know my service can handle fewer than 10 concurrent connections. That's a problem for a web service and must be fixed. Luckily, my service is intentionally crippled, so it's a problem that should be easy to solve.

By Nick Hardiman

Nick Hardiman builds and maintains the infrastructure required to run Internet services. Nick deals with the lower layers of the Internet - the machines, networks, operating systems, and applications. Nick's job stops there, and he hands over to the ...