Optimize your testing and analysis practices for distributed systems

Developing a thorough test plan and benchmarks for a distributed system takes careful planning. Follow this advice on tracking down bottlenecks and analyzing performance results.

This is the second in a two-part series detailing methodologies of performance analysis for distributed computer systems. The first article discussed setting up the test environment and developing a test matrix. In this article, we’ll look more closely at test-plan requirements and the testing process itself. We’ll finish by looking at the way performance benchmarking can quantify the impact of system modifications.

The test plan
Although frequently overlooked, the test plan is a crucial piece of the puzzle. Make sure it’s spelled out in granular detail before testing begins. The test plan defines testing expectations not only to the development group but also to other managers in the organization. By establishing clear expectations before you get into heavy analysis, you avoid after-the-fact grumbling from developers who’ve decided they needed other data points after all. The test plan also provides documentation for comparing future changes to the application.

Simply put, the test plan establishes the following:
  • Application dimensions that will be tested, and why
  • Configuration of the test architecture
  • Testing procedure
  • Metrics that will be collected
  • Expected outcome(s)

Devote a section of your test plan to each of these points. You may also wish to include a section for testing prerequisites if you require special environment or application configurations. Note any associated risks that may be apparent in the testing process or final results. Furthermore, when outlining your expected results, you should detail the steps to be taken if the results differ significantly from the initial expectations.

The testing process
The primary goal of testing is to deliver a comprehensive, accurate view of system performance to the engineering organization. Decisions that might be made from collected test data include the following:
  • Application configuration settings
  • System architecture design decisions
  • Coding practices
  • Scalability and capacity planning

Performance measurements taken against a distributed computer system are not always consistent. Beyond the application itself, a large number of variables deep inside operating systems and hardware implementations can affect performance, sometimes quite severely, even if only for very short periods of time. Make sure your test spans enough time to smooth these spikes out of your overall view. Generally, a measurement of performance at a constant load level should span several minutes, and a measurement of performance across a varying load level should span several hours.

Like the directions on a shampoo bottle, the testing process boils down to a few simple steps:
  1. Initialize the test environment.
  2. Run the test.
  3. Gather the results.
  4. Repeat the process.

You can roughly determine an average value and a margin of error with as few as five or six sets of results. Ideally, you should have three times that many data points to work with, but it is often hard to convince your boss that this much work is necessary for the sake of accuracy.
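
As a rough illustration of how an average and a margin of error fall out of a handful of runs, here is a minimal Python sketch. It assumes the runs are independent and roughly normally distributed, and it uses a 95 percent Student's t interval, which is one common choice rather than anything the test plan dictates. The throughput figures and the function name summarize_runs are made up for the example.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, keyed by degrees of freedom (n - 1).
# Only small sample sizes are covered, which is all this sketch needs.
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
        7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228, 11: 2.201, 12: 2.179,
        13: 2.160, 14: 2.145, 15: 2.131, 16: 2.120, 17: 2.110, 18: 2.101,
        19: 2.093}


def summarize_runs(values):
    """Return (mean, margin_of_error) for a small set of repeated test runs."""
    n = len(values)
    if n - 1 not in T_95:
        raise ValueError("this sketch handles between 2 and 20 runs")
    mean = statistics.mean(values)
    # Standard error of the mean, from the sample standard deviation.
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, T_95[n - 1] * std_err


# Hypothetical throughput (transactions per second) from six test runs.
runs = [412.0, 398.0, 405.0, 420.0, 401.0, 409.0]
mean, margin = summarize_runs(runs)
print(f"{mean:.1f} +/- {margin:.1f} tps (95% confidence)")
```

The t multiplier shrinks quickly as runs are added, which is why tripling the number of data points tightens the margin so noticeably.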

Shell scripts do the trick
As I suggested in the previous article, you should automate as much of the process as possible to reduce operator error and save valuable time.

Remote shell scripts are a good solution for achieving a fair portion of test automation in a multiserver environment. With packages like Cygwin, a UNIX environment for Windows, you can even extend these scripts to Windows environments without too much of a headache. The following scripts are useful:
  • Build automation—If you need to compare the performance of two application builds, having a script to deploy and configure the entire application onto the appropriate machines in your test environment for a specific version number reduces the time involved in the configuration process.
  • Remote monitoring—The best source of system metrics is on the system itself. If a top-tier remote monitoring solution isn’t financially feasible, consider a batch of scripts that start and stop monitoring applications, such as vmstat, iostat, and netstat on remote hosts. These scripts allow you to obtain fine-grained system data for free and with little effort. If you don’t intend to use all of the data, save it for future use.
  • Log file archiving—When the test is over, create a single script that collects data from a cluster of machines, archiving log files at a central location. While it may seem tedious, maintain detailed documentation on a notepad or Word file of how the test proceeded. When performance problems pop up, this documentation will prove invaluable in the troubleshooting process.
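
The remote monitoring and log archiving scripts described above might look something like the following sketch. It is written in Python rather than shell and only prints the ssh and scp commands it would run, so it can be tried safely. The host names, the /tmp/perftest directory, and the five-second sampling interval are all assumptions, and it presumes key-based ssh access to each host.

```python
import shlex

# Hypothetical hosts and paths; substitute your own test environment.
HOSTS = ["web01", "app01", "db01"]
REMOTE_DIR = "/tmp/perftest"
MONITORS = {"vmstat": "vmstat 5", "iostat": "iostat 5"}  # sample every 5 seconds


def start_commands(host):
    """One ssh command per monitor: start it in the background, record its PID."""
    commands = []
    for name, tool in MONITORS.items():
        remote = (
            f"mkdir -p {REMOTE_DIR}; "
            f"nohup {tool} > {REMOTE_DIR}/{name}.log 2>&1 < /dev/null & "
            f"echo $! > {REMOTE_DIR}/{name}.pid"
        )
        commands.append(["ssh", host, remote])
    return commands


def stop_commands(host):
    """One ssh command per monitor: kill the PID recorded at start."""
    return [["ssh", host, f"kill $(cat {REMOTE_DIR}/{name}.pid)"]
            for name in MONITORS]


def collect_command(host, archive_dir):
    """Copy the host's monitoring logs to a central archive directory.
    The archive directory is assumed to exist already."""
    return ["scp", "-r", f"{host}:{REMOTE_DIR}", f"{archive_dir}/{host}"]


# Dry run: print the commands in the order a test would issue them.
for host in HOSTS:
    for cmd in start_commands(host):
        print(shlex.join(cmd))
print("# ... run the load test here ...")
for host in HOSTS:
    for cmd in stop_commands(host):
        print(shlex.join(cmd))
    print(shlex.join(collect_command(host, "results")))
```

To execute the commands instead of printing them, each list could be passed to subprocess.run(cmd, check=True).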

Analysis of performance test results
If you are faced with a mountain of data for a single set of results, determine which data are essential for the performance view being tested. For example, if the transactions you’re executing touch lightly on the database, so too should your analysis. If you’re testing the business logic in a set of Enterprise JavaBeans, focus your attention on that component in the distributed application. It may be necessary to add extra logging or profiling logic to the application to get the data you need, but don’t let the extra logic distort your results. Test with the added logic and without to see whether any degradation occurred. Always compare common performance metrics, such as peak throughput and minimum latency, between builds.

Determining system bottlenecks
On a distributed computer system, performance capacity is typically unbalanced across the machines. Certain processes are cheaper to scale than others, so more machines may be added for them as needed. Still, it is important to determine which machines or processes are constraining overall throughput or adding the most latency to a transaction.

In terms of latency, if you can log the time it takes to process a given transaction at each stage in the distributed system, you can break down the transaction into a time series view that will graphically show which areas consume the most time and are candidates for improvement. Code and system profilers can then be used to examine those pieces that are taking the most time at a lower level to point out any areas that need optimizing.
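
A per-stage latency breakdown of this kind can be sketched in a few lines of Python. The record layout (transaction ID, stage name, start and end times in seconds) and the stage names are assumptions for illustration; real logs would need their own parsing step first. The text bar chart stands in for a proper graph.

```python
from collections import defaultdict

# Hypothetical log records: (transaction_id, stage, start_seconds, end_seconds).
records = [
    ("tx1", "web", 0.000, 0.012),
    ("tx1", "app", 0.012, 0.055),
    ("tx1", "db", 0.055, 0.140),
    ("tx2", "web", 1.000, 1.010),
    ("tx2", "app", 1.010, 1.060),
    ("tx2", "db", 1.060, 1.130),
]


def stage_breakdown(records):
    """Average time spent in each stage across all logged transactions."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _tx, stage, start, end in records:
        totals[stage] += end - start
        counts[stage] += 1
    return {stage: totals[stage] / counts[stage] for stage in totals}


breakdown = stage_breakdown(records)
total = sum(breakdown.values())
# Slowest stage first, with a bar proportional to its share of total latency.
for stage, avg in sorted(breakdown.items(), key=lambda kv: kv[1], reverse=True):
    bar = "#" * round(40 * avg / total)
    print(f"{stage:<4} {avg * 1000:6.1f} ms {bar}")
```

The stage at the top of the list is the first candidate for a closer look with a profiler.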

Locating throughput bottlenecks involves some investigation. You need a good understanding of process behavior for each component in the distributed system. Certain components, such as Web servers, will be CPU bound or possibly network I/O bound. Other components, such as databases and queues, are likely to be disk I/O bound. Some complex application processes may be bound by CPU, disk I/O, memory paging, or interprocess contention, depending on the use case. If your system consists of many machines and high transaction rates, you may very well be bumping up against the limitations of your available network bandwidth. The only way to tell how your application will behave is by evaluating the fundamental architecture and conducting confined test cases.

Once you understand the constraints on each component in the application, you can identify the bottlenecks. Check for saturated usr/sys CPU rates, high paging activity in the virtual memory layer, elevated disk and network bandwidth use, or a high share of CPU time spent waiting on I/O. If nothing shows up here, check run-queue depths. Large run-queue depths indicate constrained processes that may be blocked or CPU bound in some way.
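
Some of these checks can be applied mechanically to collected vmstat logs. The Python sketch below assumes the column layout printed by vmstat on Linux (other systems label columns differently) and looks columns up by header name. The thresholds are illustrative guesses, not established limits, and the sample figures are invented. Disk and network bandwidth are not covered here; they would need iostat or netstat data.

```python
# Sample in the Linux vmstat layout (an assumption; other systems differ).
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0      0 812340  10240 402100    0    0     4    12  950 2100 71 26  2  1  0
 1  0      0 811900  10240 402180    0    0     0     8  400  900 12  3 84  1  0
"""


def parse_vmstat(text):
    """Turn vmstat text into a list of {column: int} rows, keyed by the header."""
    lines = [line.split() for line in text.strip().splitlines()]
    # The column-name line is the one that starts with "r" (run queue).
    header = next(fields for fields in lines if fields and fields[0] == "r")
    rows = []
    for fields in lines:
        # Keep only purely numeric lines; this skips banner and header lines.
        if len(fields) == len(header) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(header, map(int, fields))))
    return rows


def find_symptoms(row, cpus, cpu_pct=90, wait_pct=20, queue_per_cpu=2):
    """Flag bottleneck symptoms in one sample. Thresholds are illustrative."""
    symptoms = []
    if row["us"] + row["sy"] >= cpu_pct:
        symptoms.append("cpu saturated")
    if row["si"] + row["so"] > 0:
        symptoms.append("swapping")
    if row["wa"] >= wait_pct:
        symptoms.append("waiting on I/O")
    if row["r"] > queue_per_cpu * cpus:
        symptoms.append("deep run queue")
    return symptoms


for row in parse_vmstat(SAMPLE):
    print(find_symptoms(row, cpus=4) or "no obvious constraint")
```

A single flagged sample means little; a symptom that persists across most samples at a given load level is the one worth chasing.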

Some operating systems allow you to examine process state and accounting details, offering insight into time spent waiting on I/O or user locks. Another possible culprit is network-queue depths. When they start building up on a machine, it can mean that this particular computer or the box it is “talking to” is constraining the total system throughput. If worse comes to worst, you can test each machine individually.

As applications move toward a service model, the change management life cycle shortens, increasing the importance of performance validation and benchmarking. The most valuable benchmarks include comprehensive result sets in a real-world scenario.

Begin with the log files you have collected from your test iterations over a set period of time. Feed a log file back into the application as fast as possible and measure how long it takes to run. While not a surefire method, this is a quick and dirty way to get feedback. A more attractive approach is to abstract the benchmark from the application log files, which makes for a more extensible and configurable test.

Classify the requests from log files into weighted buckets based on frequency of requests. For example, if you determine that 20 percent of requests to an HTTP server are for a CGI program, 30 percent of requests are for HTML pages, and 50 percent of requests are for images, you can construct a benchmark of these data points that will allow you to determine how system performance would be affected if you halved the number of images. Further investigation can include determining probability distributions for each bucket and then building a program to model the distributions.
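
The bucketing step can be sketched as follows in Python. The classification rules (a /cgi-bin/ prefix, a few image extensions), the request paths, and the function names are assumptions made for the example. The sketch derives the 20/30/50 mix from a tiny log, rescales the image bucket to show the new mix if image requests were halved, and generates a synthetic request stream from the weights.

```python
import random
from collections import Counter


def classify(path):
    """Assign a request path to a bucket. The rules here are illustrative."""
    if path.startswith("/cgi-bin/"):
        return "cgi"
    if path.endswith((".gif", ".jpg", ".png")):
        return "image"
    return "html"


def bucket_weights(paths):
    """Fraction of requests that fall into each bucket."""
    counts = Counter(classify(p) for p in paths)
    total = sum(counts.values())
    return {bucket: count / total for bucket, count in counts.items()}


def scale_bucket(weights, bucket, factor):
    """Scale one bucket's request volume, then renormalize the mix."""
    scaled = dict(weights)
    scaled[bucket] *= factor
    total = sum(scaled.values())
    return {b: w / total for b, w in scaled.items()}


def generate(weights, n, seed=0):
    """Draw n synthetic requests (by bucket name) according to the weights."""
    rng = random.Random(seed)
    buckets = list(weights)
    return rng.choices(buckets, weights=[weights[b] for b in buckets], k=n)


# Hypothetical request paths taken from a Web server log.
log = ["/cgi-bin/search", "/cgi-bin/cart",
       "/index.html", "/about.html", "/faq.html",
       "/a.gif", "/b.gif", "/c.jpg", "/d.jpg", "/logo.png"]

weights = bucket_weights(log)
halved = scale_bucket(weights, "image", 0.5)
print("observed mix:", weights)
print("images halved:", halved)
print("sample stream:", generate(halved, 10))
```

Replacing the fixed weights with a fitted probability distribution per bucket is the natural next step once the simple mix has been validated.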

Once established, the benchmark should be thoroughly compared with statistics from the production environment. System utilizations and timing measurements should be close. If your test system lacks the robustness of your “live” environment, you can develop theoretical scaling multipliers.

Realizing that performance testing is a procedure-rich process is perhaps the first step toward simplifying it. The initial time spent preparing for testing and developing the tools to automate and simplify the process will pay off greatly.
