Capturing and replaying Web transactions with Perl is an ideal way to trap and diagnose those frustrating cases and weird data problems that occasionally happen—like when a customer submits an order but the address is omitted, or the credit card number is garbled and you can't track the cause. Analyzing those transactions is also useful for hardening programs against errors and corrupt Web browser data. In this article, we'll examine what this technique entails and introduce a small module that shows some ways to put it to work.
Pluses and minuses of CGI
The MIME-like structure of the HTTP protocol makes creating basic Web requests easy. That simplicity, however, is also its weakness. Because there is no comprehensive data verification, ill-formed, tampered-with, or merely unusual HTTP requests pass easily through the Web infrastructure and land in your program. Those character-based HTTP headers travel through servers almost untouched, regardless of their content, and the headers plus Web server information all get passed into the CGI interface.
To test a CGI-handled HTTP request, you may need all the information that passes through the CGI interface. In Perl, a lot of it can be displayed using features of the freely available CGI.pm module and its companion debugging modules, such as CGI::Carp. Of primary interest is the GET and POST data that comes (hopefully) from a submitted form. But these modules don't provide much support for the environment variables that are also passed across the CGI interface, so the information may be incomplete. In addition, heavy use of CGI.pm can lead to poor performance, due to the numerous string copy and concatenation operations it performs inside Perl. What we need is a way of getting at the HTTP request data without disturbing CGI.pm or your transaction.
In some respects, at least, the environment variables are no big deal. After all, with a simple application of the %ENV hash, you can get the variable you want. During the test phase, or when debugging some obscure problem, these variables plus the accompanying form data are likely to come into sharp focus. However, if your end users are the public (or worse still, engineers), who knows what junk their obscure browser and first draft HTML might throw through the CGI interface? Some neat way of capturing the whole HTTP request experience, including environment, would be handy. The following simple example will show you how to accomplish this.
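For example, fishing a couple of variables out of %ENV takes only a line or two. The names below are standard CGI environment variables; the defaults are just this sketch's choices:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pluck individual variables straight out of %ENV (standard CGI names).
my $method = $ENV{REQUEST_METHOD}  || 'GET';
my $agent  = $ENV{HTTP_USER_AGENT} || 'unknown';
my $query  = $ENV{QUERY_STRING}    || '';

print "Method: $method, Agent: $agent, Query: $query\n";
```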
Download the code discussed in this article
Enter a new, small module CGI::Batch.pm. This is a developer support module, and it's pretty harmless in effect. You wouldn’t generally include it in the module installation for a live site. You would probably lean on it while polishing up your Perl, maybe during system testing. It contains two simple routines:
The record() subroutine writes out the environment to one file. If it detects that the HTTP headers are followed by content (as is the case for a POST transaction), it also reads everything available on STDIN into a second file. Then it closes STDIN and reopens it at the beginning of that second file. Any processing after that, such as you might do with CGI.pm, is unaware that anything has changed, so the HTTP request continues through your program as though record() never happened. Of course, close inspection of the STDIN file handle would reveal the change, but there's not much need to do that in CGI.
The play() subroutine does the same fiddling around, but in reverse. It first overwrites your current environment with the contents of the saved environment file. If it detects a POST request, it closes STDIN and reopens it based on the saved POST data. That environment and POST input then passes through your Perl logic as if it came from a Web server.
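The real code is in the article's download; purely as an illustration, a minimal sketch of the two routines might look like the following. The /tmp file names are this sketch's assumptions, not necessarily the module's:

```perl
package CGI::Batch;
use strict;
use warnings;

# Illustrative file names; the real module's choices may differ.
my $ENV_FILE  = '/tmp/cgibatch.env';
my $POST_FILE = '/tmp/cgibatch.post';

# record(): save the environment and, for POST requests, the body on
# STDIN; then reopen STDIN on the saved body so later code sees no change.
sub record {
    open my $env, '>', $ENV_FILE or die "Cannot write $ENV_FILE: $!";
    print $env "$_=", ($ENV{$_} // ''), "\n" for sort keys %ENV;
    close $env;

    if (defined $ENV{CONTENT_LENGTH}) {    # no variable means no content
        read STDIN, my $body, $ENV{CONTENT_LENGTH};
        open my $post, '>', $POST_FILE or die "Cannot write $POST_FILE: $!";
        print $post $body;
        close $post;
        close STDIN;
        open STDIN, '<', $POST_FILE or die "Cannot reopen STDIN: $!";
    }
}

# play(): the reverse - overwrite %ENV from the saved file and, for a
# POST request, point STDIN at the saved body.
sub play {
    %ENV = ();
    open my $env, '<', $ENV_FILE or die "Cannot read $ENV_FILE: $!";
    while (<$env>) {
        chomp;
        my ($key, $value) = split /=/, $_, 2;   # values may contain '='
        $ENV{$key} = $value;
    }
    close $env;

    if (defined $ENV{CONTENT_LENGTH}) {
        close STDIN;
        open STDIN, '<', $POST_FILE or die "Cannot reopen STDIN: $!";
    }
}

1;
```

Note that this version would mangle environment values containing newlines; the point is only the close-and-reopen trick around STDIN.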
Recall that the METHOD attribute of the <FORM> HTML tag can be set to either GET or POST. In both CGI::Batch routines, POST data is detected by searching for a CONTENT_LENGTH environment variable. No variable means no content.
A sample debug cycle
A typical use of CGI::Batch goes as follows. Start by pulling the module into your troublesome CGI program and calling its record() routine right at the top (the program should be installed behind a Web server, preferably a test Web server).
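The exact lines are in the download; they amount to something like this (the library path below is a placeholder for wherever you keep CGI::Batch.pm):

```perl
# At the top of the troublesome CGI program:
use lib '/path/to/your/modules';   # placeholder path to CGI::Batch.pm
use CGI::Batch;

CGI::Batch::record();              # capture this transaction to /tmp
```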
Now, run your Web client as normal. After the CGI program has run once, new files exist in the server’s /tmp directory. View them as a sanity check, and perhaps you’ll spot two HTML form elements with the same NAME or ID—a common client-side slip.
Copy the CGI program somewhere temporary and change the record() subroutine to play() in the copy. Run the copy by hand on the command line and watch it repeat the CGI transaction that occurred before. You can repeat it over and over and you can use the Perl debugger and other niceties. You can also experiment with changing the input files. No more guessing in the dark or trawling through debug logs.
These routines are actually quite general, so they'll be effective regardless of the type of content that the interface encounters. For instance, some madman might connect to your Web server’s IP address and port number using a Telnet client. He might start typing HTTP headers and then change his mind and make the rest of the headers SMTP (e-mail). In that case, provided the Web server itself doesn’t choke horribly, these routines will still pick up the content that makes it across the CGI interface. The HTTP headers depend on the underlying MIME standard just as SMTP headers do, and this tool relies on common MIME information.
This article includes a simple client-server transaction you can play with. (The links to the necessary files appear above.) Formtest.html sends the simplest of forms to the server. Test.cgi ignores the form content and instead returns an HTML page that displays the contents of the files recorded by CGI::Batch.pm. It is very straightforward. Try running Test.cgi both by hand and installed behind a server. Remember: No changes are required to process the form data directly—just do what you always do.
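Test.cgi itself is in the download; in outline it does something like the following sketch, which simply echoes whatever was recorded (the /tmp file names are assumptions, and HTML escaping is omitted for brevity):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of a Test.cgi-style page: ignore the form data and just show
# the files recorded by CGI::Batch.
my $html = "Content-type: text/html\r\n\r\n";
$html .= "<html><body><h1>Recorded transaction</h1>\n";
for my $file ('/tmp/cgibatch.env', '/tmp/cgibatch.post') {
    $html .= "<h2>$file</h2><pre>\n";
    if (open my $fh, '<', $file) {
        local $/;                      # slurp the whole file
        $html .= <$fh> // '';
        close $fh;
    }
    else {
        $html .= "(nothing recorded)\n";
    }
    $html .= "</pre>\n";
}
$html .= "</body></html>\n";
print $html;
```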
CGI::Batch is such a simple trick that reimplementing it to support mod_perl isn't hard. Integrate the dumping of the file with the point in your code where STDIN is read, and have record() and play() merely set flags that an if around that reading point consults. Easy.
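As a sketch of that flag idea (the flag names, routine name, and file name here are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Flag-based variant for mod_perl: record() and play() would only set
# these flags; the one place that reads the request body consults them,
# so STDIN is never closed and reopened.
our ($RECORDING, $PLAYING);

sub read_body {
    my $length = $ENV{CONTENT_LENGTH} or return '';
    my $body;
    if ($PLAYING) {
        # Replay: take the body from the saved file, not the socket.
        open my $fh, '<', '/tmp/cgibatch.post' or die "replay: $!";
        local $/;
        $body = <$fh>;
        close $fh;
    }
    else {
        read STDIN, $body, $length;
        if ($RECORDING) {
            open my $fh, '>', '/tmp/cgibatch.post' or die "record: $!";
            print $fh $body;
            close $fh;
        }
    }
    return $body;
}
```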
As always, be cautious when using mod_perl. Because the pool of static data is larger for CGI programs running under mod_perl than for separately executed CGIs, the total interface presented to your script is technically bigger than the plain CGI interface that CGI::Batch captures. To use CGI::Batch as-is in such an environment, you'll have to turn off mod_perl support temporarily, because the module closes and reopens STDIN. Although turning off mod_perl is inconvenient, it also forces you to test your use of mod_perl. If a captured HTTP request works when replayed stand-alone through CGI::Batch but not live under mod_perl, you can't blame the Web server. In that case, you'd better refresh yourself on the mod_perl coding rules.
Finally, if your CGI script interacts with a relational (or other) database, beware of "irreproducibility." You can't add part number 2004 to a relational database more than once if the part number is a primary key. In that case, blindly rerunning play() won't work. Since your Perl code is neatly divided into interpreting the form submission (isn't it?), processing the data transaction, and rendering the replacement page, you can easily work around this: just print the database operations instead of executing them. For a fully reproducible through-test of a database transaction, CGI::Batch needs to be accompanied by a little more infrastructure. That's an article for another day.
CGI::Batch removes uncertainty from the CGI testing process. Your Perl CGI can now be white box tested as well as black box tested. According to received wisdom, white box testing is faster at fixing specific defects, whereas black box testing is better at finding them. Anything that provides a controlled environment for testing has to be a good thing for an interpreted language.