
Avoid bad form data with a little CGI validation code

Validating data from a Web form with a CGI script is a standard practice. Find out how to tweak your form validation code using regular expressions and a systematic approach.


Form validation is one of the essentials when you're building a Web form front end. Besides keeping out bogus information, validation also helps protect your Web server from a nefarious end user trying to crash your Web site. If you use Perl CGI to check the data, consider this systematic approach, which uses regular expressions to do the dirty work of testing.

A normal form submission URL (a GET request in this case) might look like this:
http://www.test.com/cgi/test.cgi?name=Fred;age=30;job=construction

Your program, however, could easily receive this instead:
http://www.test.com/cgi/test.cgi?what@#=5is2;;&==3this%^&stuff#$4!

First line of defense
The first line of defense against bad form data is validation. If the form data seems messed up, don't let your CGI program process it. Some of that defense is done for you: The Apache server picks apart the HTTP request headers that make up the form submission, and the CGI interface inside Apache breaks down some of the details of the form submission.

If your Perl CGI uses a module such as CGI.pm, some further preventative analysis of the submission occurs. However, if you examine that module, or other similar ones, such as URI.pm, you'll find that that analysis is pretty basic. Ninety-nine percent of the time it's enough, but if something goes wrong, you won't hear a peep from CGI.pm because its bad data error reporting is minimal. You'll need your own validation code to take up the slack.

Easy type handling means easy errors
Perl and other scripting languages, like Tcl and JavaScript, convert easily between variable types. There's no compiler nagging you that your (char *) should really be a (struct foo *), as there is in C/C++. You can just get on with it. The downside is that your program is less bulletproof. Since your form data ends up in such loosely typed variables, sloppy handling can undermine the validation process. Let's see how.

Listing A (naive.cgi) shows where the trouble begins. This simple code contains numerous type problems (one per line) that can easily slip into Perl code. There are all kinds of problem variables: ones containing data of the wrong type, uninitialized ones, and just plain missing ones. If you run this script with the -w option, as all Perl scripters should, you'll see the perl5 interpreter complain loudly and at length about most of the problems illustrated. If you remove the -w option, the script will appear to do all the required calculations without error.
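Listing A itself isn't reproduced here, but a short sketch in the same spirit (with made-up variable names, not the listing's own) gives the flavor of the trouble:

#!/usr/bin/perl -w
# A sketch in the spirit of naive.cgi, not the listing itself.
# Every statement "works", but -w flags most of them at run time.

my $age  = "thirty";               # wrong type: a word where a number belongs
my $rate;                          # declared but never given a value
my %form = ( name => "Fred" );     # no "job" entry at all

my $next_year = $age + 1;          # "thirty" quietly becomes 0; -w warns
my $pay       = $rate * 40;        # undef treated as 0; -w warns
my $job       = $form{job};        # missing key: $job is undef, no complaint yet

print "Next year: $next_year, pay: $pay, job: $job\n";   # -w warns about undef $job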

The -w option is just about mandatory for Perl scripts, but it won't catch all of these problems when the data comes from outside the program; it catches only the ones that occur on a given test run. Without careful analysis of incoming data, there's no telling what strange calculation or complaint your script might produce at some future date, when it will be hard to remember how everything was supposed to work.

Validation steps
Fortunately, submitted form data is supposed to have some structure. Not only is it a set of name=value pairs, it should also be exactly what your script is expecting. To bulletproof your script, be very picky about checking that what you got is what you expected.

Listing B (formval.cgi) shows these checks broken down into six steps. The script is meant to be run by hand on the command line, away from any Web server, rather than behind CGI, although it can easily do that too. It is intended as a demonstration of good practice.

Test 1
Test 1 checks that the whole form submission string makes sense. This is a bit redundant, since we're collecting our data from a CGI.pm object, but some of these checks go beyond what CGI.pm does. Is the data too big? Someone might be trying to flood our program with a huge POST. Is it missing entirely? Is it poorly formed? These are basic format checks that Perl regular expressions (REs) are well suited to.
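As a rough sketch (assuming a GET request, so the raw data sits in $ENV{QUERY_STRING}, and picking 1,024 bytes as an arbitrary size limit), these checks might look like this:

#!/usr/bin/perl -w
use strict;

my $raw = defined $ENV{QUERY_STRING} ? $ENV{QUERY_STRING} : '';

# Is the data missing?  Is it suspiciously large?
# (A real CGI would log and reply politely rather than simply die.)
die "No form data received\n"        if length($raw) == 0;
die "Form data suspiciously large\n" if length($raw) > 1024;

# Is it well formed?  Expect only name=value pairs separated by ';' or '&',
# built from characters that are legal in URL-encoded data.
die "Form data is badly formed\n"
    unless $raw =~ /^[\w.%+-]+=[\w.%+-]*([;&][\w.%+-]+=[\w.%+-]*)*$/;

print "Submission string looks sane\n";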

Test 2
Test 2 checks that the parameter names are the ones we expect. At the top of the script is a set of lists defining which parameters the script expects to receive, which keeps the logic entirely data-driven: just change the lists for the next script. If an expected parameter doesn't appear, you can immediately question where the form data came from. In the sample script, the regular expression alternation character ('|') is used to match all the parameter names. The regular expression could be constructed a little more pedantically (as written, "age" would also match "page"), but the principle should be obvious.
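A sketch of the idea, using the parameter names from the sample URL and anchoring the pattern so "age" can't match "page":

#!/usr/bin/perl -w
use strict;
use CGI;

# Data-driven list of the parameters this script expects.
my @expected = qw(name age job);
my $q        = CGI->new;

# Build one anchored alternation from the list.
my $name_re = '^(' . join('|', @expected) . ')$';

# Anything submitted that we didn't ask for?
for my $param ($q->param) {
    warn "Unexpected parameter: $param\n" unless $param =~ /$name_re/;
}

# Anything we asked for that didn't arrive?
for my $want (@expected) {
    warn "Missing expected parameter: $want\n" unless defined $q->param($want);
}

Like formval.cgi, a sketch like this can be exercised by hand: CGI.pm's debugging mode accepts name=value pairs as command-line arguments, so no Web server is needed.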

Test 3
Test 3 determines how many times each parameter appears. We might expect duplicates from a set of check boxes but not from a text area, a menu, or a set of radio buttons. If we get more than one parameter with the same name where only one belongs, we should suspect a flaw in the form doing the submission.
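A sketch of the counting; the names and limits here are illustrative (imagine "hobby" as a set of check boxes that may legitimately appear more than once):

#!/usr/bin/perl -w
use strict;
use CGI;

# How many values may each parameter legitimately carry?
my %max_allowed = ( name => 1, age => 1, job => 1, hobby => 10 );
my $q = CGI->new;

for my $param ($q->param) {
    my @values = $q->param($param);          # list context: every value sent
    my $limit  = $max_allowed{$param} || 1;
    warn "Too many values (" . scalar(@values) . ") for '$param'\n"
        if @values > $limit;
}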

More tests
We have three "does not exist" cases to test. Test 2 covers the possibility that a parameter is missing altogether. Tests 4 and 5 cover the other two: the parameter's value exists but contains no data (a zero-length string), and the parameter has no value at all. The last case is particularly important if you're sending the data to a database, since a NOT NULL table column can't be filled from an effectively null parameter.
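The two cases can be told apart by looking at the value CGI.pm hands back; a sketch, again using the sample parameter names:

#!/usr/bin/perl -w
use strict;
use CGI;

my @expected = qw(name age job);
my $q        = CGI->new;

for my $param (@expected) {
    my $value = $q->param($param);

    if (!defined $value) {                   # no value at all
        warn "Parameter '$param' has no value at all\n";
    }
    elsif (length($value) == 0) {            # value present but empty
        warn "Parameter '$param' is an empty string\n";
    }
}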

Finally, you want to be sure that the supplied values make sense. The %param_types hash provides an opportunity to polish your regular expressions a little. Each parameter's value is checked against an RE that represents the allowed values. The example for "age" illustrates the careful consideration that REs sometimes require: A simple expression like /[0-9]{1,}/ won't do. That expression would allow 453, 00, and 007 as ages. Instead, the supplied RE makes sure that only integers between 0 and 199 are allowed.
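A sketch of what such a hash might look like follows. The patterns here are illustrative rather than copied from Listing B, but the "age" pattern shows one way to allow only 0 through 199 with no leading zeros:

#!/usr/bin/perl -w
use strict;
use CGI;

# Illustrative stand-in for the article's %param_types hash: each expected
# parameter maps to an RE describing its legal values.
my %param_types = (
    name => qr/^[A-Za-z][A-Za-z' -]{0,39}$/,      # a plausible-looking name
    age  => qr/^(0|[1-9][0-9]?|1[0-9][0-9])$/,    # 0-199, no leading zeros
    job  => qr/^[A-Za-z][A-Za-z ]{0,39}$/,        # letters and spaces only
);
my $q = CGI->new;

for my $param (keys %param_types) {
    my $value = $q->param($param);
    next unless defined $value;                   # missing values caught earlier
    warn "Bad value for '$param': '$value'\n"
        unless $value =~ $param_types{$param};
}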

Summary
You can do quite a lot of validation when you receive a form submission. Fortunately, the process is much the same for all forms, so get it right once and you can reuse the effort in your other scripts. With this bulletproofing in place, the rest of the processing in your script can be attacked free of any anxiety about the quality of the data you're using.
