
Using Perl to take control of HTTP caching

Proper use of caching techniques can make a big difference in application performance and reliability. Learn how to use Perl to cache and expire data within a Web browser.

A user types a URL into a Web browser. The browser fetches it from a Web server. The user examines the result. Then, a thousand users do the same thing. This time, the network slows to a crawl, the server crashes, and the fetched resources are all out of date. Not so good. If only the server-side programmer had planned to use HTTP caching and expiry. This article shows how to take advantage of these features with the help of Perl.

More than just server and browser
When you're sitting at the test bench, putting your new application server code through its paces, life is pretty simple. Your environment is a test client computer with a browser and a test Web server. If only it were like that in real life.

In actuality, your server is just the HTTP "origin server." Between it and the user's Web browser could be any number of proxy HTTP servers, each with its own caching policy. HTTP's caching model allows you to request a URL from your server but have the response come from an intermediate proxy server that remembered it from an earlier request. You might know nothing about that proxy. It is all in the name of saving Internet bandwidth. See RFC 2616, "Hypertext Transfer Protocol -- HTTP/1.1," for the gory details.

For static information, like images and plain HTML pages, this series of proxy caches is handy. When a thousand users type in the same URL, many of those users are served by a proxy. This saves both the server and the network bandwidth near the server.

But for dynamic information, such as that produced by any Web-based application, proxy caches are a definite source of concern. How do you know that a dynamically generated page from hours, days, or weeks ago isn't sitting in a cache somewhere, waiting for an innocent browser request?

HTTP contains some safeguards against this kind of thing. HTTP request-response pairs are supposed to be "semantically transparent": the presence of any proxies between the origin server and the client must not alter the net exchange of information.

This semantic guarantee relies on both ends following the rules of HTTP. If a Web page is dynamically generated from your program, you could be responsible for some of these headers, so it's best to know what headers to put in.

What Web browsers ask your server program for
The HTTP request-response exchange is a standard procedure whose caching behavior can be influenced by either the client (browser) or the server (your program). First, you need to know what that tricky Web browser is sending.

Before confusion sets in, note that the Web browser cache has little or nothing to do with HTTP. If there were ever any doubt, consider the case of Internet Explorer 6.0 below.

The Perl in Listing A, hacked along the lines of the standard advice in the perlipc man page, can be used to sniff a Web browser's headers. This script must run on your Web server with either administrator (Windows) or root (Linux/UNIX) privileges. Any existing Web server, such as IIS or Apache, must be shut down.
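
Listing A isn't reproduced here, but a minimal sketch along the same lines looks like the following. It uses IO::Socket::INET rather than the raw socket calls in perlipc, and the function and variable names are mine, not the listing's. Run it with a port number (80 to stand in for a real Web server) to listen:

```perl
#!/usr/bin/perl
# Minimal HTTP header sniffer, loosely following the perlipc
# server examples. Binding port 80 requires administrator/root
# privileges and a stopped Web server.
use strict;
use warnings;
use IO::Socket::INET;

# Read request lines from a filehandle up to the blank line
# that ends the HTTP header block.
sub read_headers {
    my ($fh) = @_;
    my @lines;
    while (defined(my $line = <$fh>)) {
        $line =~ s/\r?\n\z//;
        last if $line eq '';          # blank line: end of headers
        push @lines, $line;
    }
    return @lines;
}

# Pass a port (e.g. 80) on the command line to actually listen.
if (my $port = shift @ARGV) {
    my $server = IO::Socket::INET->new(
        LocalPort => $port,
        Listen    => 5,
        Reuse     => 1,
    ) or die "Cannot listen on port $port: $!";
    my $client = $server->accept;
    print "$_\n" for read_headers($client);
}
```

The script accepts one connection, prints the request headers, and exits, leaving the browser hanging just as described below.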

If you run this script, you'll see it sit there and do nothing in particular. Jump into the Web browser and try to load a URL at that server. You'll see the browser hang (never mind), but the script will print the headers on the server. Then, the script will hang (never mind). Just kill the script to start over again and unblock the browser. You've got the headers, which was the point. Here are a few of the headers IE 6.0 outputs:
GET / HTTP/1.1
If-Modified-Since: Thu, 06 Jun 2002 06:49:32 GMT
If-None-Match: "340da-fa-3cff05fc"
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; TUCOWS)
Host: saturn
Connection: Keep-Alive

Note two points here. First, no matter how you change the caching options for IE 6.0, the headers stay the same. So the client-side browser cache (under Tools | Internet Options | General | Temporary Internet Files | Settings) is a nonissue. Ignore it. Second, and more important, is the evil If-Modified-Since header. This header asks the server to skip resending the resource (by replying 304 Not Modified) if it hasn't changed since the given date. To answer it sensibly, your programmatically served-up code should supply that modification-date information.

Bear in mind that there is some sanity at work here. HTTP's semantic transparency helps you if you do nothing with headers. POST requests (form submissions) in particular are correctly handled. If you use a GET request, however, or modify the response headers of a POST request, unexpected caching can occur.

For completeness, Mozilla/Netscape sends this If-Modified-Since header only when Edit | Preferences | Advanced | Cache is set to When The Page Is Out Of Date. That is a safer browser strategy for server-side programmers, but a less clever one.

Correctly annotating your dynamic HTTP responses
So here's how to do The Right Thing. Since you're going to manipulate the HTTP response headers in your dynamically generated page, always observe the first rule of network programming: What you send must be perfect. In particular, dates must be formatted exactly to the standard. For example:
Date: Sun, 21 Jul 2002 08:12:13 GMT

HTTP parsers also tolerate two older date formats (RFC 850 and the ANSI C asctime() style), but RFC 2616 requires that you generate dates in the RFC 1123 format shown above, always in GMT. Note that raw UNIX date(1) output, such as Sun Jul 21 08:12:13 EST 2002, is not a valid HTTP date.
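
Getting the format exactly right is easy from Perl with the core POSIX module. A sketch, assuming the usual C locale for English day and month names (http_date is my helper name, not a standard function):

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# HTTP dates are RFC 1123 format, always expressed in GMT.
sub http_date {
    my ($epoch) = @_;
    return strftime("%a, %d %b %Y %H:%M:%S GMT", gmtime($epoch));
}

print "Date: ", http_date(time), "\n";
```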

Output all headers before any HTML/XML content and before the blank line that precedes that content. The headers themselves can appear in any order.

The Date header doesn't help with caching at all. It dates only the message, not the URL resource that's being returned. The Perl CGI module, for example, tacks it on automatically for you. Forget it.

Your biggest gun is the HTTP header Cache-Control. If you set it like this:
Cache-Control: no-cache

the URL resource you return to the browser will never be cached. All fixed. If by some unhappy chance your application has been running without this header, a tiny possibility exists that an old copy might be hanging around somewhere in the world, waiting to cause trouble. But unless you're Amazon.com, that's pretty unlikely.
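
A minimal sketch of emitting that header from a CGI-style Perl program (no_cache_header is a hypothetical helper; the blank line it appends is what ends the header block):

```perl
use strict;
use warnings;

# Build a complete CGI header block that forbids caching.
sub no_cache_header {
    my ($type) = @_;
    return "Content-Type: $type\r\n"
         . "Cache-Control: no-cache\r\n"
         . "\r\n";               # blank line: end of headers
}

print no_cache_header('text/html');
print "<html><body>Fresh every time</body></html>\n";
```

If you prefer the CGI module, passing a -Cache_Control parameter to its header() method should emit an equivalent line.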

To benefit from caching, but with an insurance policy, use the Expires header. This header states the date after which any cached copies must be discarded. So for a one-week media blitz, you might set the expiry date seven days in the future. Setting it to a date in the past has the same effect as Cache-Control: no-cache.
Expires: Sun, 28 Jul 2002 08:12:13 GMT

For your dynamic pages, the header you really need to include is Last-Modified. For static files, this header comes from a file system time stamp; since your page is dynamically generated, it has no such time stamp, so the date needs to be coded in. This is also the header that answers the If-Modified-Since header from the browser. You really should make the effort to add it, even if Perl's CGI module doesn't add it for you. Here's an example:
Last-Modified: Thu, 25 Jul 2002 08:12:13 GMT

Make the modification date equal to "now" and any queries from browsers (and caching proxies) will know that the page changed when you generated it.
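
Putting that to work, a dynamically generated page stamped as modified "now" might emit its whole header block like this (a sketch, with a small strftime helper assuming a C locale):

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# RFC 1123 date in GMT, as HTTP requires.
sub http_date { strftime("%a, %d %b %Y %H:%M:%S GMT", gmtime(shift)) }

my $now = time;
print "Content-Type: text/html\r\n";
print "Last-Modified: ", http_date($now), "\r\n";
print "\r\n";                    # blank line: end of headers
print "<html><body>Report generated just now</body></html>\n";
```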

Exploiting read-consistent views
So far, all the headers we've noted are designed to reduce caching or turn it off completely. But you can exploit caching too.

In database terms, read-consistent views occur when you query a database for some data. You don't want your data to be messed up by last-minute additions made by other users. The database handles this by marking a point in time and sending you the state of all the data at that point.

You can do the same using the HTTP caching system, especially in a corporate setting. Suppose that a given report requires extensive computation—maybe it's a monthly transaction summary report. Place a Web proxy (perhaps Squid) between your Web server and the user community. The first time someone wants to view the report, you'll have to generate it. However, you can arrange matters so that you needn't generate it again. You can either set the expiry date on the generated copy to 28, 29, 30, or 31 days in the future (depending on the month) or set the Last-Modified date to the fixed date when the first copy was generated. That way, the cache knows on all subsequent requests that its existing copy is already up to date. When the proxy does revalidate with an If-Modified-Since request, answer with a short 304 Not Modified response rather than regenerating the report, and rely on the cache sending its earlier copy to the user.
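
One way to arrange this in a CGI-style Perl program: keep the fixed Last-Modified date and answer revalidation requests with 304 Not Modified. A sketch, assuming a CGI environment (where the request header arrives as the HTTP_IF_MODIFIED_SINCE environment variable), an exact-match date comparison, and a made-up report date:

```perl
use strict;
use warnings;

# Fixed date stamp from when the report was first generated
# (hypothetical value for illustration).
my $report_date = 'Thu, 25 Jul 2002 08:12:13 GMT';

# Given the If-Modified-Since value the proxy sent (or undef),
# return either a 304 revalidation reply or the full report.
sub decide_response {
    my ($if_modified_since, $generate) = @_;
    if (defined $if_modified_since
        && $if_modified_since eq $report_date) {
        # Exact date match: the proxy's copy is current; no body needed.
        return "Status: 304 Not Modified\r\n\r\n";
    }
    return "Content-Type: text/html\r\n"
         . "Last-Modified: $report_date\r\n"
         . "\r\n"
         . $generate->();
}

# Under CGI, request headers arrive as environment variables.
print decide_response(
    $ENV{HTTP_IF_MODIFIED_SINCE},
    sub { "<html><body>Expensive monthly report</body></html>\n" },
);
```

The expensive generation code runs only inside the closure, so a cache hit costs the server almost nothing.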

These tricks can also be used on static HTML pages. Just use a tag like this in the HEAD section (browsers honor such META tags, but intermediate proxies, which don't parse HTML, generally do not):
<META http-equiv="Expires" content="Thu, 25 Jul 2002 08:12:13 GMT">

If you don't manage the time-oriented headers in your Web server programs, you might notice during testing that things seem awfully inconsistent. That's caching behavior at work. Worse, code might work in the test lab, where there are no caches, but fail in the real world. For correct behavior, always add proper date stamps and cache instructions to your dynamically generated pages.
