You’ve probably already heard of RSS, the XML-based
format that lets Web sites publish and syndicate their latest content
to all interested parties. RSS is a boon to the lazy Webmaster,
who no longer has to update a Web site manually with new
content.
Instead, all a Webmaster has to do is plug in an RSS client,
point it to the appropriate Web sites, and sit back and let the site
“update itself” with news, weather forecasts, stock market data, and
software alerts. You’ve already seen, in previous
articles, how you can use the ASP.NET
platform to manually parse an RSS feed and extract information from it by
searching for the appropriate elements. But I’m a UNIX guy, and I have
something that’s even better than ASP.NET. It’s called Perl.
Installing XML::RSS
RSS parsing in Perl is
usually handled by the XML::RSS CPAN package. Unlike ASP.NET, which comes with
a generic XML parser and expects you to manually write RSS-parsing code, the
XML::RSS package is specifically designed to read and parse RSS feeds. When you
give XML::RSS an RSS feed, it converts the <item> elements in the feed
into an array of Perl hashes, and exposes methods and properties for accessing the
data in the feed. At the time this was written, XML::RSS supported versions 0.9, 0.91, and 1.0 of
RSS; later releases also handle RSS 2.0.
Additional resources
- Read more about the RSS specification
- What’s wrong with RSS is also what’s right with it
- QuickStart: Really Simple Syndication (RSS)
XML::RSS is written entirely in Perl, but it isn’t part of the standard Perl
distribution, so you must install it from CPAN. Detailed installation instructions are provided in the
download archive, but by far the simplest way to install it is to use the CPAN
shell, as follows:
shell> perl -MCPAN -e shell
cpan> install XML::RSS
If you use the CPAN shell, dependencies will be
automatically downloaded for you (unless you told the shell not to download
dependent modules). If you manually download and install the module, you may
need to download and install the XML::Parser module before XML::RSS can be
installed. The examples in this tutorial also need the LWP::Simple package, so
you should download and install that one too if you don’t already have it.
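Before going further, it can help to confirm that the modules are actually visible to your Perl interpreter. The following is only a minimal sketch: the helper name module_available() is made up for this example, and it simply tries to load each module and reports the result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return 1 if the named module can be loaded, 0 otherwise.
sub module_available {
    my ($module) = @_;
    # Accept only plain module names before handing them to a string eval.
    return 0 unless $module =~ /\A\w+(?:::\w+)*\z/;
    return eval "require $module; 1" ? 1 : 0;
}

# Report on the three modules this tutorial relies on.
for my $module (qw(XML::Parser XML::RSS LWP::Simple)) {
    printf "%-12s %s\n", $module,
        module_available($module) ? 'installed' : 'MISSING';
}
```

Any module reported as MISSING can be installed from the CPAN shell in the same way as XML::RSS.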
Basic usage
For our example, we’ll assume that you’re interested in
displaying the latest geek news from Slashdot on your site. Slashdot
publishes its headlines as an RSS feed. The script in Listing A retrieves this feed, parses it, and turns it into a
human-readable HTML page using XML::RSS.
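Listing A itself is not reproduced in this text. As a stand-in, here is a minimal sketch of a script along the same lines. It is not the original listing: the feed URL is a placeholder, and the helper names escape_html() and feed_to_html() are invented for the example. The XML::RSS calls it relies on are parse(), channel() and the items array.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::RSS;

# Placeholder address (an assumption): substitute the URL of the feed you want.
my $FEED_URL = 'http://example.com/feed.rdf';

# Escape characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Parse raw RSS text and return an HTML fragment listing its items.
sub feed_to_html {
    my ($raw) = @_;
    my $rss = XML::RSS->new;
    $rss->parse($raw);

    my $html = '<h2>' . escape_html($rss->channel('title')) . "</h2>\n<ul>\n";
    foreach my $item (@{ $rss->{items} }) {
        $html .= sprintf qq{<li><a href="%s">%s</a><br />%s</li>\n},
            escape_html($item->{link}),
            escape_html($item->{title}),
            escape_html($item->{description});
    }
    return $html . "</ul>\n";
}

# Run as a CGI script: fetch the feed, then print an HTML page.
sub main {
    my $raw = get($FEED_URL);
    print "Content-Type: text/html\n\n<html><body>\n";
    print defined $raw ? feed_to_html($raw)
                       : "<p>Could not retrieve the feed.</p>\n";
    print "</body></html>\n";
}

main() unless caller;
```

The sketch escapes every value before printing it, so markup inside a feed’s description shows up as literal text. That is a cautious choice for content you do not control, not something the original listing is known to do.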
Place the script in your Web server’s cgi-bin/ directory. Remember to make it executable, and then
browse to it using your Web browser. After a short wait for the RSS file to
download, you should see something like Figure A.
Figure A: Slashdot RSS feed
How does the script in Listing A work? Well, the first task
is to get the RSS feed from the remote system to the local one. This is
accomplished with the LWP::Simple package, which acts as an HTTP client and
opens a network connection to the remote site to retrieve the RSS data. An
XML::RSS object is then created, and the raw data is passed to it for
processing.
The various elements of the RSS feed are converted into Perl
structures, and a foreach() loop
iterates over the array of items. Each item is a hash whose title, link, and
description keys hold the item name, URL, and description; these values are used to
dynamically build a readable list of news items. Each time Slashdot updates its
RSS feed, the list of items displayed by the script changes
automatically, with no manual intervention required.
The script in Listing A will work with other RSS feeds as
well. Simply alter the URL passed to LWP::Simple’s get() function, and watch as the list of items displayed by the
script changes.
Here are some RSS feeds to get you started
Tip: Notice that
the RSS channel name (and description) can be obtained with the object’s channel() method, which accepts any one
of three arguments (title, description or link) and returns the corresponding
channel value.
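As a small illustration of the tip above, the sketch below calls channel() once for each of the three arguments. The helper name channel_info() and the inline feed are made up for the example, so that it runs without a network connection.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

# Hypothetical helper: collect the three channel fields into one hash reference.
sub channel_info {
    my ($rss) = @_;
    return { map { ($_ => $rss->channel($_)) } qw(title description link) };
}

# A tiny made-up feed, parsed from a string instead of a remote site.
my $rss = XML::RSS->new;
$rss->parse(<<'XML');
<?xml version="1.0"?>
<rss version="0.91">
  <channel>
    <title>Example Channel</title>
    <link>http://example.com/</link>
    <description>A made-up feed for illustration</description>
  </channel>
</rss>
XML

my $info = channel_info($rss);
print "$_: $info->{$_}\n" for qw(title description link);
```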
Adding multiple sources and optimizing performance
So that takes care of adding a feed to your Web site. But
hey, why limit yourself to one when you can have many? Listing B, a revision of Listing A, sets up an array containing
the names of several RSS feeds, and iterates over the array to produce a
page containing multiple channels of information.
Figure B shows
you what it looks like.
Figure B: Several RSS feeds
You’ll notice, if you’re sharp-eyed, that Listing B uses the
parsefile() method to read a local
copy of the RSS file, instead of using LWP to retrieve it from the remote
site. This revision improves performance, because it does away with
the need to make a network request for the RSS data source every time the
script is executed. Fetching the RSS file on each script run not only slows
things down (because of the time taken to fetch the file), but is also
inefficient: it’s unlikely that the source RSS file will change on a
minute-by-minute basis, and by fetching the same data over and over again,
you’re simply wasting bandwidth. A better solution is to retrieve the RSS data
source once, save it to a local file, and use that local file to generate your
page.
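Listing B is not reproduced in this text either. The approach described above could be sketched roughly as follows. The file names in @FEED_FILES are assumptions (local copies saved earlier by some download job), and file_to_html() is a name invented for the example; parsefile() is the XML::RSS method the article refers to.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

# Assumed names of local feed copies; adjust to match your own files.
my @FEED_FILES = qw(slashdot.rdf freshmeat.rdf);

# Escape characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Parse one local RSS file and return an HTML section for it.
# Returns an empty string if the file is unreadable or cannot be parsed.
sub file_to_html {
    my ($file) = @_;
    return '' unless -r $file;
    my $rss = XML::RSS->new;    # a fresh object for each feed
    return '' unless eval { $rss->parsefile($file); 1 };

    my $html = '<h2>' . escape_html($rss->channel('title')) . "</h2>\n<ul>\n";
    foreach my $item (@{ $rss->{items} }) {
        $html .= sprintf qq{<li><a href="%s">%s</a></li>\n},
            escape_html($item->{link}), escape_html($item->{title});
    }
    return $html . "</ul>\n";
}

# Run as a CGI script: one section per feed file, no network access.
sub main {
    print "Content-Type: text/html\n\n<html><body>\n";
    print file_to_html($_) for @FEED_FILES;
    print "</body></html>\n";
}

main() unless caller;
```

Skipping a feed that is missing or malformed, rather than aborting, means one bad file does not take the whole page down. That is a design choice of this sketch.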
Depending on how often the source file gets updated, you can
write a simple shell script to download a fresh copy of the file on a regular
basis.
Here’s an example of such a script:
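#!/bin/bash
# Download the feed and save it as freshmeat.rdf in the current directory.
# Adjust the path to wget if it lives elsewhere on your system (for example, /usr/bin/wget).
/bin/wget http://www.freshmeat.net/backend/fm.rdf -O freshmeat.rdf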
This script uses the wget
utility (included with most Linux distributions) to download and save the RSS
file to disk. Add this to your system crontab,
and set it to run on an hourly or daily basis.
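If you would rather keep the download step in Perl, a rough equivalent can be built on the mirror() function from LWP::Simple, which only replaces the local file when the remote copy is newer. This is a sketch under assumptions, not part of the original article: refresh_feed() is a made-up name, and the script expects the URL and the target file name as command-line arguments.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(mirror);
use HTTP::Status qw(is_success);

# Hypothetical helper: bring the local copy of a feed up to date.
# Returns 1 if the local file is current afterwards, 0 on failure.
sub refresh_feed {
    my ($url, $file) = @_;
    my $status = mirror($url, $file);    # returns the HTTP status code
    # 304 (Not Modified) means the existing local copy is still current.
    return (is_success($status) || $status == 304) ? 1 : 0;
}

# Usage: refresh.pl <feed URL> <local file>
if (@ARGV == 2) {
    refresh_feed(@ARGV) or die "Could not refresh $ARGV[1] from $ARGV[0]\n";
}
```

It can be called from cron in the same way as the shell script, for example with http://www.freshmeat.net/backend/fm.rdf and freshmeat.rdf as its two arguments.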
If you find performance unacceptably low even after using
local copies of RSS files, you can take things a step further by generating a
static HTML snapshot from the Perl script and sending that to clients
instead. To do this, comment out the line that prints the “Content-Type”
header in the Perl script, then run the script from the console,
redirecting its output to an HTML file. Here’s how:
$ ./rss.cgi > static.html
Now, simply serve this HTML file to your users. Since the
file is a static file and not a script, no server-side processing takes place
before the server transmits it to the client. You can run the command
above from your crontab to regenerate
the HTML file on a regular basis. Performance with a static file should be
noticeably better than with a Perl script.
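One thing to be aware of: a plain shell redirect empties the target file before the script has finished writing, so a visitor arriving at the wrong moment could be served a partial page. The sketch below is one possible way around that, offered as an assumption-laden example and not as the article’s method. It reads the CGI output on standard input, drops the header block (so the “Content-Type” line need not be commented out), writes the page to a temporary file, and renames it into place. The names strip_cgi_header(), write_snapshot() and snapshot.pl are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Temp qw(tempfile);

# Remove a leading block of CGI header lines (such as Content-Type)
# and the blank line that follows it. Other input is returned unchanged.
sub strip_cgi_header {
    my ($page) = @_;
    $page =~ s/\A(?:[\w-]+:[^\n]*\r?\n)+\r?\n//;
    return $page;
}

# Write $html to $target through a temporary file in the same directory,
# then rename it, so readers see either the old page or the new one.
sub write_snapshot {
    my ($target, $html) = @_;
    my ($fh, $tmp) = tempfile('snapshotXXXXXX', DIR => dirname($target));
    print {$fh} $html or die "write failed: $!";
    close $fh or die "close failed: $!";
    chmod 0644, $tmp;    # tempfile() creates the file readable by its owner only
    rename($tmp, $target) or die "rename failed: $!";
    return $target;
}

# Usage: ./rss.cgi | perl snapshot.pl static.html
if (@ARGV == 1) {
    my $page = do { local $/; <STDIN> };
    write_snapshot($ARGV[0], strip_cgi_header(defined $page ? $page : ''));
}
```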
Looks easy, doesn’t it? So what are you waiting for? Get out there and start
hooking your site up to your favorite RSS news feeds.