Meet diverse needs by using RSS to aggregate content

Web developers are often asked to create sites that cover every possible audience need and interest. That’s why content aggregator functionality, such as RSS, has become so popular in the Web dev community—and why it should be part of your bag of tricks.

RSS is an XML format used to supply selective, summarized Web content to content aggregator clients. More precisely, RSS 1.0 (RDF Site Summary) is a “lightweight, multipurpose, extensible, metadata description and syndication format” that conforms to W3C’s RDF specification.

Many versions of RSS are available, such as 0.91 from Netscape and the latest, 2.0, from UserLand, and each has a few unique features suited to a certain kind of content. You can implement whichever version fits your requirements. They’re equally popular, and most RSS tools and aggregators work with all of them. To be on the safe side, you can restrict yourself to the subset of RSS elements common to all versions, which keeps your feed compatible with each of them. This article is based on the widely used version 1.0 of RSS, but the information offered here applies to the other versions as well, because the basics of all versions are the same.

A peek inside RSS
The RSS specification describes a simple set of XML-style elements that can be used to create a summary of a Web site’s content. This summary may consist of a Web site logo, a site link, an input box, and multiple “news items.” This summary or a collection of summaries from a Web site is known as an RSS feed. RSS feeds are published and syndicated by content providers’ sites and consumed by content aggregator Web sites, also called portals, or by stand-alone desktop tools.

RSS feeds can be generated manually by creating and posting an RSS file (e.g., latest_news.rss) to a Web site. Various tools and online services “scrape out,” or generate, RSS feeds automatically from the existing content of a Web site, which often proves useful when culling data from sites that offer dynamic content. For Web sites developed with Perl, the module XML::RSS can automate creation of an RSS feed. For ASP-based sites, a collection of tools is available at TNL Net. Xpath2rss is a tool for scraping Web sites using XPath expressions. Online scraping services like myRSS and Site Summaries in XHTML are also available.

Let’s look at the elements of RSS and consider some examples of their usage to see how to create a complete .rss file containing an RSS feed.

Generating an RSS feed
You can create an RSS file containing an RSS feed using any text or XML editor. An RSS file consists of a root element and the RSS elements it encloses, both described below.

Root elements
An RSS feed, being a valid XML document, may begin with an XML declaration, <?xml version="1.0"?>. This declaration is optional but recommended, because it states explicitly which version of XML the document uses.

To conform with RDF specifications, the remaining RSS elements that form the RSS feed must be enclosed in the root element, between the <rdf:RDF> and </rdf:RDF> tags. This root element associates the rdf namespace prefix with the RDF syntax schema and makes the RSS 1.0 schema the default namespace for the document. The code below shows the skeleton of an RSS file:
<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/">
...
</rdf:RDF>

RSS elements
An RSS feed usually consists of four major elements: <channel>, <image>, <item>, and <textinput>. The <channel> element is mandatory, as is at least one occurrence of the <item> element. The <textinput> and <image> elements are optional and are included only when needed.

<channel>
The <channel> element contains a brief description of the channel (the source of the RSS feed). It has an attribute rdf:about="resource_URL", where resource_URL is a unique URL pointing to either the homepage of the feed provider or the RSS feed itself. The <channel> element contains the following child elements, which are required unless otherwise specified:

<title> is the name/title of the channel.
<link> is the URL of the Web page containing complete content related to the channel’s content.
<description> is brief information about the content of <channel>.
<image> is an optional, empty element. It’s required only when a top-level <image> element is present in the feed. It has one attribute, rdf:resource="image_url", where image_url is the URL of the image associated with the channel (usually the channel logo).
<textinput> is an optional, empty element. It’s required only when a top-level <textinput> element is present in the feed. It has one attribute, rdf:resource="textinput_url", where textinput_url is the target URL of a user input form.
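<items> is a list of content items included in a feed. It has the following syntax:
<items>
   <rdf:Seq>
    <rdf:li resource="item_1_url" />
    <rdf:li resource="item_2_url" />
    ...
   </rdf:Seq>
</items>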

Here, item_n_url is the content source URL for an item. Each <rdf:li /> entry corresponds to one <item> element.

The <channel> element’s <title> and <link> can be rendered together as a hyperlinked headline, followed by a <description> element. The <channel> element serves as a table of contents for the RSS feed, with its children <image>, <items>, and <textinput> pointing to the location of the corresponding RSS elements, <image>, <item>, and <textinput>. You can find more information about these in the RSS specifications. The code below shows a populated <channel> element:
<channel rdf:about="http://www.xml.com/xml/news.rss">
<title>XML.com</title>
<link>http://xml.com/pub</link>
<description>
   XML.com features a rich mix of information and services
   for the XML community.
</description>
<image rdf:resource="http://xml.com/universal/images/xml_tiny.gif" />
<items>
   <rdf:Seq>
    <rdf:li resource="http://xml.com/pub/2000/08/09/xslt/xslt.html" />
    <rdf:li resource="http://xml.com/pub/2000/08/09/rdfdb/index.html" />
   </rdf:Seq>
</items>
<textinput rdf:resource="http://search.xml.com" />
</channel>
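
If you generate feeds from a script rather than by hand, the same <channel> can be assembled programmatically. The following is a minimal sketch in Python using the standard xml.etree.ElementTree module. The helper names rss, rdf, and build_channel are made up for this illustration, and the optional <image> and <textinput> children are left out to keep it short.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"

# Serialize RDF names with the "rdf" prefix and RSS 1.0 names with no prefix.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("", RSS_NS)


def rss(tag):
    """Qualified name of an element in the RSS 1.0 namespace (hypothetical helper)."""
    return f"{{{RSS_NS}}}{tag}"


def rdf(name):
    """Qualified name of an element or attribute in the RDF namespace."""
    return f"{{{RDF_NS}}}{name}"


def build_channel(about, title, link, description, item_urls):
    """Build a <channel> whose <items> list has one rdf:li per item URL."""
    channel = ET.Element(rss("channel"), {rdf("about"): about})
    ET.SubElement(channel, rss("title")).text = title
    ET.SubElement(channel, rss("link")).text = link
    ET.SubElement(channel, rss("description")).text = description
    seq = ET.SubElement(ET.SubElement(channel, rss("items")), rdf("Seq"))
    for url in item_urls:
        # Unprefixed "resource", matching the <rdf:li resource="..." /> form above.
        ET.SubElement(seq, rdf("li"), {"resource": url})
    return channel


root = ET.Element(rdf("RDF"))
root.append(build_channel(
    about="http://www.xml.com/xml/news.rss",
    title="XML.com",
    link="http://xml.com/pub",
    description="XML.com features a rich mix of information and services "
                "for the XML community.",
    item_urls=[
        "http://xml.com/pub/2000/08/09/xslt/xslt.html",
        "http://xml.com/pub/2000/08/09/rdfdb/index.html",
    ],
))
print(ET.tostring(root, encoding="unicode"))
```

Building the <rdf:Seq> list from the same collection of URLs that you later use for the <item> elements is one way to keep the table of contents and the items in step.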

<image>
The <image> element specifies the image associated with a channel, preferably of 88×31 pixel size. It has an attribute rdf:about, whose value is the same as the value of the attribute rdf:resource of <image> inside <channel>. The <image> element has the following child elements, required unless specified otherwise:

<title> is the image’s alternative text (rendered as the alt attribute of an HTML <img> tag).
<link> is the URL to which the rendered image links, usually the homepage of a channel provider.
<url> is the URL of the image file itself on the channel provider’s Web site.

<item>
The <item> element specifies an item, such as a news article headline, hyperlinked to complete content on a channel provider’s Web site and followed by a short description. This element forms the dynamic part of the RSS feed. A feed must contain at least one item; RSS 1.0 sets no upper limit, but a maximum of 15 items is recommended for backward compatibility with RSS 0.9 and 0.91. An <item> has one attribute, rdf:about, whose value is the same as the value of rdf:resource of the corresponding list entry of <items> inside <channel>. The <item> element has the following child elements, required unless specified otherwise:

<title> is the name/title of an item.
<link> is the URL of complete content related to an item. Its value should be identical to the value of the rdf:about attribute.
<description> is an optional, brief description of an item that appears after a hyperlinked item title. The maximum is one occurrence per item.
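
As a rough illustration of these rules, the Python sketch below builds a single <item>. The function name build_item and the sample headline and URL are invented for the example. The function sets rdf:about from the same value as <link>, so the two cannot drift apart, and it omits <description> when none is supplied.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("", RSS_NS)


def build_item(title, link, description=None):
    """Build one <item>; rdf:about is set to the same URL as <link>."""
    item = ET.Element(f"{{{RSS_NS}}}item", {f"{{{RDF_NS}}}about": link})
    ET.SubElement(item, f"{{{RSS_NS}}}title").text = title
    ET.SubElement(item, f"{{{RSS_NS}}}link").text = link
    if description is not None:  # <description> is optional
        ET.SubElement(item, f"{{{RSS_NS}}}description").text = description
    return item


item = build_item(
    title="Example headline",            # hypothetical sample data
    link="http://example.com/news/1",    # hypothetical URL
    description="A one-sentence summary of the article.",
)
print(ET.tostring(item, encoding="unicode"))
```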

<textinput>
The <textinput> element is used to render an HTML form field to submit user input. It has one attribute, rdf:about, whose value is the same as the value of the attribute rdf:resource of <textinput> inside <channel>. The <textinput> element has the following child elements, required unless specified otherwise:

<title> is the title of the input field, e.g., Submit or Search.
<description> is a brief description of the input field’s purpose, e.g., Submit your feedback.
<name> is the name of the input field.
<link> is the target URL to which input field submission is directed. Its value is the same as the value of rdf:about.
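
To make the mapping from <textinput> to an HTML form concrete, here is a small Python sketch of how a consumer might render one. The sample element values and the function name textinput_to_form are illustrative, and the use of a GET form is an assumption; a real aggregator may lay the form out differently.

```python
from html import escape
import xml.etree.ElementTree as ET

RSS_NS = "http://purl.org/rss/1.0/"

# Sample <textinput>; the values are illustrative.
TEXTINPUT_XML = """
<textinput xmlns="http://purl.org/rss/1.0/"
           xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           rdf:about="http://search.xml.com">
  <title>Search XML.com</title>
  <description>Search the XML.com archive</description>
  <name>s</name>
  <link>http://search.xml.com</link>
</textinput>
"""


def textinput_to_form(xml_text):
    """Render a <textinput> element as a small HTML form (GET is assumed)."""
    node = ET.fromstring(xml_text.strip())

    def child(tag):
        # Text of a child element in the RSS namespace, or "" if it is absent.
        return (node.findtext(f"{{{RSS_NS}}}{tag}") or "").strip()

    return (
        f'<form method="get" action="{escape(child("link"))}">\n'
        f'  <p>{escape(child("description"))}</p>\n'
        f'  <input type="text" name="{escape(child("name"))}">\n'
        f'  <input type="submit" value="{escape(child("title"))}">\n'
        f'</form>'
    )


print(textinput_to_form(TEXTINPUT_XML))
```

Here <link> becomes the form’s action, <name> the name of the text field, <title> the label of the submit button, and <description> the explanatory text.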

A useful tutorial covering major aspects of RSS is available at RSS Tutorial for Content Publishers and Webmasters.

Using these elements, an RSS feed can be created and saved in a .rss file. Listing A illustrates a complete RSS file, xmlcomfeed.rss. (This is an excerpt from RDF Site Summary 1.0).

Validating an RSS file
After you generate an RSS file, validate it to check for errors. Many RSS validators are available online to perform this task, such as Online RSS 0.9x Validator and Online RSS 1.0 Validator.
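
Before submitting a feed to an online validator, you can run a few sanity checks locally. The Python sketch below is not a substitute for a real validator; it only tests some of the rules described in this article (well-formed XML, a <channel> with its required children, at least one <item>, and agreement between the <items> list and the <item> elements). The function name check_feed and the sample feed are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.com/feed.rss">
    <title>Example</title>
    <link>http://example.com/</link>
    <description>Sample channel</description>
    <items><rdf:Seq><rdf:li resource="http://example.com/a"/></rdf:Seq></items>
  </channel>
  <item rdf:about="http://example.com/a">
    <title>A</title>
    <link>http://example.com/a</link>
  </item>
</rdf:RDF>"""


def check_feed(xml_text):
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    channel = root.find("rss:channel", NS)
    if channel is None:
        return ["missing <channel>"]
    problems = []
    for tag in ("title", "link", "description"):
        if channel.find(f"rss:{tag}", NS) is None:
            problems.append(f"<channel> is missing <{tag}>")

    items = root.findall("rss:item", NS)
    if not items:
        problems.append("at least one <item> is required")
    if len(items) > 15:
        problems.append("more than 15 items; some older aggregators may object")

    # Every rdf:li in <items> should point at an <item> with the same rdf:about.
    listed = set()
    for li in channel.findall("rss:items/rdf:Seq/rdf:li", NS):
        url = li.get("resource") or li.get(RDF_RESOURCE)
        if url:
            listed.add(url)
    present = {item.get(RDF_ABOUT) for item in items if item.get(RDF_ABOUT)}
    for url in sorted(listed - present):
        problems.append(f"no <item> for listed resource {url}")
    for url in sorted(present - listed):
        problems.append(f"<item> {url} is not listed in <items>")
    return problems


print(check_feed(SAMPLE) or "no problems found")
```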

Publishing an RSS feed
After generation and validation, you publish an RSS file by posting it on your Web site. Next, you have to advertise the availability of the RSS feed on the site and syndicate the feed to bring it to a larger audience.

To inform people about the availability of an RSS feed, you can include links like the following one on Web pages:
RSS feed for this page is <a type="application/rss+xml" href="URL_of_feed.rss">available here</a>

An alternative is to put a <link> tag inside the <head> element of an HTML page, as follows:
<html>
<head><title>Newsflash</title>
<link rel="alternate" type="application/rss+xml"
href="URL_of_feed.rss" title="RSS news feed">
</head>
...
</html>
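
The value of the <link> approach is that software can find the feed without a person clicking anything. As a sketch of how a client might do this, the Python example below scans a page for such tags with the standard html.parser module. The class name FeedLinkFinder is made up for the illustration, and the page is the one shown above.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect (title, href) pairs from <link rel="alternate"> RSS tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attributes without a value arrive as None; treat them as "".
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_rss = attr.get("type", "").lower() == "application/rss+xml"
        if "alternate" in rels and is_rss:
            self.feeds.append((attr.get("title", ""), attr.get("href", "").strip()))


PAGE = """<html>
<head><title>Newsflash</title>
<link rel="alternate" type="application/rss+xml"
href="URL_of_feed.rss" title="RSS news feed">
</head>
<body></body>
</html>"""

finder = FeedLinkFinder()
finder.feed(PAGE)
print(finder.feeds)
```

For the page above, finder.feeds holds the single pair ('RSS news feed', 'URL_of_feed.rss').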

Although it is a good idea to provide a “central” RSS feed linked to the homepage of a Web site, you can also provide separate RSS feeds for the various sections of the site that have dynamic content. A simple way to syndicate an RSS feed is to let anyone who is interested subscribe to it and aggregate your content. Other ways include registering the RSS feed with directories like Yahoo or submitting your RSS feed URL to content aggregator portals.

Consuming an RSS feed
Consuming an RSS feed means parsing the feed and converting its content into a displayable format. RSS feeds can be consumed both by content aggregator portals, such as My Yahoo, My UserLand, Meerkat, and Moreover, and by tools such as Headline Viewer, NetNewsWire, and Radio UserLand, for personal as well as commercial use. Plug-ins are also available for some e-mail clients, such as MS Outlook, that do desktop-based content aggregation for personal use.

In addition, you can write simple scripts to parse an RSS feed in any language with XML support, including Java, PHP, Perl, ASP, and C#. Listing B gives an example of such a program written in Java. (This excerpt is in part from O’Reilly XML.com.) This program also gives a general idea of how RSS feeds can be parsed in other languages.
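
The same idea can be sketched in a few lines of Python with the standard xml.etree.ElementTree module. This is an independent illustration, not a translation of Listing B. The function name parse_feed is invented, and the embedded feed reuses the channel from this article with a placeholder item title and description.

```python
import xml.etree.ElementTree as ET

NS = {"rss": "http://purl.org/rss/1.0/"}

FEED = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.xml.com/xml/news.rss">
    <title>XML.com</title>
    <link>http://xml.com/pub</link>
    <description>XML.com features a rich mix of information and services
    for the XML community.</description>
    <items>
      <rdf:Seq>
        <rdf:li resource="http://xml.com/pub/2000/08/09/xslt/xslt.html" />
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://xml.com/pub/2000/08/09/xslt/xslt.html">
    <title>Sample item title</title>
    <link>http://xml.com/pub/2000/08/09/xslt/xslt.html</link>
    <description>Sample item description.</description>
  </item>
</rdf:RDF>"""


def parse_feed(xml_text):
    """Return (channel_title, [item dicts]) for an RSS 1.0 document."""
    root = ET.fromstring(xml_text)
    channel_title = root.findtext("rss:channel/rss:title", default="", namespaces=NS)
    items = []
    for node in root.findall("rss:item", NS):
        items.append({
            "title": node.findtext("rss:title", default="", namespaces=NS),
            "link": node.findtext("rss:link", default="", namespaces=NS),
            # <description> is optional, so fall back to an empty string.
            "description": node.findtext("rss:description", default="", namespaces=NS),
        })
    return channel_title, items


title, items = parse_feed(FEED)
print(title)
for entry in items:
    print(f'- {entry["title"]} <{entry["link"]}>')
```

Because RSS 1.0 places <item> elements beside <channel> rather than inside it, the sketch looks for items directly under the root element.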

RSS at work
RSS is easy to understand and implement. With modularization and RDF compliance, RSS is further evolving to fulfill growing application needs, including aggregation, discussion threads, job listings, top-10 listings, multiple listings services, sports scores, and document cataloging.

mugdhavairagade