
What's wrong with RSS is also what's right with it

The popular Web syndication's brand of flexibility promises to make life difficult for all those attempting to bring order to the natural chaos that defines the Internet.


By David Berlind
Tech Update
COMMENTARY: The variety of Web syndication techniques is one of those proverbial situations where the greatest thing about standards is that there are so many of them. It would require thesis-level research to make real sense of RSS 1.0, RSS 2.0 (which is the successor not to RSS 1.0 but to version 0.94), Atom, and the gaggle of tangentially connected Internet syndication technologies. If there are problems at the specification level and they can't get worked out (my sense is that some conflicts have been overblown by the press), can we expect it to get any easier in the trenches? This story is about us users just trying to get something to work.

One problem that I've run into with RSS 2.0 is the way publishers often publish their RSS-based feeds using different conventions. Although this flexibility is one of RSS 2.0's greatest benefits, the burden of normalizing multiple RSS feeds for aggregation and presentation shifts to the consumption side. Whereas end-user applications (including RSS aggregators) often include intelligence to make up for formatting anomalies in the documents they deal with (which means the outlook is rather good for RSS-consuming applications), I wonder whether RSS might also prove just how difficult it will be for vendors to deliver on the development-for-mortals promise — the one where technical neophytes will be able to build complex, transactional, server-side applications with a point, click and drag.

RSS is, after all, today's poster child of what XML can do for the masses. It's also the closest that most people have come to working with XML. Given RSS' momentum, it could very well turn into the primary method by which all data (structured or unstructured) gets pumped — regardless of whether the application is just to stay abreast of Weblogs, to retrieve e-mail (boy, wouldn't that put an end to spam?), or to pass transactional data through a complicated workflow. As such, RSS is also the prime candidate to be a proof-point for point-and-click programming.

In an effort to better collaborate over ZDNet's presence in the blogosphere, we've established an internal Wiki. Although it will probably become so much more, our Wiki's home page is currently more like a shared bookmark repository. I did establish one secondary page that demonstrates the power of the multi-user system: a shared view of the Weblogs (blogs) that my workgroup must follow. I used TWiki's headlines plug-in to aggregate RSS-based syndication feeds onto one page that's basically a portal into the corner of the blogosphere that I think we should be watching. I call it our radar.

Shortly after I added the feeds from Robert Scoble, Jonathan Schwartz, Tim Bray, Bob Frankston, Slashdot, Groklaw, and others, ZDNet.com Vice President Stephen Howard-Sarin enhanced the page with the feeds he watches from folks like Dan Bricklin, Jon Udell, Dan Gillmor, Phil Windley, and Doc Searls. Although this wasn't point-and-click server-side programming, it was darn close.

Out of the box, the plug-in's default format for picking up a feed and displaying it on a Web page didn't appeal to me. So, for all our feeds, I supplied the plug-in with some optional parameters to ensure that the final page was text only (for performance) and that it only picked up the last five headlines from each feed. Graphically selecting such parameters, automatically generating the resulting code, and dropping that code into a Web page is precisely the sort of exercise that I imagine myself doing with point-and-click programming rather than hard-coding (my current approach). In a spate of Wiki political correctness, Howard-Sarin followed my lead on the formatting when he added his feeds. Short of something easier to use, he just cut and pasted from code and supplied the necessary substitutions. Not bad for a bunch of non-programmers, eh? In a few hours, a two-person collaboration produced a portal with meaningful information that gets updated every time the page is refreshed. Now, we were just waiting for a few other ZDNetters to drop their non-overlapping favorite feeds onto the page.

But there was a problem. Some of the feeds weren't displaying correctly, and I ended up spending way more time on the project than I should have. For example, Jonathan Schwartz's feed does something that the others do not: for each unique blog entry (known as an item) in his XML-formatted feed, Schwartz omits the link field. Most feeds, such as this one from ZDNet, use the link field to store the URI that connects directly to an individual item (in this case a news story instead of a blog entry). In the absence of the link field, Schwartz relies on the GUID (Globally Unique Identifier, pronounced "gwid") field with an option (known as "permalink," the isPermaLink attribute in RSS 2.0) to store a permanent link to each of his blog entries. This makes sense because, across the entire Internet, the dedicated link to a specific item of content is about as globally unique as you can get. You could come up with something else, but why bother? For this reason, just about everybody stores the direct links to their items in the GUID. For many, this means the data found in the GUID is the same as what can be found in the link field (if they're using it).

Why does any of this matter? Well, it mattered to me as I tried to come up with an easily re-usable collection of plug-in parameters (in the point-and-click world, this would be an "object" but I'll believe it when I see it) for building our portal page — a collection that would be applied identically to each of the feeds that our page watches. If an object only works some of the time in the point-and-click-programming-for-mortals world, it won't be long before mortals throw in the towel.

The TWiki documentation, with its example usage of the link field, is what started me down the path of using it to present portal users with a link back to the original item. Makes sense, right? However, when the link field is missing, as it is in Schwartz's blog, all I get on my portal page is a block of dead links. It was the investigation of this issue that took me into the underbelly of RSS, a place that I can't imagine other users having to go simply to tap into the amazing power of Web syndication. I learned that if Schwartz's blog was storing a dedicated URI to the item in the GUID (which it was), I could rely on the contents of the GUID to point users back to the original content. In the name of re-usability, I considered doing this for all the feeds we were watching.

A passage in the RSS 2.0 specification, which explains how links and GUIDs are not always the same thing, confirmed my inclinations:

"A frequently asked question about <guid>s is how do they compare to <link>s. Aren't they the same thing? Yes, in some content systems, and no in others. In some systems, <link> is a permalink to a weblog item. However, in other systems, each <item> is a synopsis of a longer article, <link> points to the article, and <guid> is the permalink to the weblog entry. In all cases, it's recommended that you provide the guid, and if possible make it a permalink. This enables aggregators to not repeat items, even if there have been editing changes."

This best practice recommendation, which is no doubt the way to go, also alludes to the fact that a bunch of feeds follow different conventions. This is precisely the sort of problem I've run into. Already re-usability is out the window, and I haven't even tried to pick up the mouse yet for a session of point-and-click programming. For each feed I add to the portal, I now study its XML before deciding on what set of parameters to apply.
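
To make the link-versus-GUID decision concrete, here is a minimal Python sketch of the fallback I ended up applying by hand: use the link field when it exists, otherwise use the GUID if it is marked (or left to default) as a permalink. This is not how the TWiki plug-in works; the function name item_url and the sample feed with its example.com URLs are hypothetical, made up purely for illustration.

```python
import xml.etree.ElementTree as ET


def item_url(item):
    """Return the best URL for an RSS 2.0 <item>, or None if there is none."""
    # Prefer <link> when the publisher supplies it.
    link = (item.findtext("link") or "").strip()
    if link:
        return link
    # Otherwise fall back to <guid>, but only when it is a permalink.
    guid = item.find("guid")
    if guid is not None and guid.text:
        # In RSS 2.0, isPermaLink is treated as "true" when omitted.
        if guid.get("isPermaLink", "true").lower() == "true":
            return guid.text.strip()
    return None


# Hypothetical feed: one item with a link, one with only a permalink GUID,
# and one whose GUID is an opaque identifier rather than a URL.
SAMPLE = """<rss version="2.0"><channel>
<item><title>With link</title><link>http://example.com/a</link>
      <guid>http://example.com/a</guid></item>
<item><title>Guid only</title>
      <guid isPermaLink="true">http://example.com/b</guid></item>
<item><title>Opaque guid</title>
      <guid isPermaLink="false">tag:example.com,2004:c</guid></item>
</channel></rss>"""

urls = [item_url(item) for item in ET.fromstring(SAMPLE).iter("item")]
print(urls)
```

The third item comes back as None, which is the honest answer: a GUID that is not a permalink gives the portal nothing to link to.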

But the GUID vs. link problem isn't our only challenge.

Some feeds, like the one from Dave Winer's Scripting News, have also thrown our portal a curveball. Winer doesn't title his items. This is a problem because in creating an easy-to-scan portal of 20 or more feeds, we've decided that the easiest thing to do is to show just the item titles and then link the titles back to the full text (using the item links or GUIDs as described above, whichever is more appropriate). In Winer's XML, however, the pickings are slim. With no title to select, there are only three other choices: the GUID, the item's pubDate (publication date), and the full text of the item (the description) itself. But with the full text of the item running anywhere from a few words to a few paragraphs, it doesn't make sense to use it as a hyperlink.

As with Schwartz's blog, Winer's GUIDs are URIs to the full text, which means the only thing left for us to use as linkable text was the pubDate. The folks at Mozilla.org apparently feel the same way about titleless items. Firefox, which uses a feature called Live Bookmarks to track RSS feeds, also keys off of pubDate when generating its menu of clickable links for feeds without titles. In fact, so good is Firefox at handling the non-uniform usage of RSS that it deftly handles channels like John Robb's, which, just within one blog, applies titles to some entries and not to others. After adding Robb's feed as a Live Bookmark to Firefox, the resulting menus display whatever Firefox can find for each item: pubDates for some, titles for others. This is evidence of how, with Web syndication, the consumption side is bearing the burden of the choices being made on the supply side. In other words, control is shifting away from vendors to the content publishers. Note that this phenomenon somewhat reverses the direction that the Web took. (Given the popularity of Internet Explorer and how many sites don't work in Firefox, hindsight demonstrates how the supply side adjusted to the consumption side.)
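
The per-item behavior described here (a title where one exists, a pubDate where it doesn't) is easy to sketch in Python. This is only my guess at the general idea, not Firefox's actual code; the function name item_label, the "(untitled)" placeholder, and the sample items are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def item_label(item):
    """Pick the text to show as the clickable label for an RSS <item>."""
    # Try the title first, then fall back to the publication date.
    for tag in ("title", "pubDate"):
        text = (item.findtext(tag) or "").strip()
        if text:
            return text
    # Assumed placeholder for items that carry neither field.
    return "(untitled)"


# Hypothetical channel that titles some entries and not others.
SAMPLE = """<rss version="2.0"><channel>
<item><title>A titled entry</title>
      <pubDate>Tue, 12 Oct 2004 16:15:00 GMT</pubDate></item>
<item><pubDate>Tue, 12 Oct 2004 15:02:00 GMT</pubDate>
      <description>No title here.</description></item>
<item><description>Neither title nor date.</description></item>
</channel></rss>"""

labels = [item_label(item) for item in ET.fromstring(SAMPLE).iter("item")]
print(labels)
```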

Likewise, as a testimony to how users are typically sheltered from some fairly complex decisions and algorithms that software is making on their behalf, Dave Winer's Web-based feed aggregator also does an admirable job of normalizing a variety of feed conventions into a single interface that intermingles entries from different channels in reverse chronological order according to when the aggregator polled the channel. In other words, five entries may show up from one channel at 12:15, but the oldest of them may not be newer than a previously listed entry from another channel. Neither Firefox nor Winer's aggregator, however, presents pubDate according to the end-user's time zone. For Web-based aggregators, I'm not sure if it's possible to pick up the end-user's time zone. But it is for locally run RSS aggregators like Firefox and NewsGator. See the room to grow?
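
For a locally run aggregator, the time zone adjustment is a small step. RSS 2.0 pubDates use the RFC 822 date format, which Python's standard library can parse. The sketch below is an illustration under assumptions, not anyone's shipping code: the function name pubdate_in_zone is hypothetical, and treating a date with no zone information as UTC is my own choice.

```python
from datetime import timedelta, timezone
from email.utils import parsedate_to_datetime


def pubdate_in_zone(pub_date, tz=None):
    """Convert an RFC 822 pubDate string to a datetime in the given zone.

    With tz=None the result is in the local time zone of the machine
    running the aggregator.
    """
    dt = parsedate_to_datetime(pub_date)
    if dt.tzinfo is None:
        # Assumption: a date that carries no usable zone is taken as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(tz)


# Example: show a GMT pubDate to a reader four hours behind GMT.
reader_zone = timezone(timedelta(hours=-4))
shown = pubdate_in_zone("Tue, 12 Oct 2004 16:15:00 GMT", reader_zone)
print(shown.strftime("%Y-%m-%d %H:%M"))
```

Calling pubdate_in_zone with only the date string uses whatever zone the reader's own machine is set to, which is exactly the information a Web-based aggregator may not have.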

Initially, I cursed Winer for not including titles. But once you start following Winer's Weblog, you realize the economics of his choice. His blog is just a stream of consciousness. When you're thinking about something (anything), do you give it a title first? Neither does Winer — nor should he have to. What makes his and other blogs great, and what distinguishes them from news feeds that come with titles (their headlines), is that the blogs are like diaries. They're full of frequent entries that might not be up-to-date with the writers' minds if the writers had to stop to put a title on everything. There are exceptions. Microsoft's Robert Scoble, another of the Web's more popular bloggers, manages to put titles on all of his entries regardless of how short they are. For example, on a recent entry regarding something Microsoft CEO Steve Ballmer said about Apple's iPod, the title is almost as long as the full-text of the entry. Perhaps if he titled fewer entries, or none at all, we'd get more out of Scoble's brain.

In the name of building a re-usable set of parameters (so that others could just cut and paste it), the more I stared at Winer's feed, the more I struggled between the two choices: using the pubDate as my linkable text in our TWiki-based portal, or just downloading the full text (stored in each item's description) from his entries and showing that in our portal. After all, with our feed depth only covering the last five entries of any given feed, and with Winer routinely publishing more than five entries a day, a list of pubDates doesn't tell us anything that we don't already know other than the exact time each entry was published. What we really needed was some indication of what's in the full text itself. In the case of titleless feeds like Winer's, taking the full text was pretty much our only choice (given the plug-in's capabilities).

In fact, taking everything (GUID, description, pubDate, and anything else a feed may have to offer) in one trip started to look like the best universal approach for our portal. There, it was settled. Finally, I could get back to my day job. Well, not really.

As Winer pointed out to me, that approach might not work either because, unlike many bloggers, news publishers will often provide summaries in the description field instead of the full text of their content. To make matters worse, after grabbing the descriptions as well, I discovered that TWiki's headlines plug-in couldn't handle the HTML in which they're invariably written.
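
One way a consumer could cope with HTML-laden descriptions is to reduce them to a short run of plain text before display. The Python sketch below shows that idea using only the standard library. It is not something the TWiki plug-in offers, as far as I know; the name plain_snippet and the 80-character limit are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect the text content of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_snippet(description_html, limit=80):
    """Strip tags from an item description and trim it to a short snippet."""
    parser = _TextOnly()
    parser.feed(description_html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace left behind by removed markup.
    text = " ".join("".join(parser.parts).split())
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."


print(plain_snippet('<p>A <b>short</b> entry &amp; a <a href="http://example.com/">link</a>.</p>'))
```

A snippet like this would also give titleless items something more informative than a pubDate to show, at the cost of one more decision made on the consumption side.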

This project is like the leaky dam in those old cartoons. Just when you think you have all the holes plugged, another one springs open. I found myself wishing that the resolution was only a point and click away, like the vendors keep promising. But, at the rate we're going, I suspect that it's years away.

Still, this remains a story about what is right with RSS — and why its brand of flexibility will make life difficult for all those (vendors, for example) attempting to bring order to the natural chaos that defines the Internet.

You can write to me at david.berlind@cnet.com. If you're looking for my commentaries on other IT topics, check my blog Between the Lines.
