Continuing with making it easier for “Big Pubs” to create RSS feeds. I’m assuming that they have a publishing system that wasn’t built with RSS in mind, but they want on the bandwagon anyway.
As a start, rather than retrofitting their existing publishing system, they can scrape their HTML. All they need are curl or wget, tidy, and an XSLT processor.
As an example, I’m going to build an RSS 0.91 feed for The Nation. They already have an RSS feed for their editors’ weblogs, but not one with the featured stories from their home page. (Creating an RSS feed for a right-wing publication is left as an exercise for the interested reader.)
What I want to pick off are the three top stories, pictured below.

Fortunately, The Nation’s designers use CSS, so those headlines are easy to pick off.
First, I need to grab the page’s source and, unfortunately, clean it up. So I use curl to fetch it, and run it through tidy to convert it to XHTML.
% curl http://www.thenation.com/ | /usr/local/bin/tidy -asxml --indent yes \
    --doctype strict --output-encoding latin1 --force-output yes > nation.html
I’m using several of tidy’s options here. The most important are force-output, which makes tidy return something even when it finds errors, and doctype, which lets the XSLT processor resolve entity references in the next step.
With source XHTML created, I can write some XSLT. I’ve put the full style sheet up for you to view.
As mentioned above, The Nation’s designers use CSS and class attributes in the source, so that gives XPath something to hook on to. (Oh, this brings back memories of working on client code at 2Roam.)
The three headlines are wrapped in div elements with the class ‘tn5’ or ‘tnhphed’. So I’ll grab them with the XPath expression:
//html:div[@class = 'tn5' or @class = 'tnhphed']
The link to the story is a child of that div, and the short description is a following-sibling of the div. The template to write out an RSS 0.91 item element looks like:
<xsl:template match="html:div">
  <item>
    <title>
      <xsl:value-of select="normalize-space(html:a)"/>
    </title>
    <link>
      <xsl:value-of select="concat($BASE,html:a/@href)" />
    </link>
    <description>
      <xsl:value-of
        select="normalize-space(following-sibling::html:div[@class = 'tntext'])"/>
    </description>
  </item>
</xsl:template>
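To see how that template might fit into a complete style sheet, here is a rough, self-contained sketch. It is not the full style sheet mentioned above. The channel title and description text, the default value of $BASE, the file names, and the cut-down sample page are all assumptions made so the example can run on its own with xsltproc, a command-line XSLT processor.

```shell
#!/bin/sh
# Sketch only: sample.html is a made-up stand-in for tidy's output, and
# nation.xsl is an assumed skeleton wrapped around the item template.
work=$(mktemp -d)

# A tiny XHTML page that imitates the class names described in the text.
cat > "$work/sample.html" <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body>
    <div class="tnhphed"><a href="/doc/one">First   headline</a></div>
    <div class="tntext">Short description one.</div>
    <div class="tn5"><a href="/doc/two">Second headline</a></div>
    <div class="tntext">Short description two.</div>
    <div class="other"><a href="/about">Not a story</a></div>
  </body>
</html>
EOF

# The html prefix must be bound to the XHTML namespace, or the XPath
# expressions will match nothing in tidy's output.
cat > "$work/nation.xsl" <<'EOF'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:html="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="html">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Assumed default; story links on the page are taken to be site-relative. -->
  <xsl:param name="BASE" select="'http://www.thenation.com'"/>

  <!-- Channel metadata below is placeholder text, not taken from the site. -->
  <xsl:template match="/">
    <rss version="0.91">
      <channel>
        <title>The Nation</title>
        <link><xsl:value-of select="$BASE"/></link>
        <description>Featured stories (placeholder)</description>
        <language>en-us</language>
        <xsl:apply-templates
            select="//html:div[@class = 'tn5' or @class = 'tnhphed']"/>
      </channel>
    </rss>
  </xsl:template>

  <xsl:template match="html:div">
    <item>
      <title><xsl:value-of select="normalize-space(html:a)"/></title>
      <link><xsl:value-of select="concat($BASE, html:a/@href)"/></link>
      <description>
        <xsl:value-of
            select="normalize-space(following-sibling::html:div[@class = 'tntext'])"/>
      </description>
    </item>
  </xsl:template>
</xsl:stylesheet>
EOF

# Transform the sample page into a feed.
xsltproc "$work/nation.xsl" "$work/sample.html" > "$work/feed.xml"
cat "$work/feed.xml"
```

Run against the sample page, this should produce one item for each of the two headline divs and skip the div with the unrelated class. The base URL can be overridden without editing the style sheet, for example with xsltproc --stringparam BASE http://example.org. A real feed may need more channel elements than this placeholder set before it validates.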
With the style sheet and a command-line processor such as xsltproc, it’s easy enough for The Nation’s webmaster to write a crontab entry that fetches the front page, tidies it, and transforms the result into RSS that they can drop onto their webserver.
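A minimal sketch of such a job might look like the script below. The script name, the style sheet and output paths, the half-hour schedule, and the extra xmllint well-formedness check are all assumptions for illustration. The curl and tidy options are the ones used earlier, plus quiet flags so cron doesn’t mail the warnings.

```shell
#!/bin/sh
# make-feed.sh: hypothetical wrapper around the fetch / tidy / transform steps.
# All paths are assumed defaults and can be overridden from the environment.
#
# Example crontab entry (every 30 minutes; the schedule is an assumption):
#   */30 * * * * /usr/local/bin/make-feed.sh run

URL=${URL:-http://www.thenation.com/}
XSL=${XSL:-/usr/local/share/feeds/nation.xsl}
OUT=${OUT:-/var/www/html/nation.rss}

build_feed() {
    html=$(mktemp) || return 1

    # Fetch the front page and force it into XHTML. tidy exits non-zero
    # on mere warnings, so its exit status is deliberately ignored here.
    curl -s "$URL" |
        tidy -q -asxml --indent yes --doctype strict \
             --output-encoding latin1 --force-output yes \
             > "$html" 2>/dev/null

    # Write next to the live feed, check that the result is well-formed,
    # and only then rename it into place, so readers never see a broken
    # or half-written file.
    xsltproc "$XSL" "$html" > "$OUT.new" &&
        xmllint --noout "$OUT.new" &&
        mv "$OUT.new" "$OUT"
    status=$?

    rm -f "$html" "$OUT.new"
    return $status
}

# Only do the work when asked, so the file can be sourced and inspected
# without fetching anything.
if [ "${1:-}" = "run" ]; then
    build_feed
fi
```

If any step fails, the previous feed stays in place, which seems like the safer failure mode for a scraper that will break at the next redesign.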
This, however, is a stopgap, because as soon as someone redesigns the front page, the style sheet will break. But because we forced the input to XHTML using tidy, and generated the result using XSLT, the output is at least well-formed. The style sheet writer still has to make sure that the feed validates, but the things that kill XML parsers dead in their tracks are taken care of.