RSS fixes #242

Merged
merged 8 commits into from Mar 18, 2019

jameysharp
Contributor

Here is an assortment of fixes for the news-rss.xml RSS feed. Most of these commits just fix issues reported by http://www.feedvalidator.org, although there are a couple of notable side effects.

  1. Since feed items did not include a <guid> before, adding one causes all items to appear unread. This is unfortunate but avoids bigger potential problems later.

  2. Since the date format in news.xml was not consistent with the RSS specification (which requires RFC-822 format), the HTML version of the news changed from "month day year" order to "day month year". This is fixable in a couple of different ways if it matters to you.

As long as I was doing all that, I also took care of my personal crusade, which is RFC5005 "Feed Paging and Archiving" support. Hardly anybody supports it, but it's really easy to do if you're sticking your complete history in the feed anyway, so I figured I might as well.

The current news-rss.xml fails http://www.feedvalidator.org for a couple
of reasons, making feed readers like Liferea misparse it. This patch
fixes one of those issues.

As @edolstra noted in #90, HTML is not allowed directly in the RSS
`<description>` tag. Instead it's supposed to be a text node, where the
text can be parsed as HTML. So e.g. an `<a>` tag needs to be written in
the feed as `&lt;a&gt;`.

This could be handled more easily in an Atom feed, which allows tags
from namespaces like XHTML to be embedded directly as long as the
namespaces are declared properly.

This could also be simpler if xsltproc supported the XPath Functions 3.0
standard, which defines a `serialize` function that does what the
mode="serialize" template does in this patch.
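
To illustrate the escaping rule described above, here is a minimal Python sketch. It is not the XSLT from this patch; the `description_text` helper and the sample markup are hypothetical and only show what an escaped `<description>` payload looks like.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def description_text(html_fragment: str) -> str:
    """Escape an HTML fragment so it can sit inside <description> as plain text."""
    # escape() rewrites &, < and > as entities; quotes are left alone.
    return escape(html_fragment)


# Hypothetical news snippet, not taken from news.xml.
sample = '<a href="https://example.org/">release notes</a> & more'
escaped = description_text(sample)
print(escaped)

# Round trip: an XML parser hands the original HTML string back to the feed
# reader as a single text node, with no child elements.
parsed = ET.fromstring("<description>" + escaped + "</description>")
assert parsed.text == sample
assert len(parsed) == 0
```

The feed reader then parses that recovered string as HTML, which is the behavior the RSS `<description>` element expects.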

The current news-rss.xml fails http://www.feedvalidator.org for a couple
of reasons, making feed readers like Liferea misparse it. This patch
fixes one of those issues.

The RSS pubDate element is required to be an RFC-822 date-time. The
entries in news.xml did not conform to that specification for two
reasons: a missing comma, and swapping the month and day fields.

So I fixed both of those issues in news.xml, and then fixed news.xsl to
extract the date substring at its new offset.

This does mean that the HTML version of the news has its month and day
swapped now, and frankly I liked it better in the previous month-day
order.

If desired, the previous output can be recovered either by continuing
with the substring-pasting approach but chopping up the substrings
further, or by changing news.xml to use ISO 8601 date-times and using
the http://exslt.org/date/ extension functions.
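
As a rough illustration of the date format in question, the following Python sketch produces an RFC-822 style date-time (weekday, comma, then day before month) and derives a month-day-year display string from the same value. The timestamp is hypothetical, and this uses Python's standard library rather than the XSLT substring approach in this patch.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Hypothetical timestamp; actual news.xml entries are not reproduced here.
published = datetime(2019, 3, 18, 12, 0, 0, tzinfo=timezone.utc)

# RFC-822 style: "Www, DD Mon YYYY HH:MM:SS GMT" (comma after the weekday,
# day of month before the month name).
pub_date = format_datetime(published, usegmt=True)
print(pub_date)

# The string parses back to the same instant, so nothing is lost.
assert parsedate_to_datetime(pub_date) == published

# An HTML page could still render month-day-year from the same value.
html_date = f"{published:%B} {published.day}, {published.year}"
print(html_date)
```

Storing one machine-readable value and formatting it separately for each output is the same idea as the ISO 8601 option mentioned above.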

The channel's <link> tag should point to an HTML version of the same
content as the feed, so it's better to link to /news.html than to /.

In addition, the RSS specification says:

"the image <title> and <link> should have the same value as the
channel's <title> and <link>."

http://www.rssboard.org/rss-specification#ltimagegtSubelementOfLtchannelgt

Finally, feedvalidator.org says the feed document should include a link
to the canonical URL for that feed:

http://www.feedvalidator.org/docs/warning/MissingAtomSelfLink.html
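
The three channel-level points above can be sketched in Python as follows. The URLs, title, and description are placeholders (assumptions, not the site's real values), and the feed in this patch is generated by XSLT rather than by code like this.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("atom", ATOM_NS)

# Hypothetical values; substitute the site's real URLs and title.
NEWS_PAGE = "https://example.org/news.html"
FEED_URL = "https://example.org/news-rss.xml"
TITLE = "Example News"

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = TITLE
# Link to the HTML rendering of the same content, not the site root.
ET.SubElement(channel, "link").text = NEWS_PAGE
ET.SubElement(channel, "description").text = "News announcements"

# Self link giving the canonical URL of the feed document itself.
ET.SubElement(channel, "{%s}link" % ATOM_NS,
              rel="self", href=FEED_URL, type="application/rss+xml")

# The image's <title> and <link> mirror the channel's values.
image = ET.SubElement(channel, "image")
ET.SubElement(image, "url").text = "https://example.org/logo.png"
ET.SubElement(image, "title").text = TITLE
ET.SubElement(image, "link").text = NEWS_PAGE

xml_text = ET.tostring(rss, encoding="unicode")
print(xml_text)
```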

The item's <link> tag should point to an HTML version of the same
content as the item. However, there is no URL for each item in the
current site, so there's no good place to link to. Fortunately, the link
tag is not required when a description is present, so we can just drop
that tag entirely.

Also, each item should have some sort of unique ID in the <guid> tag.
If a feed reader sees an item that's different from any item it has seen
before, then the guid allows the reader to distinguish between new items
and items that have been edited.

http://www.feedvalidator.org/docs/warning/MissingGuid.html

Since it has almost never happened that two items were posted with the
same pubDate, I've chosen to use the pubDate itself as the guid. There
were three duplicates, but I've made them unique by adding one second to
the pubDates of the three duplicates that appear earlier in the file, so
that sorting by descending pubDate would leave the file order unchanged.

First, I removed the maxItem limit from the RSS feed. (It's still used
for the HTML version.) The maxItem limit of 1000 was effectively
infinite at the rate news items are being published, and each item is
quite small so there isn't much bandwidth cost to publishing a lot of
them anyway.

Second, since the feed does contain every entry ever posted, it is a
valid "complete feed" in the sense of RFC5005 section 2, so add the tag
marking it as such.

If the RSS feed ever gets too large, then RFC5005 section 4 describes
how to paginate it into archived feeds.
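
For readers unfamiliar with RFC5005, the "complete feed" marker is a single empty element in the feed-history namespace. Here is a minimal Python sketch of adding and detecting it; the channel title and the `is_complete` helper are hypothetical, and the real feed adds the tag through its XSLT template.

```python
import xml.etree.ElementTree as ET

# Namespace defined by RFC 5005 for feed-history elements.
FH_NS = "http://purl.org/syndication/history/1.0"
ET.register_namespace("fh", FH_NS)

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Example News"  # hypothetical title

# An empty <fh:complete/> child of <channel> marks the feed as complete:
# the document contains every entry, so a reader may drop entries it has
# cached that no longer appear.
ET.SubElement(channel, "{%s}complete" % FH_NS)


def is_complete(channel_element):
    """Report whether a channel carries the RFC 5005 complete-feed marker."""
    return channel_element.find("{%s}complete" % FH_NS) is not None


xml_text = ET.tostring(rss, encoding="unicode")
print(xml_text)
assert is_complete(channel)
```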

@samueldr
Member

samueldr commented Oct 10, 2018

Hi! Thanks for the contribution!

This repository is much less frequented and has far fewer people with approval and merge rights; don't be alarmed (yet) by the lack of traffic around this PR.


👍 Eyeballed each commit; good separation, easy to understand. I also verified using `nix-shell --run "make && python2 -m SimpleHTTPServer 8000"` that everything built as expected, and that the escapes in the feed are as expected.

This'll need a merge/rebase on top of the last news update (sorry), but since your commits are so well-behaved I bet it's going to be easy!

Then, once updated, LGTM 🎉!

@grahamc grahamc merged commit c2b7827 into NixOS:master Mar 18, 2019