RSS, gzip and 304
[Jan. 27th, 2004 | 09:26 pm]
I currently run an aggregator of sorts (it's web-based, and only checks for updates - it doesn't parse the content). Until recently, I was just using Python's urllib, so I fetched the entire page each time. I'm currently changing over to the HTTP routines from Mark Pilgrim's Feed Parser, which are based on urllib2 and support 304 conditional GET (last-modified and etag headers) as well as gzip-compressed webpages (accept-encoding: gzip).
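For anyone wanting to try the same thing without Feed Parser, here is a minimal sketch of a conditional, gzip-capable fetch. It uses Python 3's urllib.request (the successor to urllib2) rather than Feed Parser's own code, and the helper names build_headers, decode_body and fetch are made up for illustration.

```python
import gzip
import urllib.error
import urllib.request


def build_headers(etag=None, last_modified=None):
    """Request headers for a conditional GET that also accepts gzip."""
    headers = {"Accept-Encoding": "gzip"}
    if etag:
        # Validator remembered from the previous response's ETag header.
        headers["If-None-Match"] = etag
    if last_modified:
        # Validator remembered from the previous Last-Modified header.
        headers["If-Modified-Since"] = last_modified
    return headers


def decode_body(raw, content_encoding):
    """Undo gzip content-encoding if the server applied it."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch(url, etag=None, last_modified=None):
    """Hypothetical helper: returns (changed, body, etag, last_modified)."""
    request = urllib.request.Request(
        url, headers=build_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = decode_body(
                response.read(), response.headers.get("Content-Encoding")
            )
            return (
                True,
                body,
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: nothing was downloaded, keep the old validators.
            return False, None, etag, last_modified
        raise
```

The caller stores the returned etag and last_modified values and passes them back in on the next check; a server that supports neither header simply returns the full page every time.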
Now, the main journal pages can be sent gzipped (content-encoding: gzip), but are not cacheable (no last-modified or etag header). RSS and Atom feeds are cacheable (last-modified header), but are not gzipped. Since RSS and Atom are both full feeds now, the size of the raw RSS/Atom and the HTML journal page is pretty much the same - except that the HTML can be gzipped, dramatically reducing the size (3 or 4 times smaller is quite common).
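The two properties compared above can be read straight off the response headers. A rough sketch, assuming the headers are available as a plain dict (describe_response is a hypothetical name, and the sample header values are invented):

```python
def describe_response(headers):
    """Report whether a response supports conditional GET and was gzipped."""
    # Header names are case-insensitive, so normalise them first.
    names = {key.lower(): value for key, value in headers.items()}
    return {
        # A validator (last-modified or etag) is what makes a 304 possible.
        "conditional_get": "last-modified" in names or "etag" in names,
        "gzipped": names.get("content-encoding", "").strip().lower() == "gzip",
    }


# Invented examples mirroring the two cases described in the post.
journal_page = {"Content-Encoding": "gzip", "Content-Type": "text/html"}
rss_feed = {"Last-Modified": "Tue, 27 Jan 2004 21:26:00 GMT"}

print(describe_response(journal_page))
print(describe_response(rss_feed))
```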
This leaves me with a dilemma - should I check the HTML page, since it's smaller, or RSS/Atom (btw, Atom is larger than RSS), since I can use last-modified and it will only be downloaded when a change has occurred? Currently, the answer is to check the RSS, unless the journal is so active that it's likely to have changed on more than about one in every 3 checks, in which case the savings from gzip mean it's better to check the main journal page.
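The break-even point can be worked out with a back-of-the-envelope calculation. This sketch assumes the raw feed and the raw HTML page are the same size, that a 304 response costs nothing (header overhead is ignored), and that gzip shrinks the HTML by a fixed ratio; the function names and the 30000-byte figure are purely illustrative.

```python
def rss_bytes_per_check(size, p_changed):
    """Uncompressed feed, downloaded only on the checks where it changed."""
    return size * p_changed


def html_bytes_per_check(size, gzip_ratio):
    """Gzipped page, downloaded in full on every check."""
    return size / gzip_ratio


def cheaper_source(size, p_changed, gzip_ratio):
    """Pick whichever source costs fewer expected bytes per check."""
    rss = rss_bytes_per_check(size, p_changed)
    html = html_bytes_per_check(size, gzip_ratio)
    return "rss" if rss < html else "html"


# With 3x compression the break-even is a change on 1 in 3 checks.
print(cheaper_source(30000, p_changed=0.1, gzip_ratio=3.0))
print(cheaper_source(30000, p_changed=0.5, gzip_ratio=3.0))
```

In general the feed wins whenever the fraction of checks that find a change is below 1 / gzip_ratio, so with 4x compression the threshold drops to one change in every 4 checks.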
The real solution would be for gzip content-encoding to be supported on the RSS/Atom pages, giving us the best of both worlds. Is there any technical impediment to this?
Other useful links: the Cacheability Engine and a discussion about this with respect to other blogging software and RSS feeds.
Update: I forgot to mention that the cacheability report for LiveJournal thinks that the LiveJournal servers' clocks are about an hour behind real time. Ignore that, as nugget points out that web-caching.com are fools.