January 27th, 2004

Porn & captchas

Someone told me about an ingenious way that spammers were cracking "captchas" -- the distorted graphic words that a human being has to key into a box before Yahoo and Hotmail and similar services will give them a free email account. The idea is to require a human being and so prevent spammers from automatically generating millions of free email accounts.

The ingenious crack is to offer a free porn site which requires that you key in the solution to a captcha -- which has been inlined from Yahoo or Hotmail -- before you can gain access. Free porn sites attract lots of users around the clock, and the spammers were able to generate captcha solutions fast enough to create as many throw-away email accounts as they wanted.


RSS, gzip and 304

Some background:
I currently run an aggregator of sorts (it's web-based, and only checks for updates; it doesn't parse the content). Until recently I was just using Python's urllib, so I fetched the entire page each time. I'm now changing over to the HTTP routines from Mark Pilgrim's Feed Parser, which is based on urllib2 and supports conditional GET with 304 responses (the Last-Modified and ETag headers) as well as gzip-compressed pages (Accept-Encoding: gzip).

Now, the main journal pages can be sent gzipped (Content-Encoding: gzip), but are not cacheable (no Last-Modified or ETag headers). RSS and Atom feeds are cacheable (Last-Modified header), but are not gzipped. Since the RSS and Atom feeds are both full-content now, the raw RSS/Atom and the HTML journal page are pretty much the same size -- except that the HTML can be gzipped, dramatically reducing the transfer (3 or 4 times smaller is quite common).
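That kind of ratio is easy to check for yourself -- a quick sketch (the sample markup is made up, but repetitive HTML like a journal page behaves much the same):

```python
import gzip

def gzip_ratio(data: bytes) -> float:
    """How many times smaller gzip makes a blob: original size / compressed size."""
    return len(data) / len(gzip.compress(data))

# Markup full of repeated tags and class names compresses very well,
# which is why 3-4x shrinkage is common for HTML pages.
sample = b"<div class='entry'><p>Another day, another entry.</p></div>\n" * 200
print(f"{gzip_ratio(sample):.1f}x smaller")
```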

This leaves me with a dilemma: should I check the HTML page, since its gzipped transfer is smaller, or the RSS/Atom feed (Atom is larger than RSS, incidentally), since Last-Modified means it's only downloaded when a change has occurred? Currently the answer is to check the RSS, unless the journal is so active that it's likely to have updated by every third check, in which case the gzip savings make it better to poll the main journal page.
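The trade-off can be put into rough numbers. A back-of-the-envelope sketch -- the sizes and header overhead are made-up parameters, not measurements of any real journal:

```python
def expected_bytes_per_poll(size_bytes, change_prob, header_overhead=300):
    """Expected transfer per poll for a cacheable (conditional GET) resource.

    With probability change_prob the feed has changed and is downloaded in
    full; otherwise the 304 response costs only the header overhead.
    """
    return change_prob * size_bytes + (1 - change_prob) * header_overhead

FEED = 30_000     # hypothetical raw RSS/Atom: cacheable but not gzipped
HTML_GZ = 10_000  # same content as gzipped HTML (~3x smaller), not cacheable

# The HTML page costs HTML_GZ on every poll; the feed costs almost nothing
# until it changes. With gzip saving ~3x, the break-even is a change on
# roughly every third check, matching the rule of thumb above.
for p in (0.1, 1 / 3, 0.9):
    feed_cost = expected_bytes_per_poll(FEED, p)
    print(f"change prob {p:.2f}: feed {feed_cost:.0f} B vs html {HTML_GZ} B")
```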

The real solution would be for gzip Content-Encoding to be supported on the RSS/Atom feeds as well, giving us the best of both worlds. Is there any technical impediment to this?
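As far as I know there isn't -- it's usually just a matter of server configuration. A sketch for Apache 2's mod_deflate, assuming the feeds are served with the usual XML MIME types (your server and types may differ):

```apache
<IfModule mod_deflate.c>
    # Compress feeds as well as HTML. Clients that don't send
    # Accept-Encoding: gzip still get the uncompressed version.
    AddOutputFilterByType DEFLATE text/html text/xml application/xml
    AddOutputFilterByType DEFLATE application/rss+xml application/atom+xml
</IfModule>
```

Compression happens per-request, so it composes cleanly with Last-Modified/ETag handling: unchanged feeds still get a 304, and changed ones arrive gzipped.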

Other useful links: the Cacheability Engine and a discussion about this with respect to other blogging software and RSS feeds.

Update: I forgot to mention that the cacheability report for livejournal thinks that the livejournal servers' clocks are about an hour behind real time. Ignore that; as nugget points out, web-caching.com are fools.