As those who have worked on non-HTML S2 layouts will know, the set_content_type() builtin function remains unimplemented, even though it has been a couple of years now since S2 went properly live. There's actually a patch waiting in Zilla which implements this and the other remaining builtins, but it is blocked on an issue which I think needs some discussion.
The issue at hand is how to deal with sanitizing output for non-HTML formats. There are plenty of formats which can embed scripts and so gain the same kind of access that LiveJournal is trying to prevent by filtering scripts in the HTML cleaner. Currently this isn't an issue, because everything is served as text/html and so browsers don't try to interpret these scripts. However, if set_content_type() were allowed, a user could generate malicious code in a format that we don't know how to clean, and browsers would go ahead and run it.
There are several ways to tackle this problem, and none of them are really satisfactory:
- Only run the HTML cleaner for text/html. This is what would happen if the patch mentioned above were committed as-is. An obvious problem here is that any XML document can import the XHTML namespace and with it the script element, which IE dutifully obeys regardless of the containing document format. Even ignoring XHTML, XUL has its own script element which would allow an attacker to exploit Mozilla.
- Run the HTML cleaner on everything, regardless of Content-type. This is the approach currently taken on Fotobilder. It isn't nice, because it makes it impossible to generate certain types of XML document. For example, you cannot generate valid RSS because the HTML cleaner turns pubDate into pubdate, and XUL's iframe element gets eaten just like its HTML counterpart. I suspect that the HTML cleaner chokes on XML namespaces as well.
- Have a whitelist of acceptable types which are either safe or that we have some kind of cleaner for. A namespace-aware XML cleaner (possibly with an XML namespace whitelist, too) wouldn't be too hard to write to complement the HTML cleaner, so we could safely support text/xml. However, this means that any time a fancy new document format comes along the whitelist must be updated or users won't be able to use it.
Of these, I think the last option is the least distasteful. We will have to watch out for certain “helpful” browser behaviors, though, such as IE's tendency to interpret text/plain as HTML if the document has a bunch of angle brackets.
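To make the whitelist option a bit more concrete, here is a rough sketch of what I have in mind, written in Python purely for illustration (the real thing would live alongside the existing HTML cleaner). Everything named here is hypothetical: clean_output, clean_xml, CLEANERS and the contents of ALLOWED_NAMESPACES are my own placeholders, not anything in the patch. It also assumes that an unknown type is simply rejected, and it is not hardened against hostile XML such as entity-expansion tricks.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace whitelist. The empty string stands for "no
# namespace", which covers plain RSS 2.0 elements such as pubDate.
ALLOWED_NAMESPACES = {
    "",
    "http://www.w3.org/2005/Atom",
    "http://purl.org/dc/elements/1.1/",
}


def namespace_of(name):
    """Extract the namespace URI from ElementTree's '{uri}local' form."""
    if isinstance(name, str) and name.startswith("{"):
        return name[1:].partition("}")[0]
    return ""


def _prune(element):
    """Remove attributes and child elements outside the whitelist."""
    for attr in list(element.attrib):
        if namespace_of(attr) not in ALLOWED_NAMESPACES:
            del element.attrib[attr]
    for child in list(element):
        if namespace_of(child.tag) not in ALLOWED_NAMESPACES:
            # Drops the whole subtree, plus any text trailing the element.
            element.remove(child)
        else:
            _prune(child)


def clean_xml(body):
    """Namespace-aware cleaner: keep only whitelisted namespaces."""
    root = ET.fromstring(body)
    if namespace_of(root.tag) not in ALLOWED_NAMESPACES:
        raise ValueError("root element namespace is not whitelisted")
    _prune(root)
    return ET.tostring(root, encoding="unicode")


# Content-type whitelist: each acceptable type maps to its cleaner.
# text/html would map to the existing HTML cleaner, which is not shown.
CLEANERS = {
    "text/xml": clean_xml,
}


def clean_output(content_type, body, cleaners=CLEANERS):
    """Clean body with the cleaner for its type, or refuse the type."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    cleaner = cleaners.get(media_type)
    if cleaner is None:
        raise ValueError("content type not whitelisted: " + media_type)
    return cleaner(body)
```

With this, an RSS document keeps its pubDate element with the case intact, while a script element smuggled in via the XHTML namespace is dropped along with its contents. Refusing unknown types outright is only one choice; falling back to text/html and the HTML cleaner would be the other obvious one.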
It'd be nice to get these last builtin functions implemented, so I'd welcome any more ideas or thoughts on the ideas I've already listed.