control characters in XML
[Jan. 26th, 2004|11:00 am]
UTF-8 (and Unicode) define codepoints less than 32. They're all control characters, but they're allowed.
Unfortunately, XML 1.0 allows only three of those characters (tab, line feed, and carriage return). The full production is:
 Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
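As a rough illustration, the production above can be turned into a negated character class that finds anything XML 1.0 forbids. This is only a sketch in Python; the names `INVALID_XML_CHARS` and `is_valid_xml_text` are made up for this example and are not part of any existing code.

```python
import re

# Negated character class built directly from the XML 1.0 Char production:
# anything that is NOT tab, LF, CR, or in one of the three allowed ranges.
INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def is_valid_xml_text(text: str) -> bool:
    """Return True if every code point in text matches Char."""
    return INVALID_XML_CHARS.search(text) is None
```

For example, `is_valid_xml_text("ding\x07")` is False because 0x07 is outside every allowed range, while text containing only tabs and newlines passes.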
The question is: how do we produce RSS feeds of data that have characters like 0x07? (See also bug 1411, where the problem is reported but not identified.)
One solution is to just strip them. They're not really necessary because they can't be displayed anyway, and their existence in journals is due to bugs*.
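A minimal sketch of the stripping approach, assuming the text is already decoded to Unicode before it reaches the feed generator. The helper names (`strip_invalid_xml_chars`, `rss_description`) are hypothetical and are not taken from latest-rss.bml; the real feed code would need its own equivalent.

```python
import re
from xml.sax.saxutils import escape

# Everything outside the XML 1.0 Char production.
_INVALID = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text: str) -> str:
    """Drop every character that XML 1.0 does not allow."""
    return _INVALID.sub("", text)

def rss_description(text: str) -> str:
    """Hypothetical helper: strip first, then escape &, < and >."""
    cleaned = strip_invalid_xml_chars(text)
    return "<description>" + escape(cleaned) + "</description>"
```

Stripping happens before escaping so that the escaping step only ever sees characters the parser will accept.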
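The alternative is to encode them with a numeric character reference (something like &#x7;). Is that overkill?
It turns out that encoding them this way still produces invalid XML, because XML 1.0 requires a referenced character to match the Char production too. So I can't see any option other than stripping. (See: ASCII control characters in XML.)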
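This is easy to check against a real parser. The sketch below uses Python's standard `xml.etree.ElementTree` (which sits on top of expat); the `is_well_formed` helper is just a name invented for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A raw 0x07 byte and a character reference to it are both rejected,
# while a reference to an allowed control character (tab) is fine.
raw_bell = is_well_formed("<d>\x07</d>")
referenced_bell = is_well_formed("<d>&#x7;</d>")
referenced_tab = is_well_formed("<d>&#x9;</d>")
```

Both bell variants fail to parse, so the reference buys nothing over the raw character.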
Either way, our latest-rss.bml feed is producing invalid XML pretty regularly.
* (A better question is: where are these characters coming from? I believe it's from Netscape 4.x-era browsers, which append a random byte when submitting UTF-8 forms, or something like that. Here's a random example.
Are there any valid use cases for preserving these bytes instead of just stripping them?)