Evan Martin (evan) wrote in lj_dev,

control characters in XML

UTF-8 (and Unicode) define codepoints less than 32. They're all control characters, but they're allowed.

Unfortunately, XML 1.0 only allows a subset of those characters, specifically:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
The question is: how do we produce RSS feeds of data that have characters like 0x07? (See also bug 1411, where the problem is reported but not identified.)
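To make the Char production concrete, here is a minimal Python sketch of it as a predicate. The function names `is_xml10_char` and `find_invalid` are made up for illustration and are not part of the LiveJournal code base.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if the single character ch matches XML 1.0 production [2] Char."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF         # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD       # skips surrogates, U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF    # supplementary planes
    )


def find_invalid(text: str) -> list:
    """Hypothetical helper: list (index, hex code point) of characters XML 1.0 forbids."""
    return [(i, hex(ord(c))) for i, c in enumerate(text) if not is_xml10_char(c)]


if __name__ == "__main__":
    # 0x07 (BEL) is a legal Unicode code point but not a legal XML 1.0 character.
    print(find_invalid("entry text\x07 with a stray bell"))
```

Running something like `find_invalid` over journal text before it goes into a feed would at least show which entries carry the offending bytes.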

One solution is to just strip them. They're not really necessary, since they can't be displayed anyway, and their existence in journals is due to bugs*.
The alternative I first considered was to encode them as numeric character references (e.g. &#x7;), but that still produces invalid XML: XML 1.0 requires a character reference to point at a character matching the Char production above. So I can't see any other options. (See: ASCII control characters in XML.)
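As a rough sketch of the stripping approach, assuming the text is already decoded to a Unicode string, something like the following Python would work. The names `strip_invalid_xml_chars` and `parses` are hypothetical, and the parser check uses Python's standard `xml.etree.ElementTree` only to show that both the raw byte and the character reference are rejected.

```python
import re
import xml.etree.ElementTree as ET

# Matches anything outside the XML 1.0 Char production.
_INVALID_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Hypothetical helper: drop characters that XML 1.0 does not allow."""
    return _INVALID_XML10.sub("", text)


def parses(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    body = "beep\x07 ok"
    print(parses("<item>" + body + "</item>"))        # raw control character
    print(parses("<item>beep&#x7; ok</item>"))        # character reference
    print(parses("<item>" + strip_invalid_xml_chars(body) + "</item>"))
```

Stripping is lossy, but since these characters can't be displayed anyway, nothing visible to a reader is removed. Real feed text would still need the usual escaping of `&` and `<` on top of this.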

Either way, our latest-rss.bml feed is producing invalid XML pretty regularly.

* (A better question is: where are these characters coming from? I believe it's from Netscape 4.x-era browsers, which append a random byte when submitting UTF-8 forms, or something like that. Here's a random example.
Are there any valid use cases for preserving these bytes instead of just stripping them?)
