LiveJournal Development

January 26th, 2004

control characters in XML [Jan. 26th, 2004|11:00 am]
Unicode (and therefore UTF-8) defines code points below U+0020. They're all control characters, but they're perfectly legal in UTF-8 text.

Unfortunately, XML 1.0 only allows a subset of those characters, specifically:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
The question is: how do we produce RSS feeds of data that have characters like 0x07? (See also bug 1411, where the problem is reported but not identified.)
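To make the Char production concrete, here is a rough Python sketch that checks text against it. The function names are made up for illustration and are not part of the LiveJournal code.

```python
def is_xml10_char(ch: str) -> bool:
    """Return True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_invalid_chars(text: str) -> list:
    """Return (index, code point) pairs for characters XML 1.0 forbids."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if not is_xml10_char(ch)]
```

Something like find_invalid_chars() could be run over existing entries to see how widespread the problem is before deciding what to do about it.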

One solution is to just strip them. They're not really necessary because they can't be displayed anyway, and their existence in journals is due to bugs*.
The alternative would be to encode them as numeric character references (e.g. &#x7;). But a character reference in XML 1.0 has to refer to a character matching the same Char production, so that still produces XML that isn't well-formed, and I can't see any other options. (See: ASCII control characters in XML.)
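As a sketch of the stripping approach (in Python rather than the site's own code, with hypothetical helper names), the following removes everything outside the Char production and uses the standard library's expat-based parser to confirm that both a raw 0x07 and the escaped form &#x7; are rejected:

```python
import re
import xml.etree.ElementTree as ET

# Matches any character NOT allowed by the XML 1.0 Char production.
_INVALID_XML10 = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Drop characters that XML 1.0 does not allow."""
    return _INVALID_XML10.sub("", text)


def parses(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


body = "beep\x07 ok"
print(parses("<description>" + body + "</description>"))      # raw control char
print(parses("<description>beep&#x7; ok</description>"))      # character reference
print(parses("<description>" + strip_invalid_xml_chars(body) + "</description>"))
```

Only the last document parses. Note that stripping is separate from the usual escaping of &, < and >, which the feed still has to do.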

Either way, our latest-rss.bml feed is producing invalid XML pretty regularly.

* (A better question is: where are these characters coming from? I believe it's from Netscape 4.x-era browsers, which append a random byte when submitting UTF-8 forms, or something like that. Here's a random example.
Are there any valid use cases for preserving these bytes instead of just stripping them?)