Results of my investigation
I believe there is a major bug hidden in the HTML cleaner.
I have tried creating a comment with the following contents:
<a href="http://www.server.com/url?val=v&aring;l">Test</a>
The HTML cleaner lets HTML::TokeParser take everything apart and then tries to put it back together again. However, HTML::TokeParser completely decodes the attribute's value, replacing the &aring; entity with an actual å character; furthermore, it uses Latin-1, not UTF-8, when doing so.
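HTML::TokeParser is Perl, but the same behaviour is easy to reproduce with Python's html.parser (an analogy, not LiveJournal's actual code): entities inside attribute values are decoded before the caller ever sees them.

```python
from html.parser import HTMLParser

# Analogy in Python: like HTML::TokeParser, html.parser hands you
# attribute values with all entities already decoded.
class AttrDumper(HTMLParser):
    def handle_starttag(self, tag, attrs):
        self.last_attrs = dict(attrs)

p = AttrDumper()
p.feed('<a href="http://www.server.com/url?val=v&aring;l">Test</a>')
print(p.last_attrs["href"])  # -> http://www.server.com/url?val=vål
```

The parser gives you no flag saying "this å used to be an entity", which is exactly why the cleaner cannot re-emit what the user typed.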
Apart from escaping &, <, and >, LiveJournal's HTML cleaner uses this data as-is. The result is a stray Latin-1 å, and therefore invalid UTF-8, in the final output.
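To see why that stray byte is fatal: the Latin-1 encoding of å is the single byte 0xE5, which is not a valid UTF-8 sequence on its own. A quick sketch:

```python
# "å" as Latin-1 is the single byte 0xE5; as UTF-8 it is the two
# bytes 0xC3 0xA5.
latin1_bytes = "å".encode("latin-1")
print(latin1_bytes)             # b'\xe5'
print("å".encode("utf-8"))      # b'\xc3\xa5'

# A lone 0xE5 byte is an incomplete UTF-8 sequence, so any strict
# UTF-8 decoder will reject the cleaner's output.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8")
```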
However, this isn't where the problems end. Following the bug report in Bug 821, I tried using the Unicode character 子 (entered as an HTML entity) in a URL. What happens on my machine is this: I run perl 5.8, which supports wide strings (strings containing Unicode characters), so HTML::TokeParser replaces the entity with the actual Unicode character. Then something downstream bails out because it can't handle the wide string, and that is what caused the error I mentioned in my previous posting ("Sorry - We're currently working on something. The site will be back up shortly.").
However, I believe LiveJournal runs perl 5.6, which does not support wide strings yet. As a result, I think, HTML::TokeParser keeps the HTML entity as-is because it can't decode it. The result is that LJ::ehtml re-escapes the ampersand, and the URL gets b0rked.
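The perl 5.6 failure mode is plain mechanical double-escaping. Sketching it with a minimal Python stand-in (the ehtml below is my simplification, not the real LJ::ehtml): once the surviving entity's ampersand is escaped again, the browser no longer sees an entity at all.

```python
def ehtml(s):
    # Minimal stand-in for LJ::ehtml (hypothetical simplification):
    # escape the HTML metacharacters, ampersand first.
    return (s.replace("&", "&amp;")
             .replace("<", "&lt;")
             .replace(">", "&gt;"))

# Suppose the entity &#23376; (子) survived the tokenizer intact
# under perl 5.6 ...
url = "http://www.server.com/url?q=&#23376;"

# ... re-escaping then turns it into &amp;#23376;, which renders as
# the literal text "&#23376;" instead of the character.
print(ehtml(url))  # -> http://www.server.com/url?q=&amp;#23376;
```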
I honestly don't know what the best way to handle this would be. If HTML::TokeParser interprets some entities but keeps others intact, we cannot reliably reconstruct what the user originally entered. Additionally, I really have no clue how to stop my system from giving the "We're working on something" error.
