April 2nd, 2003

Error on talkread.bml


I was just working on something when, completely out of the blue, talkread.bml started giving me this error when I tried to access the URL http://timwi.dyndns.org/users/timwi/809.html?nc=3:

We're currently working on something. The site will be back up shortly.

I've reverted everything back to newest CVS code, but the error persists.


UPDATE: There is a connection between this and http://zilla.livejournal.org/show_bug.cgi?id=823. Other talkread.bml pages work fine on my server; this error occurs when there's invalid UTF-8 somewhere in a comment.

New Zilla keyword: need-improvement

We have added a new keyword to signify that a patch has been reviewed (not to be confused with the 'reviewed' keyword), but has been found to have problems which need to be addressed.

If a patch has the need-improvement keyword, then it should have some identified problems which should be fixed by the submitter of the patch.

We also made a very minor change to the description of the 'reviewed' keyword.

Results of my investigation

I believe there is a major bug hidden in the HTML cleaner.

I have tried creating a comment with the following contents:
<a href="http://www.server.com/url?val=v&aring;l">Test</a>

The HTML cleaner lets HTML::TokeParser take everything apart and then tries to put everything together again. However, HTML::TokeParser completely decodes the attribute's value and thus replaces the &aring; entity with an actual å character. Furthermore, it uses Latin-1, not UTF-8, when doing so.

Apart from escaping &, < and >, LiveJournal's HTML cleaner uses this information 'as is'. As a result, there is a stray Latin-1 å, and therefore invalid UTF-8, in the final output.
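
To make the failure concrete, here is a minimal sketch in Python rather than LiveJournal's Perl. It assumes Python's html.unescape is a fair stand-in for the entity decoding HTML::TokeParser performs, and is_valid_utf8 is a hypothetical helper, not anything in the LJ code base. It only shows why a Latin-1 å is not valid UTF-8.

```python
import html

# The attribute value from the test comment above.
raw_href = "http://www.server.com/url?val=v&aring;l"

# Stand-in for the parser's entity decoding: &aring; becomes a real å.
decoded = html.unescape(raw_href)

# The same character serialised in the two encodings.
latin1_bytes = decoded.encode("latin-1")  # å is the single byte 0xE5
utf8_bytes = decoded.encode("utf-8")      # å is the two bytes 0xC3 0xA5


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(is_valid_utf8(latin1_bytes))  # the stray Latin-1 byte
    print(is_valid_utf8(utf8_bytes))    # the properly encoded form
```

A lone 0xE5 byte announces a three-byte UTF-8 sequence, but the next byte is a plain "l", so a strict decoder rejects the Latin-1 output.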

However, this isn't where the problems end. Following the bug report at Bug 821, I tried using the Unicode character &#23376; in a URL. Here is what happens on my computer: I have perl 5.8, which supports wide strings (strings containing Unicode characters), so HTML::TokeParser replaces the entity with the actual Unicode character. Something further down the line then can't handle it and bails out, and that is what caused the error I mentioned in my previous posting ("Sorry - We're currently working on something. The site will be back up shortly.").

However, I believe LiveJournal only uses perl 5.6, which does not use wide strings yet. As a result, I think, HTML::TokeParser keeps the HTML entity as is because it can't handle it. The result is that LJ::ehtml re-escapes the ampersand and thus the URL gets b0rked.
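
The double-escaping I suspect on perl 5.6 can be sketched like this, again in Python rather than Perl. It assumes the parser hands the entity back untouched and that html.escape behaves roughly like LJ::ehtml as far as the ampersand is concerned; the URL is made up for the example.

```python
import html

# Hypothetical URL containing a numeric entity, as the user typed it.
original = "http://www.server.com/url?val=&#23376;"

# Assumption: the parser cannot decode the entity and returns it unchanged.
undecoded = original

# The cleaner then escapes the value again, so '&' becomes '&amp;'.
re_escaped = html.escape(undecoded, quote=True)

# What a browser recovers after decoding the escaped attribute once.
browser_sees = html.unescape(re_escaped)

# What the user actually meant: the entity decoded to the real character.
intended = html.unescape(original)

if __name__ == "__main__":
    print(re_escaped)
    print(browser_sees)
    print(intended)
```

The browser ends up with the literal text of the entity in the URL instead of the character the user intended, which is the mangled URL described above.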

I honestly don't know what would be the best way to handle this. If HTML::TokeParser interprets some entities, but keeps others intact, we cannot reliably predict what the user originally entered. Additionally, I really have no clue how to stop my system from giving the "We're working on something" error.