April 2nd, 2003

  • timwi

Error on talkread.bml

Hi,

I was working on something when, completely out of the blue, talkread.bml started giving me this error whenever I tried to access the URL http://timwi.dyndns.org/users/timwi/809.html?nc=3:

Sorry
We're currently working on something. The site will be back up shortly.

I've reverted everything back to newest CVS code, but the error persists.


UPDATE: There is a connection between this and http://zilla.livejournal.org/show_bug.cgi?id=823. Other talkread.bml pages work fine on my server; this error occurs when there's invalid UTF-8 somewhere in a comment.
computer crap

New Zilla keyword: need-improvement

We have added a new keyword to signify that a patch has been reviewed (not to be confused with the 'reviewed' keyword) but has been found to have problems that need to be addressed.

If a patch has the need-improvement keyword, it should have specific, identified problems for the submitter of the patch to fix.

We also made a very minor change to the description of the 'reviewed' keyword.
  • timwi

Results of my investigation

I believe there is a major bug hidden in the HTML cleaner.

I have tried creating a comment with the following contents:
<a href="http://www.server.com/url?val=v&aring;l">Test</a>

The HTML cleaner lets HTML::TokeParser take everything apart and then tries to put everything together again. However, HTML::TokeParser completely decodes the attribute's value and thus replaces the &aring; entity with an actual å character. Furthermore, it uses Latin-1, not UTF-8, when doing so.

Apart from escaping &, < and >, LiveJournal's HTML cleaner uses this decoded value 'as is'. As a result, there is a stray Latin-1 å, and therefore invalid UTF-8, in the final output.
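
To illustrate why a lone Latin-1 å breaks a UTF-8 page, here is a minimal Python sketch. It is not LiveJournal's Perl code: `decode_attr_latin1` and `is_valid_utf8` are hypothetical helpers, and Python's `html.unescape` merely stands in for the entity decoding described above.

```python
import html


def decode_attr_latin1(value: str) -> bytes:
    """Decode HTML entities, then emit the result as Latin-1 bytes.

    Hypothetical stand-in for the behaviour described in the post.
    """
    return html.unescape(value).encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


raw = "http://www.server.com/url?val=v&aring;l"

latin1_bytes = decode_attr_latin1(raw)           # å becomes the single byte 0xE5
utf8_bytes = html.unescape(raw).encode("utf-8")  # å becomes 0xC3 0xA5

print(is_valid_utf8(latin1_bytes))  # False
print(is_valid_utf8(utf8_bytes))    # True
```

In UTF-8, the byte 0xE5 announces a three-byte sequence, but here it is followed by a plain ASCII "l", so the output is not valid UTF-8.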

However, this isn't where the problems end. Following the bug report in Bug 821, I tried using the Unicode character &#23376; in a URL. What happens on my computer is this: I have Perl 5.8, which supports wide strings (strings with Unicode characters), so HTML::TokeParser replaces this entity with that Unicode character. Then something further down the line bails out because it can't handle the wide character, and that is what caused the error I mentioned in my previous posting ("Sorry - We're currently working on something. The site will be back up shortly.").

However, I believe LiveJournal only uses Perl 5.6, which does not support wide strings yet. As a result, I think, HTML::TokeParser keeps the HTML entity as is because it can't represent the character. The result is that LJ::ehtml re-escapes the ampersand and thus the URL gets b0rked.
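
The double escaping can be sketched in Python as follows. This is an illustration under stated assumptions, not the real code: `ehtml_like` is a hypothetical stand-in for LJ::ehtml (assumed here to escape &, <, > and quotes), and the URL is made up for the example.

```python
import html


def ehtml_like(text: str) -> str:
    """Hypothetical stand-in for LJ::ehtml: escape HTML special characters."""
    return html.escape(text, quote=True)


# Assumed case: the parser left the numeric entity undecoded in the attribute.
kept_as_is = "http://www.server.com/url?val=&#23376;"

escaped = ehtml_like(kept_as_is)
print(escaped)
# http://www.server.com/url?val=&amp;#23376;

# A browser decoding the escaped attribute now sees the literal text
# "&#23376;" instead of the character the user meant.
print(html.unescape(escaped) == kept_as_is)               # True
print(html.unescape(kept_as_is).endswith(chr(23376)))     # True
```

The escaped form round-trips to the entity text rather than to the character, which is why the link target changes.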

I honestly don't know what would be the best way to handle this. If HTML::TokeParser interprets some entities, but keeps others intact, we cannot reliably predict what the user originally entered. Additionally, I really have no clue how to stop my system from giving the "We're working on something" error.