Timwi (timwi) wrote in lj_dev,

Results of my investigation

I believe there is a major bug hidden in the HTML cleaner.

I have tried creating a comment with the following contents:
<a href="http://www.server.com/url?val=v&aring;l">Test</a>

The HTML cleaner lets HTML::TokeParser take everything apart and then tries to put everything together again. However, HTML::TokeParser completely decodes the attribute's value and thus replaces the &aring; entity with an actual å character - furthermore, it uses Latin-1, not UTF-8, when doing so.

Apart from escaping &/</>, LiveJournal's HTML cleaner uses this information 'as is'. As a result, there is a stray Latin-1 å, and therefore invalid UTF-8, in the final output.

However, this isn't where problems end. Following the bugreport at Bug 821, I tried using the Unicode character &#23376; in a URL. What happens on my computer is this: I have perl 5.8, so it does wide strings (strings with Unicode characters), so HTML::TokeParser replaces this entity with that Unicode character. Then something bails out because something can't handle it, and that is what caused the error I mentioned in my previous posting ("Sorry - We're currently working on something. The site will be back up shortly.").

However, I believe LiveJournal only uses perl 5.6, which does not use wide strings yet. As a result, I think, HTML::TokeParser keeps the HTML entity as is because it can't handle it. The result is that LJ::ehtml re-escapes the ampersand and thus the URL gets b0rked.

I honestly don't know what would be the best way to handle this. If HTML::TokeParser interprets some entities, but keeps others intact, we cannot reliably predict what the user originally entered. Additionally, I really have no clue how to stop my system from giving the "We're working on something" error.

  • Post a new comment


    Anonymous comments are disabled in this journal

    default userpic

    Your reply will be screened

    Your IP address will be recorded