January 9th, 2004

amused, happy
  • mart

Rambling about Timezones

I read over Brad's ancient document on timezones (August 2000!) to remind myself of the issues, so here's some rambling on the subject of timezone support in LJ.

The current state of affairs is that the flat protocol (and XML-RPC too, I hope) supports a “tz” field for postevent, which can be set either to “guess”, to have the server guess a timezone, or to a signed four-digit time offset such as +0100, -0700 or +0000. (That's HHMM, by the way.) At the moment not a lot is done with this and no clients actually send it, but at least the thought is there. As far as I know, logtime and comment post times are now being stored as GMT (Brad?), which is also a good start.
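
To make the field format concrete, here is a minimal Python sketch of how a “tz” value might be validated. The helper name `parse_tz_field` and its error behaviour are assumptions for illustration only, not anything in the LJ codebase.

```python
import re
from datetime import timedelta, timezone

# Sign followed by exactly four digits: HHMM.
_TZ_RE = re.compile(r"([+-])([0-9]{2})([0-9]{2})")


def parse_tz_field(value):
    """Hypothetical helper: interpret the protocol's "tz" field.

    Returns None for "guess" (the server is asked to guess), otherwise a
    fixed-offset tzinfo built from a "+HHMM" / "-HHMM" string.
    """
    if value == "guess":
        return None
    match = _TZ_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"bad tz field: {value!r}")
    sign = match.group(1)
    hours = int(match.group(2))
    minutes = int(match.group(3))
    if hours > 23 or minutes > 59:
        raise ValueError(f"offset out of range: {value!r}")
    offset = timedelta(hours=hours, minutes=minutes)
    return timezone(-offset if sign == "-" else offset)
```

With this, `parse_tz_field("-0700")` gives a tzinfo seven hours behind GMT, and anything that is neither “guess” nor a signed HHMM string is rejected rather than silently guessed at.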

With this in mind, I think we need to do some or all of the following:

  • Have a GMT offset stored for each entry, and keep eventtime in GMT. On post, use the timezone field to calculate a GMT version of the given event time and store that, along with the current time in GMT.

  • If a value for timezone (other than “guess”) is sent with postevent, allow all of the date and time fields to be omitted, which tells the server “use the current time”. (The server's clock is pretty accurate, since ntp keeps it set to a good clock.) It's an error to include some fields but not others, though; all or nothing. This becomes the recommended way to submit any entry that does not have the date explicitly set by the user, thus removing the problem of client-side clocks being wrong.

  • Clients should use some locally-defined way to detect a sensible timezone. Windows “knows” what timezone it is in, as does every other major OS I'd imagine. I know that both Windows and my Linux boxes manage to track daylight savings time properly too. This setting should be the default, but users should be able to override it if they know better. (how they override it is an implementation issue) (Can the “web client” determine the local timezone from JavaScript?)

  • Allow users to select the following two settings for their account:

    • A “display” timezone used for display purposes whenever something needs to be viewed in the user's local time. This could possibly be overridable with a cookie for non-users or users who want to change their setting temporarily.
    • A “default” zone used if a client sends a postevent with no timezone setting; we simply assume the given time is in the default timezone.

    These are the only two timezone settings here which are stored not as absolute GMT offsets but as identifiers which map onto a set of rules describing daylight savings time and so on. Data on the special rules for different regions is readily available and even comes with most Linux distros. I'm sure there's a Perl module around which can parse and use this data. The purpose of storing them this way is to allow for automagically using the correct offset at different times of year without the user explicitly changing it.

  • Sort all journal views by the GMT value stored in eventtime. Friends view will no longer be sorted by logtime. (Yay!)

  • For S1, just supply the time to the styles in the poster's timezone, mimicking current behavior. The friends page will still do the wonky behavior with start/end day, but at least it won't get any worse.

  • S2 improvements abound:

    • Extend Date to contain a new member containing the timezone offset as a string like the timezone field in the protocol, and give a way to resolve that to a human-friendly (and localized) display string such as “PDT” or “BST”.

    • Supply a new member in the S2 EntryLite class called localtime. This will contain the time in the timezone set as default by the journal owner. This may sound a bit weird since we're ignoring the viewer's selected display timezone, but current convention is that in S2 the journal owner wins: we don't use the viewer's browselang for language, for example.

    • The time in EntryLite goes on being in the poster's timezone: the values are set as for S1's time. We have the new field containing the timezone, however, so this is no longer ambiguous.

    • System layouts can be modified to display both the poster's time and the local time, but only if the two fields have different timezones. No sense in displaying redundant data; remember that localtime is calculated from the same GMT time as eventtime, so if the timezones are the same they will be identical.

    • The “new day” stuff should group based on the local time field, not on the poster's time. This is of most importance on the friends view, but is also important on the recent view where entries could be posted by users in different timezones (community/shared journals) or by a single user who goes to a different timezone temporarily.

    Existing non-system layouts will go on doing what they do now: displaying the poster's time. This means the layout owner can go back and add the new time at their leisure. (In other words, it's backwards compatible)

  • Pre-timezone entries will assume the poster's default timezone. The user can manually set the timezone on any entries for which this is not correct. (I assume most people stay in one timezone most of the time)

  • The protocol will block posting of entries without a timezone specification until a default timezone has been selected. This is probably the most controversial thing here, but it will encourage (nay, force) users to choose their timezone, avoiding incorrect data accumulating while they don't realise they need to. The web-based interface could potentially be clever and just present the “choose default timezone” form to the user the first time they try to post without one set, being very careful to emphasise that it's a default timezone, not just for this entry.

  • New users will be required to choose their timezone on signup. This could be part of a new hand-holdy multi-step signup process which allows users to pick their default language (later), default/display timezones and an initial S2 layout from a subset of the available system layouts. (The last of these is inspired by Blogger's signup process, and I received good feedback on this idea when I proposed it a few years back.)
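
The core of the list above (store the event time in GMT plus the poster's offset, sort on the GMT value, and show a second “local” time only when it differs) can be sketched roughly as follows in Python. The function names `to_gmt` and `display_times` are hypothetical, and offsets are passed as minutes east of GMT purely for illustration.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def to_gmt(event_local: datetime, offset_minutes: int) -> datetime:
    """Turn the poster's wall-clock time plus their offset into a GMT time.

    Hypothetical helper: event_local is a naive datetime as the client
    sent it, offset_minutes is the poster's offset east of GMT.
    """
    poster_tz = timezone(timedelta(minutes=offset_minutes))
    return event_local.replace(tzinfo=poster_tz).astimezone(timezone.utc)


def display_times(event_gmt: datetime, poster_offset_minutes: int,
                  owner_tz: tzinfo):
    """Return (poster_time, local_time), with local_time None if redundant.

    Both values are derived from the same stored GMT time, so when the
    two offsets match they would be identical and only one is returned.
    """
    poster_tz = timezone(timedelta(minutes=poster_offset_minutes))
    poster_time = event_gmt.astimezone(poster_tz)
    local_time = event_gmt.astimezone(owner_tz)
    if local_time.utcoffset() == poster_time.utcoffset():
        return poster_time, None
    return poster_time, local_time


# Two entries whose wall-clock order differs from their real order.
first = to_gmt(datetime(2004, 1, 9, 23, 30), -480)  # posted at -0800
second = to_gmt(datetime(2004, 1, 10, 2, 0), 60)    # posted at +0100
in_view_order = sorted([first, second])             # sorts on GMT
```

Here `second` sorts ahead of `first` even though its wall-clock time looks later, which is the friends-view ordering problem the GMT sort is meant to fix. The `owner_tz` argument accepts any tzinfo, so a named zone such as `zoneinfo.ZoneInfo("Europe/London")` could be passed instead of a fixed offset to get the daylight-savings behaviour described for the stored “display” and “default” settings.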

Well, that was long. Still, Brad said he was thinking about moving up to dataversion 3 with some major changes, so timezones might as well be one of these major changes. If anyone is brave enough to read this big, rambly list it'd be cool to discuss this.

Quick note: I don't have time to work on this stuff right now, and it's probably not going to be done for a while yet. :)

Help needed: profiling LJ

Anybody good at Perl profiling?

I sped up the HTML cleaner a bunch today, using the Benchmark module. That works when you already know a major area that's a hot spot and slow, which I knew the HTML cleaner was.

But, when you don't know where to look, you turn to profilers, like Devel::DProf, which apparently works okay in Perl 5.8, but not Perl 5.6 (where it just segfaults).

Anyway, I obtained the following from Apache::DProf on a Perl 5.8 development machine:


(just haphazardly clicking around my dev install trying to emphasize hits on where we get the most: lastn, friends, comments, userinfo, etc)

# dprofpp -u -O 20
Apache::LiveJournal::journal_content has -1 unstacked calls in outer
Exporter::import has -1 unstacked calls in outer
AutoLoader::__ANON__ has 1 unstacked calls in outer
Compress::Zlib::__ANON__ has 1 unstacked calls in outer
IO::Socket::import has -1 unstacked calls in outer
AutoLoader::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export has 2 unstacked calls in outer
Total Elapsed Time = 98.84301 Seconds
         User Time = 3.843011 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 31.7   1.219  0.047     20   0.0610 0.0024  Apache::LiveJournal::journal_content
 31.3   1.204  1.709     10   0.1204 0.1709  Apache::BML::handler
 6.25   0.240  0.322   1464   0.0002 0.0002  BML::__ANON__
 5.85   0.225  0.232    785   0.0003 0.0003  DBD::mysql::db::prepare
 2.84   0.109  0.139     56   0.0020 0.0025  Apache::LiveJournal::db_logger
 2.84   0.109  0.127     56   0.0019 0.0023  Apache::LiveJournal::trans
 2.81   0.108  0.125      8   0.0135 0.0156  LJ::Talk::load_comments
 1.04   0.040  0.050      8   0.0050 0.0062  LJ::load_codes
 1.01   0.039  0.049     16   0.0025 0.0030  LJ::load_user_props
 1.01   0.039  0.038     86   0.0005 0.0004  DBD::mysql::dr::connect
 0.96   0.037  0.037    842   0.0000 0.0000  DBI::st::fetch
 0.78   0.030  0.030     68   0.0004 0.0004  BML::Cookie::FETCH
 0.78   0.030  0.030     76   0.0004 0.0004  LJ::img
 0.75   0.029  0.037     57   0.0005 0.0007  LJ::get_remote
 0.75   0.029  0.037     56   0.0005 0.0007  LJ::end_request
 0.70   0.027  0.027    871   0.0000 0.0000  DBD::_mem::common::DESTROY
 0.52   0.020  0.020     19   0.0010 0.0010  Apache::BML::reset_codeblock
 0.52   0.020  0.020      6   0.0033 0.0033  LJ::get_bio
 0.52   0.020  0.020      1   0.0198 0.0196  LJ::Talk::Post::init
 0.52   0.020  0.039     14   0.0014 0.0028  LJ::load_userpics

Now, why are Apache::LiveJournal::journal_content and Apache::BML::handler reporting so much CPU usage? They hardly do anything. They do, however, call the XS-based gzip code, which I believe Devel::DProf can't trace, so it's reported in the caller instead. (bleh) So I guess we have to ignore both of those, which is about 63% of the runtime.
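
For anyone wanting to redo that arithmetic on their own trace, here is a small Python sketch that adds up the %Time column for a chosen set of subs. The three rows are copied from the listing above; the function name `exclusive_share` is made up, and the parsing assumes the seven-column layout dprofpp printed here.

```python
# Rows copied from the dprofpp listing above (top three entries only).
ROWS = """\
 31.7   1.219  0.047     20   0.0610 0.0024  Apache::LiveJournal::journal_content
 31.3   1.204  1.709     10   0.1204 0.1709  Apache::BML::handler
 6.25   0.240  0.322   1464   0.0002 0.0002  BML::__ANON__
"""


def exclusive_share(rows, names):
    """Sum the %Time column for the subs listed in names.

    Assumes each data row has exactly seven whitespace-separated fields,
    with %Time first and the sub name last; other lines are skipped.
    """
    total = 0.0
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[6] in names:
            total += float(fields[0])
    return total


untraceable = exclusive_share(
    ROWS,
    {"Apache::LiveJournal::journal_content", "Apache::BML::handler"},
)
```

The two handlers add up to 31.7 + 31.3 percent of the exclusive time, so everything else in the listing is being measured against a bit over a third of the user time.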

Unless: I wonder if a Debian package of the Compress::Zlib built for 686 would help at all. Anybody able to profile the difference for me, or give me pointers on the best Debian way to go about that?

Moving on, we can't get at the profiling of BML::__ANON__ stuff (which is all BML pages). But... for profiling, we have a special BML mode that compiles all BML pages to named subs instead of anonymous subs. That might prove interesting.

DBI stuff: ignore. I wasn't profiling with memcache. That stuff mostly goes away.

Apache::LiveJournal::db_logger/::trans: again, those just call a lot of XS stuff, they're not actually slow. (::trans might help to improve, but db_logger is simple)

load_comments.... it is big and perhaps CPU heavy, but I don't see any obvious way to improve it, and it's fragile, so I don't want to really mess with it.

I think the big thing to do here is investigate how to make gzip stuff faster, if possible, and break up the BML reporting so we can see where we really need to work.

The good thing is that the HTML cleaner doesn't even show up in that list anymore.

Curious: why don't I see any S1 or S2 function calls in the dprofpp trace? (dprofpp -T shows no match for "create_view" or "LJ::S2", which it should?) Does Devel::DProf exclude certain things? I don't see the pattern. Oh, probably because all that S1/S2/"make_journal" stuff is called from a closure within Apache::LiveJournal::journal_content .... ah, hell. I think Devel::DProf is pretty much useless for us, then?

Update: Rebuilt libcompress-zlib-perl with apt-build to see if gcc could make faster code for it if it didn't have to be 386-safe. Benchmarks, unfortunately, are exactly the same.

SPF records

Now that even AOL is publishing SPF records, I figure it's not a dorky dream that'll get nowhere, and LJ might as well support it too!

$ dig @ livejournal.com txt | grep spf
livejournal.com. 3600 IN TXT "v=spf1 a mx ip4: include:danga.com ?all"

Unfortunately, I couldn't use "-all" (the strict mode), because I figure there are people who are sending mail from @livejournal.com just by forging their return address (if they have a paid account, they can get the reply).
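
To show what the trailing "?all" versus "-all" actually changes, here is a rough Python sketch that splits an SPF record into its mechanisms and qualifiers. The sample record is a simplified version of the one above with the ip4 term left out, since its address isn't shown; `parse_spf` is a made-up helper, and it ignores modifiers such as "redirect=" and CIDR suffixes.

```python
# Qualifier prefixes and the result each one gives when its mechanism matches.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def parse_spf(record):
    """Hypothetical helper: split an SPF record into (result, mechanism, argument).

    A mechanism with no qualifier prefix defaults to "+" (pass).
    """
    terms = record.split()
    if not terms or terms[0] != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    parsed = []
    for term in terms[1:]:
        qualifier = "+"
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        mechanism, _, argument = term.partition(":")
        parsed.append((QUALIFIERS[qualifier], mechanism, argument))
    return parsed


# Simplified sample: the ip4 term from the real record is omitted.
neutral_policy = parse_spf("v=spf1 a mx include:danga.com ?all")
strict_policy = parse_spf("v=spf1 a mx include:danga.com -all")
```

The only difference is the last tuple: with "?all", mail from any host not matched earlier gets a neutral result (no opinion), while "-all" would make it an outright fail, which is what would break the people forging their @livejournal.com return address.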

We could deprecate that if we gave people an alternate means to send mail (SMTP AUTH, Webmail, ?), but then we have to deal with spammers paying $2.50/month for access to our outbound SMTP servers. I suppose we could just limit the number of outgoing emails per hour or per day for each user.
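
A per-user cap like that could be sketched as a sliding-window counter. This is only an illustration in Python: the class name `OutboundMailLimiter`, the in-memory storage and the injectable clock are all assumptions, and a real deployment would need shared state across the mail servers.

```python
import time
from collections import defaultdict, deque


class OutboundMailLimiter:
    """Hypothetical sliding-window limit on outgoing mail per user."""

    def __init__(self, max_per_window, window_seconds, clock=time.monotonic):
        self._max = max_per_window
        self._window = window_seconds
        self._clock = clock
        # user -> timestamps of recently accepted messages, oldest first
        self._sent = defaultdict(deque)

    def allow(self, user):
        """Record and permit one message, or return False if over the cap."""
        now = self._clock()
        sent = self._sent[user]
        # Forget messages that have aged out of the window.
        while sent and now - sent[0] >= self._window:
            sent.popleft()
        if len(sent) >= self._max:
            return False
        sent.append(now)
        return True
```

For example, `OutboundMailLimiter(50, 3600)` would let each user send at most 50 messages in any rolling hour; the numbers are placeholders, not a proposal.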

Anyway, just a heads up that you can now make your SpamAssassin give negative points to LJ emails, since they can be authenticated as coming from us.