I sped up the HTML cleaner a bunch today, using the Benchmark module. That approach works when you already know a major area is a hot spot, which I knew the HTML cleaner was.
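For flavor, here's the kind of Benchmark comparison I mean. The two tag-stripping subs below are made-up stand-ins, not the real HTML cleaner code; the point is just the cmpthese workflow:

```perl
# Hypothetical micro-benchmark: compare two ways of stripping tags.
# Neither sub is the real cleaner; they're illustrative only.
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $html = '<p>Hello <b>world</b></p>' x 200;

cmpthese(1000, {    # run each candidate 1000 times
    regex_global => sub {
        (my $t = $html) =~ s/<[^>]+>//g;   # strip tags with one global regex
        return $t;
    },
    split_join => sub {
        # split on tags (keeping them via the capture), drop them, rejoin
        return join '', grep { !/^</ } split /(<[^>]+>)/, $html;
    },
});
```

cmpthese prints a rate table with each candidate's percentage advantage over the others, which is exactly what you want when tuning a known hot spot.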
But when you don't know where to look, you turn to a profiler like Devel::DProf, which apparently works okay in Perl 5.8, but not Perl 5.6 (where it just segfaults).
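For anybody following along, the basic workflow looks like this (a sketch; under mod_perl you use Apache::DProf instead of `-d:DProf`, and it drops the profile under the server root):

```
# profile a script; Devel::DProf writes tmon.out in the current directory
perl -d:DProf myscript.pl

# then summarize: -u sorts by user time, -O 20 shows the top 20 subs
dprofpp -u -O 20 tmon.out
```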
Anyway, I obtained the following from Apache::DProf on a Perl 5.8 development machine:
(just haphazardly clicking around my dev install, trying to emphasize hits on where we get the most traffic: lastn, friends, comments, userinfo, etc.)
# dprofpp -u -O 20
Apache::LiveJournal::journal_content has -1 unstacked calls in outer
Exporter::import has -1 unstacked calls in outer
AutoLoader::__ANON__ has 1 unstacked calls in outer
Compress::Zlib::__ANON__ has 1 unstacked calls in outer
IO::Socket::import has -1 unstacked calls in outer
AutoLoader::AUTOLOAD has -1 unstacked calls in outer
Exporter::Heavy::heavy_export has 2 unstacked calls in outer
Total Elapsed Time = 98.84301 Seconds
  User Time = 3.843011 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
 31.7   1.219  0.047     20   0.0610 0.0024  Apache::LiveJournal::journal_content
 31.3   1.204  1.709     10   0.1204 0.1709  Apache::BML::handler
 6.25   0.240  0.322   1464   0.0002 0.0002  BML::__ANON__
 5.85   0.225  0.232    785   0.0003 0.0003  DBD::mysql::db::prepare
 2.84   0.109  0.139     56   0.0020 0.0025  Apache::LiveJournal::db_logger
 2.84   0.109  0.127     56   0.0019 0.0023  Apache::LiveJournal::trans
 2.81   0.108  0.125      8   0.0135 0.0156  LJ::Talk::load_comments
 1.04   0.040  0.050      8   0.0050 0.0062  LJ::load_codes
 1.01   0.039  0.049     16   0.0025 0.0030  LJ::load_user_props
 1.01   0.039  0.038     86   0.0005 0.0004  DBD::mysql::dr::connect
 0.96   0.037  0.037    842   0.0000 0.0000  DBI::st::fetch
 0.78   0.030  0.030     68   0.0004 0.0004  BML::Cookie::FETCH
 0.78   0.030  0.030     76   0.0004 0.0004  LJ::img
 0.75   0.029  0.037     57   0.0005 0.0007  LJ::get_remote
 0.75   0.029  0.037     56   0.0005 0.0007  LJ::end_request
 0.70   0.027  0.027    871   0.0000 0.0000  DBD::_mem::common::DESTROY
 0.52   0.020  0.020     19   0.0010 0.0010  Apache::BML::reset_codeblock
 0.52   0.020  0.020      6   0.0033 0.0033  LJ::get_bio
 0.52   0.020  0.020      1   0.0198 0.0196  LJ::Talk::Post::init
 0.52   0.020  0.039     14   0.0014 0.0028  LJ::load_userpics
Now, why are Apache::LiveJournal::journal_content and Apache::BML::handler reporting so much CPU usage? They hardly do anything. They do, however, call the XS-based gzip code, which I believe Devel::DProf can't trace, so that time gets reported in the caller instead. (bleh) So I guess we have to ignore both of those, which is 60% of the runtime.
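To illustrate what I mean by the XS attribution problem: memGzip is real Compress::Zlib API, but the surrounding sub here is a sketch, not our actual handler code. The profiler can't descend into the XS call, so all its CPU time lands on the Perl sub that made it:

```perl
# Sketch: why gzip time gets charged to the calling Perl sub.
use strict;
use warnings;
use Compress::Zlib qw(memGzip memGunzip);

sub send_compressed {
    my ($content) = @_;
    # memGzip is XS; Devel::DProf can't trace inside it, so its CPU
    # time shows up under send_compressed in the profile output.
    my $gz = memGzip($content);
    return $gz;
}

my $gz = send_compressed("x" x 10_000);
printf "%d bytes -> %d bytes\n", 10_000, length($gz);
```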
Unless: I wonder if a Debian package of the Compress::Zlib built for 686 would help at all. Anybody able to profile the difference for me, or give me pointers on the best Debian way to go about that?
Moving on: we can't get at the profiling of the BML::__ANON__ stuff (which is all BML pages). But we do have a special BML mode for profiling that compiles all BML pages to named subs instead of anonymous subs. That might prove interesting.
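The named-subs trick works roughly like this. The package and page names below are made up for illustration, not the real BML internals: instead of keeping compiled pages as anonymous coderefs, you install each one into the symbol table under a name derived from the page, so the profiler reports each page separately instead of lumping them all under __ANON__:

```perl
# Sketch: turn anonymous page coderefs into named subs so a profiler
# can tell them apart.  BML::Page and the page names are hypothetical.
use strict;
use warnings;

my %pages = (
    'lastn'   => sub { return "lastn page" },
    'friends' => sub { return "friends page" },
);

while (my ($name, $code) = each %pages) {
    no strict 'refs';
    # assign the coderef to a typeglob; it now profiles as
    # BML::Page::lastn rather than BML::__ANON__
    *{"BML::Page::$name"} = $code;
}

print BML::Page::lastn(), "\n";
```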
DBI stuff: ignore. I wasn't profiling with memcache. That stuff mostly goes away.
Apache::LiveJournal::db_logger/::trans: again, those just call a lot of XS stuff; they're not actually slow. (Improving ::trans might help, but db_logger is simple.)
load_comments.... it's big and perhaps CPU-heavy, but I don't see any obvious way to improve it, and it's fragile, so I don't really want to mess with it.
I think the big things to do here are to investigate how to make the gzip stuff faster, if possible, and to break up the BML reporting so we can see where we really need to work.
The good thing is that the HTML cleaner doesn't even show up in that list anymore.
Curious: why don't I see any S1 or S2 function calls in the dprofpp trace? (dprofpp -T shows no match for "create_view" or "LJ::S2", which it should.) Does Devel::DProf exclude certain things? I don't see the pattern. Oh, probably because all that S1/S2/"make_journal" stuff is called from a closure within Apache::LiveJournal::journal_content.... ah, hell. I think Devel::DProf is pretty much useless for us, then.
Update: Rebuilt libcompress-zlib-perl with apt-build to see if gcc could produce faster code when it doesn't have to be 386-safe. Benchmarks, unfortunately, are exactly the same.