haha! (sweet_daddy) wrote in lj_dev,

Fix for random.bml problem

Greetings lj_developers.

I've just joined this community for the purpose of making the following post.

I recently opened a support ticket because I was finding that over a period of a few weeks, random.bml was showing me some of the same journals again and again.

In real life I build artificial life simulations, and so I have developed something of a keen eye for finding bugs in random algorithms, which can often appear to be working when they actually aren't doing exactly what you want. I was pretty sure something was wrong with the random dataset. In the ticket, isabeau posted an excellent summary of the algorithm for random.bml. Based on his description I was able to go into the public CVS browser and have a look at the algorithm for myself.

I believe that there is a weakness in the random picking of journals done by the build_randomuserset subroutine (in the stats.pl file). I understand the motivation for creating a table once a day of the IDs of 5000 journals that will be displayed by the random link that day. The problem is that the way the 5000 journals are selected, the 5000 will always tend to be drawn from a small subpool of the journals. Here is why:

The SQL query being used to create the 5000 is as follows:

   $dbh->do("REPLACE INTO randomuserset (userid) " .
            "SELECT uu.userid FROM userusage uu, user u " .
            "WHERE u.userid=uu.userid AND u.allow_infoshow='Y' " .
            "AND uu.timeupdate > DATE_SUB(NOW(), INTERVAL 1 DAY) LIMIT 5000");

What this query does (apart from filtering those journals not in the directory) is select the first 5000 journals in table order which have been updated within the last 24 hours. Given that your table order probably doesn't change from day to day, those regular posters who happen to fall early in the table order will tend to always be included in the 5000 journals available to random.bml. Those users who may be regular updaters but fall in the second 5000 or later will never be selected for the randomuserset table because they will be cut out by the LIMIT, and therefore will never be selected by random.bml.

Now, isabeau wisely said that this is not likely to be a priority for the development team, and that any solution would need to not increase the load on the database. I decided to look into the code myself. Actually, the fix is super-simple. What you really want are a random 5000 rows of the query you've already got, instead of the first 5000 rows. So the fix is to add three words to your query: an ORDER BY clause making it:

   $dbh->do("REPLACE INTO randomuserset (userid) " .
            "SELECT uu.userid FROM userusage uu, user u " .
            "WHERE u.userid=uu.userid AND u.allow_infoshow='Y' " .
            "AND uu.timeupdate > DATE_SUB(NOW(), INTERVAL 1 DAY) ORDER BY RAND() LIMIT 5000");

For a reference to the legitimacy of this approach see the rand() documentation here (seems to be down today, see it here in the google cache). Actually, this does incrementally increase the load on the database at the moment of populating the randomuserset table because MySql doesn't do a great job of optimizing this kind of ORDER BY clause, but it's a once-a-day job and shouldn't be horrible given your query.

I believe that this simple change will increase the goodness of the random.bml function by a huge amount, because suddenly a large number of journals will qualify for the random wheel that have never been available before through the random link. Although it may not be important from the point of view of the casual browser having something interesting to read, it may be important from the point of view of lj users wanting their journals to be casually read.

I hope you'll agree with me, and that someone will take the time to add 3 words to livejournal/bin/maint/stats.pl I'm happy to write more about this if I've been unclear. Keep up the good work with lj coding generally.

  • Post a new comment


    Anonymous comments are disabled in this journal

    default userpic

    Your reply will be screened

    Your IP address will be recorded