Brad Fitzpatrick (bradfitz) wrote in lj_dev,

Similar Interest Magic Index

If you have a paid account, you've probably played with the newly reinstated similar interest user search page.

A pre-warning about the Magic Index it uses to sort: not a ton of thought has gone into the constants in there.

I just wanted to get it live. I tweaked the numbers a bit until the results got better for a number of users, but it's far from perfect.

I have 18 hours of airports and flying tomorrow, so maybe I'll do the math and figure out some better constants.

Basically, the root of the magic is:

$magic{$_} = $pt_weight{$_}*20 + $pt_count{$_};

Here $pt_count gets one point per shared interest, and $pt_weight gets (1 / total users interested in that thing) points per shared interest, so a match on a rare interest counts for more than a match on a popular one.
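To make the arithmetic concrete, here is a minimal sketch of that formula in Python rather than the Perl of interests.bml. The function name, argument names, and sample data are hypothetical and are not taken from the real code; only the formula itself and the constant 20 come from the line above.

```python
from collections import defaultdict

def magic_index(my_interests, users_by_interest, total_users_by_interest,
                weight_factor=20):
    """Score candidate users with pt_weight * 20 + pt_count.

    users_by_interest: interest -> candidate users who share it (hypothetical input)
    total_users_by_interest: interest -> total number of users with that interest
    """
    pt_count = defaultdict(int)     # one point per shared interest
    pt_weight = defaultdict(float)  # 1 / popularity per shared interest
    for interest in my_interests:
        total = total_users_by_interest[interest]
        for user in users_by_interest.get(interest, ()):
            pt_count[user] += 1
            pt_weight[user] += 1.0 / total
    return {u: pt_weight[u] * weight_factor + pt_count[u] for u in pt_count}

# Made-up data: "perl" is rare (50 users), "music" is popular (10,000 users).
scores = magic_index(
    ["perl", "music"],
    {"perl": ["alice", "carol"], "music": ["alice", "bob"]},
    {"perl": 50, "music": 10000},
)
for user, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(user, round(score, 3))
```

With these made-up numbers, carol (one rare match) scores 1.4 and bob (one popular match) scores 1.002, so the weight term only breaks ties between users with the same match count unless the factor is much larger than 20 or the interest is very rare.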

I'd be interested to hear thoughts from alanj, toast, evan, and metadaisy in particular.

Relevant code is here:
http://cvs.livejournal.org/browse.cgi/~checkout~/livejournal/htdocs/interests.bml?rev=1.18

Search for "findsim" and read down from there, until the string "Magic Index?".

The reason this feature can come back is that the query time is bounded. We iterate over a user's (up to 150) interests and, for each one, query a few hundred random users with that interest. We don't pull them all in because then it wouldn't be bounded and, once again, it wouldn't scale. Remember: you can only have 150 interests, but nothing stops all 512,000 LJ users from being interested in "sex". Besides, it's useless to pull all of that in, since we shouldn't give that interest match much weight anyway.

Doing it "correctly" isn't possible in a reasonable amount of time. It could be a directory search filter, though, which has that HTTP recheck thing and the check that only one query is going on at a time... but then the dirsearchres2 format would need a header saying which data format it is, and one new format would have to include weights so the directory.bml UI could show them.
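The bounded-query idea can be sketched like this, in Python with an in-memory SQLite table instead of the real Perl and database. The table name userinterests, its columns, and the value 300 for "a few hundred" are assumptions made for illustration, and a bare LIMIT returns arbitrary rows rather than truly random ones.

```python
import sqlite3

MAX_INTERESTS = 150       # per-user cap stated in the post
USERS_PER_INTEREST = 300  # assumed value for "a few hundred"

def candidate_matches(db, userid, per_interest=USERS_PER_INTEREST):
    """Return {other_userid: [shared interest ids]} with bounded work.

    At most MAX_INTERESTS * per_interest rows are fetched, no matter
    how many users share a popular interest.
    """
    interests = [row[0] for row in db.execute(
        "SELECT intid FROM userinterests WHERE userid = ? LIMIT ?",
        (userid, MAX_INTERESTS))]
    matches = {}
    for intid in interests:
        # The LIMIT is the trick: a hugely popular interest costs no more
        # than a rare one.
        rows = db.execute(
            "SELECT userid FROM userinterests"
            " WHERE intid = ? AND userid != ? LIMIT ?",
            (intid, userid, per_interest))
        for (other,) in rows:
            matches.setdefault(other, []).append(intid)
    return matches
```

The worst case is 150 interests times a few hundred rows each, so the cost stays fixed however popular any single interest gets.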

But blah. Later, perhaps. This was a 15 minute hack this morning once I realized the trick was the LIMIT clause.

This post is to solicit weighting change suggestions only, not redoing the whole algorithm.