Peter (digipete) wrote in lj_dev,
Peter
digipete
lj_dev

user-Interest searching

LiveJournal is my first SQL experience so I'm no DB expert, but I think I have a way to fix user-interest searching to allow multiple search terms and interest-matching between users without killing the site.

First let me say that interest-matching between users and searching by multiple interests is pretty much the same thing. The only difference is that most searches entered by users will have 2 or 3 keywords where user-interest matching looks at all the interests a user has. Both these problems can be addressed as one.

Here is my idea...

Basically I propose we do a more intelligent search in two steps. First we search for obscure interests, then we check to see if those users also have the more common interests.

The reason interest searching is so heavy on the DB is that there are a bunch of general interests which everyone has, like music and movies. So if I do a search on "music AND polka" MySql will look at all 80,000 users who list music as an interest and all 40 users who list polka for a total of 80,040 reads. Do you see the lunacy here? The good news is that we know how many users are attached to an interest before we start our search, thanks to the interests table. With this knowledge we can make two DB calls- the first returns all users who list polka as an interest, then looks at those 40 users to see which of them also list music. We don't even have to look at all the interests for those 40 users because the table is indexed in such a way that we only need to see if they list music or not. Now we've only done 80 reads.

So what do we do if someone does a search on "music AND movies"? Well first we need to set a MAX_USERS variable, which is the maximum number of users we will allow to be returned on our first search. For this demonstration, let's make it 500. When we see that there are 80,000 and 50,000 users attached to each interest respectively, we can grab the first 500 users that like movies, then see which of them like music. The result will read "Your search yielded too many results, only the first xxx users are shown."

Sometimes we don't what to see the users who have all the interests of a given set, but instead want to see who has the most of a set (aka. interest matching aka. a ranked 'or' search). Let's say we search for "music OR polka OR atari OR laundry OR homer OR pikachu OR cake OR blonde weasels" The number of users for each of these interests is 80000, 40, 350, 60, 110, 186, 1847, 1. Let's be smart and concatenate some of the more obscure interests. Given that MAX_USERS is 500, our first DB call will return a user list for everyone who has the interests blonde weasels, polka, laundry, homer, or pikachu because 1+40+60+110+186=397. If we added atari, we would be pushed over the max to 747. In our second DB call we take our 397 users and see how many interests out of the full list each of them has.

Issues:
  • As you may have figured out, the ranked OR search doesn't return a fully correct result. The only way to do that is to search almost the whole userinterests table, which the old search did and is the reason it was taken down. I think my search would be just as useful because it's based on users who share some of the most obscure interests, which to me is more intriguing.


Implementation:
  • I'm not sure whether it is better to store the results from the first SELECT in a temporary MySQL table or in a Perl hash. My hunch is the SQL table, but I'd like someone with more DB knowledge to verify this.
  • On OR searches, it is better to retrieve userID, username, journal_name, journal_type, and statusvis on the first pass to eliminate any statusvis!='v' before the second pass. Then just use userIDs in the second pass. On AND searches, more people will probably be eliminated by the second pass than by their statusvis flag, so it's probably best to get all that info as part of that second call. This sound right?
  • I would break large single-interest searches into multiple pages, but large AND searches don't page very well.
  • I may be missing something...


Sound reasonable?
Subscribe
  • Post a new comment

    Error

    Anonymous comments are disabled in this journal

    default userpic

    Your reply will be screened

    Your IP address will be recorded 

  • 7 comments