confuseme (confuseme) wrote in lj_dev,

crawling lj

So, I'm writing an lj crawler for a research project. I think it should be fairly polite, but I'd like to get some comments first to make sure I'm not missing anything. I don't want to hit the lj servers any harder than an ordinary user might, and I don't think I will. Let me know if you think otherwise, or if you have any suggestions. Here's how it works:

First, it hits the "random user" URL (http://www.livejournal.com/random.bml) and gets a username from the resulting redirect. It closes the connection as soon as it has the username, and never follows the redirect to request the actual user page.
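A minimal sketch of this step, not the actual crawler code. It assumes the redirect target looks like either http://www.livejournal.com/users/NAME/ or http://NAME.livejournal.com/ (the post does not say which), and the function names are made up for illustration. `http.client` never follows redirects on its own, so only the `Location` header is read before the connection is closed.

```python
import http.client
from urllib.parse import urlsplit


def username_from_location(location):
    """Pull a username out of a redirect target.

    Assumption: the target is http://www.livejournal.com/users/NAME/
    or http://NAME.livejournal.com/. Returns None for anything else.
    """
    parts = urlsplit(location)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) >= 2 and segments[0] == "users":
        return segments[1]
    host = parts.hostname or ""
    suffix = ".livejournal.com"
    if host.endswith(suffix):
        label = host[: -len(suffix)]
        if label and label != "www":
            return label
    return None


def fetch_random_username(host="www.livejournal.com", path="/random.bml"):
    """Request the random-user URL and read only the redirect header."""
    conn = http.client.HTTPConnection(host, timeout=30)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        location = response.getheader("Location")
    finally:
        conn.close()  # the redirect target itself is never requested
    return username_from_location(location) if location else None
```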

Then, it figures out the URL for the user's info page, requests that page, and parses it for some information about the user (Birthdate and Location). It also closes that connection as soon as it has the data it needs. If that data doesn't meet certain criteria, the crawler stops here.
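Roughly, the "stop reading as soon as the data is there" part could look like the sketch below. The markup it expects (a `Birthdate:` or `Location:` label, optionally followed by some tags, then the value) is an assumption about the info page, not something taken from the real HTML, and `extract_fields` is a hypothetical helper name.

```python
import re

# Assumed markup: "Label:" then optional tags, then the value, then a tag.
# Requiring the trailing "<" avoids accepting a value cut off mid-chunk.
FIELD_PATTERNS = {
    name: re.compile(name + r":\s*(?:<[^>]+>\s*)*([^<\s][^<]*)<")
    for name in ("Birthdate", "Location")
}


def extract_fields(chunks):
    """Consume text chunks until both fields are found, then stop."""
    buffer = ""
    found = {}
    for chunk in chunks:
        buffer += chunk
        for name, pattern in FIELD_PATTERNS.items():
            if name not in found:
                match = pattern.search(buffer)
                if match:
                    found[name] = match.group(1).strip()
        if len(found) == len(FIELD_PATTERNS):
            break  # the caller can close the connection at this point
    return found
```

The chunks would come from reading the HTTP response a block at a time; once `extract_fields` returns, the rest of the page is never downloaded.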

If the user does meet the criteria, it figures out the URL for the user's calendar page, requests it, and parses it to get a list of journal entry links.
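For the calendar step, a sketch along these lines would collect the entry links. The link pattern is passed in by the caller because the post does not say what entry URLs look like; the `/\d+\.html$` pattern in the usage note is only an assumed example.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect unique href values from <a> tags that match a pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.pattern.search(href) and href not in self.links:
            self.links.append(href)


def entry_links(html, pattern):
    """Return matching links in page order, without duplicates."""
    collector = LinkCollector(re.compile(pattern))
    collector.feed(html)
    return collector.links
```

Usage would be something like `entry_links(page_text, r"/\d+\.html$")`, with the pattern adjusted to whatever the calendar page actually links to.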

Finally, it requests the journal entries, one by one, with a pause between each request (I'm thinking 10 seconds or so; suggestions are welcome).
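The paced fetching could be sketched like this. `fetch_politely` is a hypothetical name, `fetch` stands in for whatever function actually downloads a page, and the 10-second default is the tentative figure from above rather than a recommendation.

```python
import time


def fetch_politely(urls, fetch, pause_seconds=10.0, sleep=time.sleep):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter makes it easy to try the loop without actually waiting, for example by handing in a function that just records the requested delays.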

I plan to run the script in a loop, with a significant pause between each run -- maybe something like 20 seconds.
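To put a number on the load, here is a back-of-the-envelope estimate. It assumes the random-user, info, and calendar requests happen back to back, that the only delays are the 10-second entry pause and the 20-second pause between runs, and that network time is zero, so the result is an upper bound rather than a measurement.

```python
def requests_per_minute(entry_count, entry_pause=10.0, run_pause=20.0):
    """Estimate the request rate for one run, ignoring network time.

    entry_count is None for a user who fails the criteria (two requests:
    random user and info page); otherwise it is the number of entries
    fetched after the random-user, info, and calendar requests.
    """
    if entry_count is None:
        requests, pauses = 2, 0
    else:
        requests = 3 + entry_count
        pauses = max(entry_count - 1, 0)  # pauses fall between entries
    elapsed = entry_pause * pauses + run_pause
    return 60.0 * requests / elapsed
```

Under those assumptions, a run that stops at the info page works out to 6 requests per minute (2 requests every 20 seconds).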

I don't think that should be particularly different from an ordinary user browsing lj. Is there anything I'm missing here?
