First of all: the goal of the research is to determine group sentiment toward particular topics. We will make a single crawl over LJ and then use the data to produce aggregate results. Details of individual users will not appear in the report.
For this project we plan to seed our crawler with usernames and community names found in Google searches on a particular topic. We will then crawl over community members and the friends of users whose robots.txt files don't exclude crawlers. From each user we are considering pulling only the RSS feed, or possibly entries from up to a year back.
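Concretely, the robots.txt check plus RSS pull would look something like this. A minimal Python sketch, stdlib only; the username.livejournal.com/data/rss feed URL and the user-agent string are my own assumptions, not anything official:

```python
import urllib.robotparser
import urllib.request

# Hypothetical user-agent string for this project.
USER_AGENT = "lj-sentiment-crawler/0.1"

def fetch_rss_if_allowed(username):
    # Assumed URL layout: journals live at username.livejournal.com.
    base = "http://%s.livejournal.com" % username
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(base + "/robots.txt")
    try:
        rp.read()
    except Exception:
        return None  # conservatively treat an unreadable robots.txt as a "no"
    feed_url = base + "/data/rss"  # assumed RSS location
    if not rp.can_fetch(USER_AGENT, feed_url):
        return None  # user excludes crawlers; skip them
    req = urllib.request.Request(feed_url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```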
I have a few questions, though:
- What about pulling user comments? Would that be ok?
- How do I tell whether a community allows robots, since communities don't have their own subdomain with a robots.txt file?
- At what rate (requests/sec) should I run my crawler for it to be "nice"? (I've sketched a configurable delay below, after this list, so I can plug in whatever number people suggest.)
- Is anyone aware of similar previous work?
- For users that exclude robots: would it be appropriate to send them a message via their LiveJournal email address asking whether they would like to be included in the research?
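On the rate question, here is the shape of what I have in mind: a single global politeness delay with the requests/sec value left as a knob. The 0.2 req/sec default (one request every five seconds) is just my placeholder guess, not a recommendation from anywhere:

```python
import time

class PoliteFetcher:
    def __init__(self, requests_per_sec=0.2):
        # Placeholder rate; I'd set this to whatever value people suggest.
        self.min_interval = 1.0 / requests_per_sec
        self.last_request = 0.0

    def wait(self):
        # Sleep long enough that consecutive requests are at least
        # min_interval seconds apart, regardless of how long the
        # crawler spent processing in between.
        elapsed = time.time() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.time()
```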
Again, this is strictly not-for-profit, aggregate data collection for research purposes only. I would appreciate any comments.