Evan Martin ([info]evan) wrote in [info]lj_dev,

feeds don't block robots

Now that blog search is getting some new press, I expect a lot more users to be unhappy about bug 2232.

  • 59 comments

[info]thehumangame

September 14 2005, 17:57:26 UTC 6 years ago

I dunno... my journal doesn't seem to be indexed, when my friends' are, and I have bot blocking enabled and they don't. I wonder if blog search pays attention to the robots.txt on the journal itself...

[info]thehumangame

September 14 2005, 17:58:16 UTC 6 years ago

Meh, scratch that. It is indexed, just sporadically so. I guess this service isn't really ready yet.

[info]marksmith

September 14 2005, 18:04:27 UTC 6 years ago

People can use the syn console function that turns off full content.

If it's public, it will be indexed, by someone who respects the rules (there are really no "no bots" headers for RSS/ATOM, as far as I know). And people who don't respect the rules will index anyway.

[info]flying_squirrel

September 14 2005, 21:13:02 UTC 6 years ago

Is this "syn function" undocumented, or am I looking in the wrong place?

In other words, wha?

[info]ruakh

6 years ago

[info]ruakh

6 years ago

[info]njyoder

6 years ago

[info]feignedapathy

September 14 2005, 18:07:17 UTC 6 years ago

Yeah, noticed this. Whoops.

[info]xfyre

September 14 2005, 18:12:44 UTC 6 years ago

And they really are, in the Russian segment also.

[info]omnifarious

September 14 2005, 18:18:13 UTC 6 years ago

Allowing people to turn off syndication is good. Allowing people to block bots is good. Pretending that they meant to turn off syndication when they ask to block bots is very bad.

Some sort of no-robot extension needs to be added to RSS and Atom in order to properly fix this bug. And I'm totally against any fix that pretends that bot blocking means no syndication. That's not a fix, that's purposely adding a bug.

Now there is a fix I would accept, though it would be kind of silly and poor. That would be to check the User-Agent field for known bots. LJ would then refuse to hand them the feed by returning the appropriate 'access denied' HTTP header.
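A minimal sketch of that User-Agent check, assuming a hypothetical per-journal `blocks_robots` flag and an illustrative list of bot markers (neither is taken from LiveJournal's actual code). It only shows the decision itself: a known bot asking for the feed of a bot-blocking journal gets a 403, and everyone else gets the feed.

```python
# Illustrative substrings only; a real deployment would need a maintained list.
KNOWN_BOT_MARKERS = ("googlebot", "examplebot")


def feed_status_for(user_agent, blocks_robots):
    """Return the HTTP status code to use for a feed request.

    user_agent    -- value of the User-Agent request header (may be None)
    blocks_robots -- hypothetical flag: the journal owner enabled bot blocking
    """
    ua = (user_agent or "").lower()
    if blocks_robots and any(marker in ua for marker in KNOWN_BOT_MARKERS):
        return 403  # Forbidden: refuse to hand the feed to a known bot
    return 200  # Serve the feed normally
```

As the comment says, this is a weak fix: it depends on bots identifying themselves honestly and on the list being kept current.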

[info]anildash

September 15 2005, 04:52:05 UTC 6 years ago

"Some sort of no-robot extension needs to be added to RSS and Atom in order to properly fix this bug."

...and then deployed on feeds other than just LJ's. And reconciled between Atom and RSS. And then search engines need to be updated to honor the spec. And all the other user agents have to honor the spec, too.

This is nightmare, boil-the-ocean type stuff. Of course, I'm biased because I'd have to be spending time each day evangelizing it. :)

[info]jamesd

6 years ago

[info]thehumangame

September 14 2005, 18:25:56 UTC 6 years ago

After a bit of investigation, I found that Google Blogsearch implies that it will respect robots.txt.

I guess it just needs to be turned on for feeds?

[info]pne

September 14 2005, 18:42:20 UTC 6 years ago

/robots.txt

Remember that there can only be one /robots.txt URL per hostname -- so for each user for whom you wanted to block http://www.livejournal.com/users/examplusername/data/rss, you'd need one line in the global LiveJournal robots.txt. Clearly not feasible given the hundreds of thousands of active accounts.
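To illustrate the scaling problem, here is a rough sketch that builds such a global robots.txt and checks it with Python's standard `urllib.robotparser`. The usernames and the helper `global_robots_txt` are made up for the example; only the `/users/<name>/data/` path shape comes from the URL above.

```python
from urllib.robotparser import RobotFileParser


def global_robots_txt(blocked_usernames):
    """Build one site-wide robots.txt with a Disallow line per blocked user."""
    lines = ["User-agent: *"]
    for name in blocked_usernames:
        lines.append(f"Disallow: /users/{name}/data/")
    return "\n".join(lines) + "\n"


# Two hypothetical users who enabled bot blocking.
text = global_robots_txt(["alice", "bob"])

parser = RobotFileParser()
parser.parse(text.splitlines())

blocked = parser.can_fetch("AnyBot", "http://www.livejournal.com/users/alice/data/rss")
allowed = parser.can_fetch("AnyBot", "http://www.livejournal.com/users/carol/data/rss")
```

The file grows by one line per user who opts out, which is where the feasibility concern comes from.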

[info]mart

6 years ago

[info]mendel

6 years ago

[info]chris21718

September 14 2005, 18:29:58 UTC 6 years ago

There is a post indexed in Google's new service that I have marked as private, although it was public for a very short time. It's important to keep in mind that any post that has ever been made public can be indexed forever, regardless of the security settings put on the post in the future.

[info]wtf

September 14 2005, 18:33:50 UTC 6 years ago

Indeed. All my public entries (most of which are complete and utter nonsense) are listed, despite my having the 'block robots' function enabled. This makes me, well...less than happy.

[info]purly

September 14 2005, 18:44:18 UTC 6 years ago

not cool.

[info]mart

September 14 2005, 18:56:35 UTC 6 years ago

This is a little off-topic, but why do the Blog Search result links take you to a page that then does a refresh to the real page? Some inspection shows that the Location: header is there, but it's being served with HTTP status code 200 rather than 301 or 302, so my browser is ignoring it. If it's sent with 301 or 302 (or one of the other, similar 3xx response codes) it'll be much more useful to user-agents which aren't browsers as well as being less lame in those that are.
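A small sketch of why the status code matters, written as a hypothetical helper rather than any real client's code: a typical client treats Location as a redirect target only when the status is one of the redirect codes.

```python
# Status codes on which clients conventionally follow the Location header.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(status, headers):
    """Return the URL a client would follow, or None if it would not redirect.

    status  -- numeric HTTP status code of the response
    headers -- dict of response headers
    """
    if status in REDIRECT_STATUSES:
        return headers.get("Location")
    return None  # e.g. a 200 response: Location is ignored for redirection
```

With a 200 response the helper returns None even though Location is present, which matches the behaviour described above.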

[info]octal

September 14 2005, 19:09:16 UTC 6 years ago

Count me among the users who are really annoyed at this.

friends-only seems like the most viable solution.

Blocking RSS entirely seems excessive just to keep out spiders. I like rss, and am not a spider.

Some kind of meta tag semantic in rss, or making rss accessible only via username.livejournal.com/rss which could then be protected by username.livejournal.com/ROBOTS.TXT, seems like the only solution which is really decent in general though.
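For what the per-user variant might look like, here is a sketch assuming the feed lived at `/rss` on the user's own hostname and that the journal's robots.txt held a single Disallow rule. Both are assumptions for illustration, checked with Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt served at username.livejournal.com/robots.txt
PER_USER_ROBOTS = """\
User-agent: *
Disallow: /rss
"""

parser = RobotFileParser()
parser.parse(PER_USER_ROBOTS.splitlines())

# A well-behaved spider would skip the feed but could still fetch other paths.
feed_ok = parser.can_fetch("AnyBot", "http://username.livejournal.com/rss")
home_ok = parser.can_fetch("AnyBot", "http://username.livejournal.com/")
```

Here each journal carries its own short file, so nothing has to be listed in a site-wide robots.txt.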

[info]mart

September 14 2005, 19:21:47 UTC 6 years ago

Even that solution isn't brilliant, because not all LiveJournal-based sites actually use the user vanity domain functionality. It would effectively disable the machine-readable data completely on those sites that don't.

[info]octal

6 years ago

[info]heydusty

6 years ago

[info]anildash

6 years ago

[info]jamesd

6 years ago

[info]evan

6 years ago

[info]octal

6 years ago

[info]jamesd

6 years ago

[info]ladysorka

September 14 2005, 19:11:08 UTC 6 years ago

I have both bot block enabled and my feed set to title only, and it still managed to grab a few of my entries - though, notably, nothing after last May, which I believe was when the synlevel option was added to the admin console.

[info]rahaeli

September 14 2005, 20:55:28 UTC 6 years ago

It's really, really slimy of Google to do this before there is an acceptable, agreed-upon convention for RSS bot-blocking. Please feel free to let whoever made this decision know that I've got a lot of very unhappy people over here who'd like to talk to him or her. :/

[info]evan

September 14 2005, 21:03:34 UTC 6 years ago

technorati.com, icerocket.com, feedster.com

[info]octal

6 years ago

[info]mart

6 years ago

[info]octal

6 years ago

[info]midendian

6 years ago

[info]ruakh

6 years ago

[info]anildash

6 years ago

[info]emarkienna

6 years ago

[info]abates

September 15 2005, 00:22:46 UTC 6 years ago

This may require detecting hits on RSS feeds from GoogleBot and giving it a blank feed. hmmm.

[info]jamesd

September 15 2005, 05:29:38 UTC 6 years ago

Personally, when Brad implemented the bulk feed of all new posts and circumvented even the near-useless (but better than nothing!) title-only RSS/Atom option, I decided that LJ just wasn't going to get it and that I'd have to go about setting up a community for people who want DMCA takedown notices sent to all the major search engines, including blog aggregator sites which provide search. The only reason I haven't done it yet is having better and more pressing things to do with my time. It's a crude and nasty hack, but without a technical solution, it's the best there is.

It's already established that it's not acceptable to ignore robots.txt directives, so search engines using RSS or Atom as a technical means to circumvent robots directives in the HTML version are clearly acting improperly.

I still don't like the law as a hack, though, even if it is the best there is today.

[info]rho

September 15 2005, 12:47:06 UTC 6 years ago

I thought that "set latest_optout yes" in the admin console worked to remove oneself from the latest posts bulk feed?

[info]jamesd

6 years ago

[info]rabababa

September 15 2005, 21:53:36 UTC 6 years ago

The Rules

Bots should follow the rules.
http://en.wikipedia.org/wiki/Three_Laws_of_Robotics

[info]bitxdeadweight

September 16 2005, 00:20:31 UTC 6 years ago

Re: The Rules

Just watch out for Asimov's Literary Doomsday Device

[info]macaholic

September 15 2005, 22:30:17 UTC 6 years ago

Yes...what is going to be done to not only block but remove our comment from Google when we had previously set our accounts to block spiders and such?

[info]secretbutterfly

September 16 2005, 00:09:53 UTC 6 years ago

A lot of us weren't really aware of it until people on our friends list mentioned it. I am very pissed off about this and sent a complaint to Google. I don't know how the coding for LJ works. Is there any way for them to create an option where we can designate a post as "Livejournal only"? Because I mean sometimes you just want to share something and not friendslock it, but you don't want the entire free world to see it, you know?