I dunno... my journal doesn't seem to be indexed, when my friends' are, and I have bot blocking enabled and they don't. I wonder if blog search pays attention to the robots.txt on the journal itself...
People can use the syn console function that turns off full content.
If it's public, it will be indexed, by someone who respects the rules (there are really no "no bots" headers for RSS/ATOM, as far as I know). And people who don't respect the rules will index anyway.
Allowing people to turn off syndication is good. Allowing people to block bots is good. Pretending that they meant to turn off syndication when they ask to block bots is very bad.
Some sort of no-robot extension needs to be added to RSS and Atom in order to properly fix this bug. And I'm totally against any fix that pretends that bot blocking means no syndication. That's not a fix, that's purposely adding a bug.
Now there is a fix I would accept, though it would be kind of silly and poor. That would be to check the User-Agent field for known bots. LJ would then refuse to hand them the feed by returning the appropriate 'access denied' HTTP header.
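For what it's worth, the User-Agent check described above could look roughly like this. It's only a sketch: the token list and the function names are made up for illustration, a real bot list would be much longer, and nothing here reflects how LJ's code is actually structured.

```python
# Hypothetical list of crawler User-Agent substrings (illustrative, not exhaustive).
KNOWN_BOT_TOKENS = ("googlebot", "slurp", "msnbot")


def is_known_bot(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)


def feed_status(user_agent: str, block_bots: bool) -> int:
    """Pick the HTTP status for a feed request.

    403 Forbidden is the 'access denied' answer for known bots when the
    journal owner has bot blocking enabled; everyone else gets 200.
    """
    if block_bots and is_known_bot(user_agent):
        return 403
    return 200
```

The obvious weakness is that it only stops bots that identify themselves honestly, which is why it's a poor fix rather than a proper one.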
"Some sort of no-robot extension needs to be added to RSS and Atom in order to properly fix this bug."
...and then deployed on feeds other than just LJ's. And reconciled between Atom and RSS. And then search engines need to be updated to honor the spec. And all the other user agents have to honor the spec, too.
This is a nightmare, boil-the-ocean type of thing. Of course, I'm biased because I'd have to be spending time each day evangelizing it. :)
Remember that there can only be one /robots.txt URL per hostname -- so for each user for whom you wanted to block http://www.livejournal.com/users/examplusername/data/rss, you'd need one line in the global LiveJournal robots.txt. Clearly not feasible given the hundreds of thousands of active accounts.
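To make the scale problem concrete, here is a rough sketch of what generating that single global robots.txt would involve. The function name is invented for the example, and the path pattern is just the one from the URL above.

```python
def build_robots_txt(usernames):
    """Build one global robots.txt with a Disallow line per blocked user's RSS feed."""
    lines = ["User-agent: *"]
    for name in usernames:
        # One line per user, following the /users/<name>/data/rss pattern.
        lines.append(f"Disallow: /users/{name}/data/rss")
    return "\n".join(lines) + "\n"


# Hypothetical usernames, only to show how the file grows with the user count.
sample = build_robots_txt(f"user{i}" for i in range(100_000))
print(len(sample.splitlines()), "lines,", len(sample), "bytes")
```

Every crawler would have to download and parse the whole file on each refresh, and it would need regenerating whenever someone toggled the setting.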
There is a post indexed in Google's new service that I have marked as private, although it was public for a very short time. It's important to keep in mind that any post that has ever been made public can be indexed forever, regardless of the security settings put on the post in the future.
Indeed. All my public entries (most of which are complete and utter nonsense) are listed, despite my having the 'block robots' function enabled. This makes me, well...less than happy.
This is a little off-topic, but why do the Blog Search result links take you to a page that then does a refresh to the real page? Some inspection shows that the Location: header is there but it's being served with HTTP status code 200 rather than 301 or 302, so my browser is ignoring it. If it were sent with 301 or 302 (or one of the other, similar 3xx response codes), it would be much more useful to user-agents which aren't browsers, as well as being less lame in those that are.
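A tiny sketch of the difference, in case it helps. The function names are made up, and this says nothing about how Blog Search is actually implemented; it just shows why a client ignores Location on a 200.

```python
# Status codes for which a client is expected to follow the Location header.
REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def redirect_response(target_url: str, permanent: bool = False):
    """Build the (status, headers) pair for a real HTTP redirect."""
    status = 301 if permanent else 302
    return status, {"Location": target_url}


def is_followable_redirect(status: int, headers: dict) -> bool:
    """A client only follows Location when the status is a redirect code."""
    return status in REDIRECT_STATUSES and "Location" in headers
```

With a 200 status the same Location header is just extra metadata, so non-browser clients end up with the interstitial page instead of the target.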
Count me among the users who are really annoyed at this.
friends-only seems like the most viable solution.
Blocking RSS entirely seems excessive just to keep out spiders. I like rss, and am not a spider.
Some kind of meta tag semantic in rss, or making rss accessible only via username.livejournal.com/rss which could then be protected by username.livejournal.com/ROBOTS.TXT, seems like the only solution which is really decent in general though.
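Here's a rough sketch of how the per-user robots.txt idea would play out for a well-behaved crawler, using Python's standard `urllib.robotparser`. The robots.txt contents and the `ExampleBot` name are assumptions for the example, not anything LJ actually serves.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of username.livejournal.com/robots.txt for a user who blocks bots.
USER_ROBOTS_TXT = [
    "User-agent: *",
    "Disallow: /rss",
]


def may_fetch(user_agent: str, url: str) -> bool:
    """Check a URL against the per-user robots.txt the way a polite crawler would."""
    parser = RobotFileParser()
    parser.parse(USER_ROBOTS_TXT)
    return parser.can_fetch(user_agent, url)


feed_allowed = may_fetch("ExampleBot", "http://username.livejournal.com/rss")
page_allowed = may_fetch("ExampleBot", "http://username.livejournal.com/")
```

Feed readers don't consult robots.txt, so they would keep working, while crawlers that honor it would skip the feed.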
Even that solution isn't brilliant, because not all LiveJournal-based sites actually use the user vanity domain functionality. It would effectively disable the machine-readable data completely on those sites which don't.
I have both bot block enabled and my feed set to title only, and it still managed to grab a few of my entries - though, notably, nothing after last May, which I believe was when the synlevel option was added to the admin console.
It's really, really slimy of Google to do this before there is an acceptable, agreed-upon convention for RSS bot-blocking. Please feel free to let whoever made this decision know that I've got a lot of very unhappy people over here who'd like to talk to him or her. :/
Personally, when Brad implemented the bulk feed of all new posts and circumvented even the near-useless (but better than nothing!) title only RSS/Atom option, I decided that LJ just wasn't going to get it and that I'd have to go about setting up a community for people who want DMCA takedown notices sent to all the major search engines, including blog aggregator sites which provide search. Only reason I haven't done it yet is having better and more pressing things to do with my time. It's a crude and nasty hack but without a technical solution, it's the best there is.
It's already established that it's not acceptable to ignore robots.txt directives, so search engines using RSS or Atom as a technical means to circumvent robots directives in the HTML version are clearly acting improperly.
I still don't like the law as a hack, though, even if it is the best there is today.
Yes...what is going to be done to not only block but remove our comment from google when we had previously set our accounts in a manner to block spiders and such?
A lot of us weren't really aware of it until people on our friends list mentioned it. I am very pissed off about this and sent a complaint to Google. I don't know how the coding for LJ works. Is there any way for them to create an option where we can designate a post as "LiveJournal only"? Because I mean sometimes you just want to share something and not friends-lock it, but you don't want the entire free world to see it, you know?
September 14 2005, 17:57:26 UTC 6 years ago
September 14 2005, 17:58:16 UTC 6 years ago
September 14 2005, 18:04:27 UTC 6 years ago
September 14 2005, 21:13:02 UTC 6 years ago
In other words, wha?
September 14 2005, 18:07:17 UTC 6 years ago
September 14 2005, 18:12:44 UTC 6 years ago
September 14 2005, 18:18:13 UTC 6 years ago
September 15 2005, 04:52:05 UTC 6 years ago
September 14 2005, 18:25:56 UTC 6 years ago
I guess it just needs to be turned on for feeds?
September 14 2005, 18:42:20 UTC 6 years ago
/robots.txt
September 14 2005, 18:29:58 UTC 6 years ago
September 14 2005, 18:33:50 UTC 6 years ago
September 14 2005, 18:44:18 UTC 6 years ago
September 14 2005, 18:56:35 UTC 6 years ago
September 14 2005, 19:09:16 UTC 6 years ago
September 14 2005, 19:21:47 UTC 6 years ago
September 14 2005, 19:11:08 UTC 6 years ago
September 14 2005, 20:55:28 UTC 6 years ago
September 14 2005, 21:03:34 UTC 6 years ago
September 15 2005, 00:22:46 UTC 6 years ago
September 15 2005, 05:29:38 UTC 6 years ago
September 15 2005, 12:47:06 UTC 6 years ago
September 15 2005, 21:53:36 UTC 6 years ago
The Rules
Bots should follow the rules. http://en.wikipedia.org/wiki/Three_Laws
September 16 2005, 00:20:31 UTC 6 years ago
Re: The Rules
Just watch out for Asimov's Literary Doomsday Device
September 15 2005, 22:30:17 UTC 6 years ago
September 16 2005, 00:09:53 UTC 6 years ago