Brad Fitzpatrick ([info]bradfitz) wrote in [info]lj_dev,

Google Blog Search -- relax, yo

I got this email from Google:
Hey Brad,

EvanM passed a note our way from LJ Tech Support, regarding Blog Search and its accidental indexing of "noindex" LJ content. Just wondering if you guys could let your users know that this was entirely unintentional, and a fix should go live within the next day or two? (hopefully tomorrow)

Thanks,
E
So y'all can relax.

(I'm also talking to them about RSS/Atom specs for indicating noindex so they don't have to hit up HTML to learn about it.)

And please, people, stop spreading paranoia: they're not using RSS as a "workaround" to not obey robots.txt and noindex... that's just silly on so many levels.

Remember the golden rule on the Internets:
Never attribute to malice what can be adequately explained by stupidity.
... or in this case, an accident.

  • 63 comments

[info]halfawake

September 16 2005, 04:20:29 UTC 6 years ago

Neat, thanks for letting us know about this. Might want to have someone mention it in [info]news as well.

[info]zach

September 16 2005, 04:20:54 UTC 6 years ago

Sweet deal.
Not that I complained in the first place. I welcome all robots to my journal. =)

[info]idigital

September 16 2005, 08:22:53 UTC 6 years ago

I amz az robotz. Myz Prime Directivez are to Indexz your LivezJournalz.

[info]burr86

September 16 2005, 04:26:16 UTC 6 years ago

Oh, wow, thanks! *goes back to the support board to reassure a bunch of people* :)

[info]azurelunatic

September 16 2005, 04:46:38 UTC 6 years ago

Good to hear.

[info]jay

September 16 2005, 04:53:02 UTC 6 years ago

neat

It's nice to know that a press conference and involvement from grass roots organizations was not required.

[info]zach

September 16 2005, 05:56:37 UTC 6 years ago

Is it just me, or have some of the comments on this entry kept disappearing and reappearing?

[info]azurelunatic

September 16 2005, 08:47:02 UTC 6 years ago

Is the content inane, off-topic, blatantly offensive and/or obscene?

[info]mart

September 16 2005, 06:15:35 UTC 6 years ago

This seems like a good opportunity to solve this properly with HTTP headers:

X-Robot-Prefs: noindex, nofollow
X-Robot-See: /users/mart/robots.txt

Aside from the obvious benefit that it can then apply to any media type including images, having it “out of band” means that the code to handle it can be centralised to LJ::make_journal rather than duplicating it in S1, S2 and talkread.bml. Still needs to go in a few awkward BML pages, but it's still a win. (Of course, the old robots blocking will no doubt have to stay where it is for the benefit of those mythical “other search engines” I've heard about.)

If you just come up with some half-baked solution specific to RSS and Atom we'll be doing this dance again soon enough. For the people who are using stunted webservers and can't set such things, the problem for the Atom/RSS folks then becomes a way to do http-equiv like HTML does, allowing these header fields to be embedded into the document. That doesn't have to be LJ's problem, though.
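
To make the header idea above concrete, here is a minimal sketch of how a crawler might read it. It assumes the `X-Robot-Prefs` header exactly as proposed in this comment (it is a suggestion, not an existing standard), and the function names `parse_robot_prefs`, `may_index` and `may_follow` are hypothetical.

```python
def parse_robot_prefs(headers):
    """Return the set of lowercase directives found in X-Robot-Prefs.

    `headers` is a plain dict of response headers. The header name is the
    one proposed in the comment above, not a registered HTTP header.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    raw = normalised.get("x-robot-prefs", "")
    # Directives are comma-separated; ignore empty tokens and extra spaces.
    return {token.strip().lower() for token in raw.split(",") if token.strip()}


def may_index(headers):
    """True unless the response asks robots not to index it."""
    return "noindex" not in parse_robot_prefs(headers)


def may_follow(headers):
    """True unless the response asks robots not to follow its links."""
    return "nofollow" not in parse_robot_prefs(headers)
```

Because the check only looks at response headers, the same function works for feeds, images and HTML alike, which is the "out of band" benefit described above.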

[info]nikolasco

September 16 2005, 06:33:29 UTC 6 years ago

Out of curiosity, why the robots.txt + X-robot approach? For specific user agents or wildcards? I was thinking of handling such things at the server level (e.g. mod_xrobot reads robots.txt).

My other thought of the day is the need for something more specific than whole document inclusion/exclusion, in light of aggregations like atom-stream.xml. I like the idea of an XML attribute. For example:
<feed xmlns='http://www.w3.org/2005/Atom' r:index="no" xmlns:r="http://namespace/robot/">
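
A rough sketch of how a feed consumer could honour such an attribute, assuming the placeholder namespace `http://namespace/robot/` from the example above. Letting an individual `<entry>` override the feed-level value is an extra assumption made here to cover aggregated feeds; the function name `indexable_entries` is hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ROBOT_NS = "http://namespace/robot/"  # placeholder namespace from the example


def indexable_entries(feed_xml):
    """Return the Atom ids of entries a robot would be allowed to index.

    The feed-level r:index value is the default; an entry may carry its
    own r:index to override it (an assumption, not part of any spec).
    """
    root = ET.fromstring(feed_xml)
    # ElementTree exposes namespaced attributes as "{namespace}name".
    feed_default = root.get(f"{{{ROBOT_NS}}}index", "yes")
    allowed = []
    for entry in root.findall(f"{{{ATOM_NS}}}entry"):
        setting = entry.get(f"{{{ROBOT_NS}}}index", feed_default)
        if setting.lower() != "no":
            allowed.append(entry.findtext(f"{{{ATOM_NS}}}id"))
    return allowed
```

With a per-entry override, an aggregate feed can mix indexable and non-indexable entries without the consumer having to fetch each journal's HTML.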

[info]mart

September 16 2005, 07:07:50 UTC 6 years ago

Robots are user-agents too

[info]zach

September 16 2005, 07:18:14 UTC 6 years ago

Aw poor robot...

[info]gizmometer

September 16 2005, 12:23:22 UTC 6 years ago

Oh, yay. :)

[info]chgowiz

September 16 2005, 14:38:00 UTC 6 years ago

I'll believe it when I see it - I think there'll be a few growing pains.

[info]isidorenabi

September 16 2005, 18:16:04 UTC 6 years ago

seems to be OK now (for me at least) - yesterday when i searched on my username i got hits from my own lj, now i only get hits from other people who've used my username in their posts.

[info]njyoder

September 16 2005, 20:40:54 UTC 6 years ago

Yes, I checked too and all the hits from my journal are gone now. Google cleared up the problem quickly. What I'm wondering is, how could Google make such an error in the first place? You'd think their blog search indexing code would use the same code that checks the meta tags as their general search engine.

[info]metaphorge

September 16 2005, 22:04:27 UTC 6 years ago

IMHO LiveJournal should not block search spiders for public entries at all. Such moves largely defeat the point of the Internet.

If someone doesn't want their entry to be accessible, it should not be public. Period.

[info]njyoder

September 16 2005, 22:20:41 UTC 6 years ago

That's totally illogical. From a benefit/drawback standpoint, your proposal has one extra drawback and absolutely no benefits. If people were forced to do that, then they'd just make their entries FO, which would defeat the whole purpose. In that case, no humans other than people they've friended can read it. In the case where they CAN block spiders, then you have the added benefit of allowing any humans to read it.

Your whole "point of the internet" emotive argument is stupid anyway. The internet serves many purposes and one of them is not to have 100% of information accessible via search engines. Do you want your medical records accessible via search engines? No? What's wrong? I thought the point of the internet was to have everything accessible via a search engine.

[info]elementa

September 17 2005, 04:03:35 UTC 6 years ago

as of 10 minutes ago, I have hits on many of my public entries AND one of my friends-protected entries which includes text of the entry.

any idea on when this will change and why it happened?

[info]7rin

September 17 2005, 05:49:08 UTC 6 years ago

Cor... you got a personal email from Google! Wow Brad, you must be like, famous, or something.

[info]tuscendi

September 17 2005, 23:53:33 UTC 6 years ago

As of right now, two days later, a bunch of my LJ posts (and my LJ journal is Friends Only by default) are fully available on Blog Search.

I know I'm an ignoramus about these things, but I'm still shocked that our journal cannot be protected from the unscrupulosity or incompetence of the people who run the search engines. I had assumed our privacy was fully protected by Live Journal which I trust.

[info]hilltop

September 19 2005, 19:55:48 UTC 6 years ago

It also indexes on the Journal Title

My journal has the Title of Unquiet Ether.
Do a search on that, and I'm turning up all over the place, although Hilltop isn't.

[info]f_l_i_r_t

September 19 2005, 22:13:29 UTC 6 years ago

I agree with other people when they say we cannot believe that these journals cannot be protected better. You tell us to stop overreacting, but it has been over a week and all our sites are still cached and showing on Google Blog Search. I am not impressed.

Time to ditch Live Journal? I and a bunch of my friends are all feeling the same. I think this really needs to be addressed in a more serious fashion and not fobbed off as 'hysterical users'.

When friends only posts still show on the search there is an issue with your security code, no?

[info]surrealist_post

September 22 2005, 21:33:26 UTC 6 years ago

This is unrelated but.. how do I hide the new 'schools' identifier listed in the bio page? I have absolutely no use for it and don't want it there, but I can't seem to find a 'hide' option for it. I assumed that it would be like memories, if you have none, it doesn't show, but it does show, and I'd like it gone. Thanks.

[info]prissi

September 24 2005, 06:41:21 UTC 6 years ago

You currently can't hide the 'schools' section of your bio page.

If you want to suggest that this feature be added to LiveJournal, you can offer it up at [info]suggestions. FAQ on suggestions here.

[info]imfallingup

September 29 2005, 02:14:22 UTC 6 years ago

any more word on this? i'm still pulling up a friendslocked entry right now...