This repository has been archived by the owner on Mar 24, 2021. It is now read-only.

Narrow site results in Google #22

Open

gwern opened this issue Dec 15, 2011 · 4 comments

Comments

@gwern
Contributor

gwern commented Dec 15, 2011

I was quickly pasting a link summarizing existing predictions for a fine point in Methods of Rationality (https://encrypted.google.com/search?num=100&q=hat%20and%20cloak%20site%3Apredictionbook.com) and I noticed, not for the first time, that the actual prediction was being drowned out by the user pages.

Inasmuch as you can get to any relevant user page from the actual prediction, I think they're noise.

Fortunately, there's a very easy solution - we can just exclude /users/ in robots.txt. However, in keeping with my usual long-term interest, I still want the Internet Archive to have access; since a crawler obeys the most specific User-agent record that matches it, ia_archiver will follow only its own Allow block and ignore the catch-all Disallow, so I think we can combine the two like so (I wrote it up based on the Wikipedia info and http://www.archive.org/about/exclude.php, and then found a similar question at http://www.archive.org/post/234741/robotstxt ):

User-agent: ia_archiver
Allow: /users/

User-agent: *
Disallow: /users/

The downside of this approach is that someone searching for a particular user would not see their user-page pop up, but rather every prediction they've been involved in. This isn't a disadvantage for me, but it may be for others.

A less invasive but also less useful approach would be to augment the built-in Google site search with -site:predictionbook.com/users/. (Not as useful for me, since I rarely launch my PB site searches from that dialogue, but from outside Firefox entirely.)

Anyway, I'm going to patch robots.txt as suggested above.

@ivankozik

I think I would prefer the less-invasive approach, given how robots.txt is kind of a nuclear option that affects all crawlers (except the one you exempt, of course).

This method of exclusion works:

hat and cloak site:predictionbook.com -inurl:predictionbook.com/users/

And you can make this useful for yourself too by changing your bookmark keyword search / Chrome search, no?
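For example, a keyword bookmark or custom search engine along these lines should work (assuming the usual %s placeholder; the URL is just my guess at wiring up the query above):

https://encrypted.google.com/search?num=100&q=%s+site:predictionbook.com+-inurl:predictionbook.com/users/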

@gwern
Contributor Author

gwern commented Dec 16, 2011

Well, yes, I know how to exclude the Google hits (I gave a better way above); my point was about defaults. Is there a good reason to expose user pages to search engines when we know for a fact that every prediction will generate N spurious hits where N is the number of users commenting or predicting on it?

@ivankozik

Neat, I didn't know -site:url works too.

User pages are sometimes interesting, and given how Google is smart enough to rank them below prediction pages, I don't think they're a big problem. robots.txt also affects things like wget and HTTrack, and making them ignore robots.txt is annoying. (And if robots.txt later lists resources that should really be ignored, those robots.txt-ignoring users will be grabbing the really-useless resources.)
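For reference, making wget ignore robots.txt takes an extra flag on every invocation, something like:

wget -e robots=off --mirror http://predictionbook.com/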

@gwern
Contributor Author

gwern commented Dec 16, 2011

Alright, then what about just filtering it for the big 3 or 4 search engines? That'll help 99% of users and avoid hitting tools like wget.
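Something like this, say (the user-agent tokens are my best guess at the big engines' crawlers, i.e. Googlebot, bingbot, and Yahoo's Slurp; double-check them before deploying):

User-agent: Googlebot
Disallow: /users/

User-agent: bingbot
Disallow: /users/

User-agent: Slurp
Disallow: /users/

With no catch-all User-agent: * record, wget, HTTrack, and ia_archiver would be left alone.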

keyist added a commit that referenced this issue Jan 24, 2012
issue #22: forbid search engines from user pages to reduce search result...