This repository has been archived by the owner on Mar 24, 2021. It is now read-only.

Narrow site results in Google #22

Open

gwern opened this issue Dec 15, 2011 · 4 comments

Comments

@gwern
Contributor

gwern commented Dec 15, 2011

I was quickly pasting a link summarizing existing predictions for a fine point in Methods of Rationality (https://encrypted.google.com/search?num=100&q=hat%20and%20cloak%20site%3Apredictionbook.com) and I noticed, not for the first time, that the actual prediction was being drowned out by the user pages.

Inasmuch as you can get to any relevant user page from the actual prediction, I think they're noise.

Fortunately, there's a very easy solution - we can just exclude /users/ in robots.txt. However, in keeping with my usual long-term interest, I still want the Internet Archive to have access; since a crawler obeys the most specific User-agent record that matches it, ia_archiver will follow only its own Allow block and ignore the catch-all Disallow, so I think we can combine the two like so (I wrote it up based on the Wikipedia info and http://www.archive.org/about/exclude.php, and then found a similar question at http://www.archive.org/post/234741/robotstxt ):

User-agent: ia_archiver
Allow: /users/

User-agent: *
Disallow: /users/

The downside of this approach is that someone searching for a particular user would not see their user-page pop up, but rather every prediction they've been involved in. This isn't a disadvantage for me, but it may be for others.

A less invasive but also less useful approach would be to augment the built-in Google site search with -site:predictionbook.com/users/. (Not as useful for me, since I rarely launch my PB site searches from that dialogue, but from outside Firefox entirely.)

Anyway, I'm going to patch robots.txt as suggested above.

@ivankozik

I think I would prefer the less-invasive approach, given how robots.txt is kind of a nuclear option that affects all crawlers (except the one you exempt, of course).

This method of exclusion works:

hat and cloak site:predictionbook.com -inurl:predictionbook.com/users/

And you can make this useful for yourself too by changing your bookmark keyword search / Chrome search, no?
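For example, a keyword bookmark or custom search engine along these lines should work (assuming the usual %s placeholder; the URL is just my guess at wiring up the query above):

https://encrypted.google.com/search?num=100&q=%s+site:predictionbook.com+-inurl:predictionbook.com/users/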

@gwern
Contributor Author

gwern commented Dec 16, 2011

Well, yes, I know how to exclude the Google hits (I gave a better way above); my point was about defaults. Is there a good reason to expose user pages to search engines when we know for a fact that every prediction will generate N spurious hits where N is the number of users commenting or predicting on it?

@ivankozik

Neat, I didn't know -site:url works too.

User pages are sometimes interesting, and given how Google is smart enough to rank them below prediction pages, I don't think they're a big problem. robots.txt also affects things like wget and HTTrack, and making them ignore robots.txt is annoying. (And if robots.txt later lists resources that should really be ignored, those robots.txt-ignoring users will be grabbing the really-useless resources.)
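For reference, making wget ignore robots.txt takes an extra flag on every invocation, something like:

wget -e robots=off --mirror http://predictionbook.com/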

@gwern
Contributor Author

gwern commented Dec 16, 2011

Alright, then what about just filtering it for the big 3 or 4 search engines? That'll help 99% of users and avoid hitting tools like wget.
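Something like this, say (the user-agent tokens are my best guess at the big engines' crawlers, i.e. Googlebot, bingbot, and Yahoo's Slurp; double-check them before deploying):

User-agent: Googlebot
Disallow: /users/

User-agent: bingbot
Disallow: /users/

User-agent: Slurp
Disallow: /users/

With no catch-all User-agent: * record, wget, HTTrack, and ia_archiver would be left alone.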

keyist added a commit that referenced this issue Jan 24, 2012
issue #22: forbid search engines from user pages to reduce search result...