Narrow site results in Google #22
I think I would prefer the less-invasive approach, given how robots.txt is kind of a nuclear option that affects all crawlers (except for the one you exclude, of course). This method of exclusion works:
And you can make this useful for yourself too by changing your bookmark keyword search / Chrome search, no?
Well, yes, I know how to exclude the Google hits (I gave a better way above); my point was about defaults. Is there a good reason to expose user pages to search engines when we know for a fact that every prediction will generate N spurious hits, where N is the number of users commenting or predicting on it?
Neat, I didn't know that. User pages are sometimes interesting, and given how Google is smart enough to rank them below prediction pages, I don't think they're a big problem. robots.txt also affects tools like wget and HTTrack, and making them ignore robots.txt is annoying. (And if robots.txt later lists resources that really should be ignored, those robots.txt-ignoring users will be grabbing the really useless resources.)
Alright, then what about just filtering it for the big 3 or 4 search engines? That'll help 99% of users and avoid hitting tools like wget. |
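A sketch of what that narrower robots.txt could look like. The specific user-agent tokens (Googlebot, bingbot, Slurp) are my assumption of "the big 3"; the thread doesn't name them:

```
# Hypothetical sketch, not from the thread: disallow /users/ only for
# the major search-engine crawlers, so tools like wget and HTTrack
# (which match no specific group and fall through to no rule) are unaffected.
User-agent: Googlebot
Disallow: /users/

User-agent: bingbot
Disallow: /users/

User-agent: Slurp
Disallow: /users/
```

Because robots.txt groups apply to the most specific matching user-agent, crawlers not listed here would see no restrictions at all.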
issue #22: forbid search engines from user pages to reduce search result...
I was quickly pasting a link summarizing existing predictions for a fine point in Methods of Rationality (https://encrypted.google.com/search?num=100&q=hat%20and%20cloak%20site%3Apredictionbook.com) and I noticed, not for the first time, that the actual prediction was being drowned out by the user pages.
Inasmuch as you can get to any relevant user page from the actual prediction, I think they're noise.
Fortunately, there's a very easy solution - we can just exclude /users/ in the robots.txt. However, in keeping with my usual long-term interests, I still want the Internet Archive to have access; I think we can combine the two like this (I wrote it up based on the Wikipedia info and http://www.archive.org/about/exclude.php , and then found a similar question at http://www.archive.org/post/234741/robotstxt ):
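The robots.txt snippet itself did not survive in this copy of the thread; based on the archive.org exclusion pages linked above, it presumably looked something like the following (the exact directives are my reconstruction, not the original):

```
# Reconstruction, not the original snippet: let the Internet Archive's
# crawler (ia_archiver) fetch everything...
User-agent: ia_archiver
Disallow:

# ...while keeping all other crawlers out of the user pages.
User-agent: *
Disallow: /users/
```

Since ia_archiver matches its own, more specific group, it ignores the wildcard group's Disallow and continues to archive /users/.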
The downside of this approach is that someone searching for a particular user would not see their user page pop up, but rather every prediction they've been involved in. This isn't a disadvantage for me, but it may be for others.
A less invasive but also less useful approach would be to augment the built-in Google site search with -site:predictionbook.com/users/ . (Not as useful for me, since I rarely launch my PB site searches from that dialogue, but from outside Firefox entirely.) Anyway, I'm going to patch robots.txt as suggested above.