This repository has been archived by the owner on May 5, 2024. It is now read-only.
blacklist being ignore #2
Comments
I’m sorry to hear that. The plugin seemed like a well-designed solution to a common problem. The ability to identify page elements for inclusion or exclusion, whether by HTML element, by CSS selector, or by hidden HTML comments (e.g., <!--startspider -->, <!--stopspider -->), feels like a very important piece of functionality.
As for our status, for the moment we have backed away from using Nutch to crawl our websites, and have therefore not attempted to fix this plugin or found an alternative.
Thanks very much for responding.
From: Bojan Tomić [mailto:[email protected]]
Sent: Tuesday, December 19, 2017 10:16 PM
To: kaqqao/nutch-element-selector
Cc: Kissman, Paul (BLC); Author
Subject: Re: [kaqqao/nutch-element-selector] blacklist being ignore (#2)
Sorry for not responding for so long...
I no longer maintain or use this plugin, and unfortunately I no longer remember how it works :(
I hope you've managed to find your answer in the meantime. In either case, please post back your current status, and the answer if you've got it.
Cheers!
This is probably a problem with my setup rather than your plugin.
I have nutch-2.3.1 and have installed your plugin to get rid of a bunch of navigation elements, breadcrumbs, and footer components from my standard web pages.
I've set my nutch-site.xml property as follows (I'm also using Tika for PDF and Word documents):
name: plugin.includes
value: protocol-httpclient|urlfilter-regex|element-selector|index-(basic|more)|query-(basic|site|url)|indexer-solr|nutch-extensionpoints|parse-(text|html|tika|js)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

And here are the elements I am trying to grab with the blacklist, also in nutch-site.xml:

name: parser.html.selector.blacklist
value: header,footer,div.breadcrumbs,div.searchForm,table.jobTable
description: A comma-delimited....

I've tried with parse-html both included and excluded. I read that if you enable Tika, you shouldn't use parse-html. But I thought that for your plugin to work, parse-html must be enabled.

Any guidance would be helpful.
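For reference, the two properties quoted above would presumably look something like this in nutch-site.xml. This is only a sketch that assumes the usual Hadoop-style `<configuration>`/`<property>` layout; the markup was stripped when the issue was rendered, so the nesting is a reconstruction and not a copy of the actual file. The names and values are taken verbatim from the post, and the truncated description is left out.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Plugins to load; value copied verbatim from the post above -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|element-selector|index-(basic|more)|query-(basic|site|url)|indexer-solr|nutch-extensionpoints|parse-(text|html|tika|js)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Selectors to strip from parsed pages; the description text was
       truncated in the post, so it is omitted here -->
  <property>
    <name>parser.html.selector.blacklist</name>
    <value>header,footer,div.breadcrumbs,div.searchForm,table.jobTable</value>
  </property>
</configuration>
```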
Thanks very much in advance.