FAQs
data_target contains relevant pages.
data_negative contains irrelevant pages. By default, the crawler does not save irrelevant pages.
data_monitor contains the current status of the crawler.
data_url and data_backlinks are where the persistent stores keep information about the frontier and the crawled graph.
Unless you stop it, the crawler exits when the number of crawled pages exceeds the limit in the settings, which is 9M by default. You can look at the file data_monitor/harvestinfo.csv to see how many pages have been downloaded and decide whether you want to stop the crawler. The 1st, 2nd, and 3rd columns are the number of relevant pages, the number of visited pages, and the timestamp.
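
As a rough illustration of reading that file, the sketch below returns the most recent row of harvestinfo.csv as a (relevant, visited, timestamp) tuple. The function names are hypothetical helpers, not part of ACHE. It also assumes the file has no header row and that fields are separated by commas, tabs, or spaces, so check both against your own output before relying on it. The timestamp is kept as a plain string because its format is not specified here.

```python
import re
from pathlib import Path


def parse_harvest_line(line):
    """Split one row into (relevant_pages, visited_pages, timestamp).

    Assumption: fields are separated by commas, tabs, or spaces.
    The timestamp is returned as a string, unparsed.
    """
    relevant, visited, timestamp = re.split(r"[,\t ]+", line.strip(), maxsplit=2)
    return int(relevant), int(visited), timestamp


def latest_status(path="data_monitor/harvestinfo.csv"):
    """Return the parsed last non-empty row of the file, or None if it is empty.

    Assumption: the file has no header row.
    """
    rows = [r for r in Path(path).read_text().splitlines() if r.strip()]
    return parse_harvest_line(rows[-1]) if rows else None
```

Calling latest_status() from the crawler's output directory gives the latest counts, which you can compare against your own page budget before deciding to stop the crawler.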
We welcome users to report any issues related to ACHE. Here is a guideline for using GitHub's issue tracker: https://guides.github.com/features/issues/