Kien Pham edited this page Apr 1, 2015 · 13 revisions

What is inside the data output directory?

data_target contains relevant pages.

data_negative contains irrelevant pages. By default, the crawler does not save irrelevant pages.

data_monitor contains the current status of the crawler.

data_url and data_backlinks are where the persistent stores keep the frontier and the crawled graph.

When to stop the crawler?

Unless you stop it, the crawler exits when the number of crawled pages exceeds the limit in the settings, which is 9M by default. To decide whether to stop the crawler earlier, check the file data_monitor/harvestinfo.csv to see how many pages have been downloaded. Its 1st, 2nd, and 3rd columns are the number of relevant pages, the number of visited pages, and the timestamp.
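As a minimal sketch of checking progress from a script, the Python snippet below reads the last complete row of harvestinfo.csv and returns the three columns described above. The function name latest_harvest_status is hypothetical, and the comma delimiter is an assumption based on the .csv extension; if the file turns out to use another separator, pass it through the delimiter parameter. The timestamp is returned as a string because its format is not specified here.

```python
import csv


def latest_harvest_status(path, delimiter=","):
    """Return (relevant_pages, visited_pages, timestamp) from the last
    complete row of the harvest file, or None if there is no such row.

    Assumption: columns are delimiter-separated in the order
    relevant pages, visited pages, timestamp.
    """
    last_row = None
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            # Skip blank or partially written lines.
            if len(row) >= 3:
                last_row = row
    if last_row is None:
        return None
    relevant = int(last_row[0])
    visited = int(last_row[1])
    timestamp = last_row[2].strip()  # kept as text; format not assumed
    return relevant, visited, timestamp


if __name__ == "__main__":
    # Path is relative to the crawler's data output directory.
    status = latest_harvest_status("data_monitor/harvestinfo.csv")
    if status is None:
        print("No harvest data yet.")
    else:
        relevant, visited, timestamp = status
        print(f"{relevant} relevant of {visited} visited pages at {timestamp}")
```

Running this periodically while the crawler is active gives a quick view of how many relevant pages have been collected so far, which can help decide when to stop.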

Where to report bugs?

We welcome users to report any issue related to ACHE. Here is a guideline for using GitHub's tracker - Issues: https://guides.github.com/features/issues/