This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

crawler stops indexing results #32

Open
noerw opened this issue Sep 6, 2018 · 0 comments
Labels
bug Something isn't working

Comments

Contributor

noerw commented Sep 6, 2018

One of these two issues regularly occurs shortly after starting a new crawl:

  1. The crawler keeps crawling, but only indexes the initial seed URLs: Outlinks are fetched, the crawlstatus index is still updated, but no new pages appear in the results index.

  2. The crawler stops crawling entirely after a short time: No pages are fetched anymore at all, though the crawlstatus index contains newly discovered pages.

I can't reproduce this every time. Restarting the crawler continues the crawl as expected without further issues.
It seems to be a caching issue, as this only occurs for a crawl whose exact configuration has been used before (?).
After a thorough search I could not identify the cause; reverting to earlier versions does not fix the issue!
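
To tell the two symptoms apart while a crawl is running, one option is to compare document counts taken a few minutes apart. The following is only a sketch: the `Snapshot` structure and the `classify` helper are hypothetical, and it assumes the counts were read by hand from the crawlstatus index (entries with status `FETCHED` and `DISCOVERED`) and from the results index.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Document counts read at one point in time (hypothetical structure)."""
    fetched: int     # crawlstatus entries marked FETCHED
    discovered: int  # crawlstatus entries marked DISCOVERED
    indexed: int     # documents in the results index


def classify(before: Snapshot, after: Snapshot) -> str:
    """Map two snapshots onto the two symptoms described above."""
    fetching = after.fetched > before.fetched
    indexing = after.indexed > before.indexed
    if fetching and not indexing:
        # Pages are still fetched, but the results index does not grow.
        return "symptom 1: fetching continues, nothing new is indexed"
    if not fetching and after.discovered > 0:
        # Nothing is fetched although discovered pages are waiting.
        return "symptom 2: fetching stopped with discovered pages pending"
    return "no stall detected"
```

For example, `classify(Snapshot(10, 50, 10), Snapshot(40, 120, 10))` reports symptom 1, because the `FETCHED` count grew while the results index stayed at 10 documents.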

noerw added the bug label Sep 6, 2018
noerw added a commit that referenced this issue Sep 6, 2018
disabling es.status.reset.fetchdate.after somehow caused the crawler
to not return any new results.

another issue remains:
a small percentage of pages is crawled again and again, but is
neither indexed nor updated to FETCHED.