Collect error in the index #1097
Conversation
This change enables storing crawler errors in the index.
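As a rough sketch of the kind of document this stores, with illustrative field names (assumptions for this comment, not necessarily the PR's actual schema):

```haskell
import Data.Text (Text)
import Data.Time (UTCTime)

-- Illustrative shape of an indexed crawler error: which crawler failed, on
-- which entity, what the API reported, and when the failure was recorded.
data CrawlerError = CrawlerError
  { ceCrawler :: Text
  , ceEntity :: Text
  , ceMessage :: Text
  , ceCreatedAt :: UTCTime
  }
```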
Force-pushed from 1042087 to ccdaff9.
Note that I haven't tested this change with real data.
Force-pushed from ccdaff9 to 80442d4.
Force-pushed from 80442d4 to 5cac8e3.
src/Lentille/GraphQL.hs (outdated):

```diff
   -- Yield the error and stop the stream
-  S.yield (Left $ GraphError e)
+  now <- lift mGetCurrentTime
+  S.yield (Left $ GraphError now e)
```
I guess here we should try to extract the pageInfo from the error to make the process continue; otherwise I'm not sure the current implementation will be able to resume after the error.
This change ensures that the bounded query includes errors added during the current day. The Elasticsearch filter for the errors changed from lte: 2023-12-22T00:00:00Z to lte: 2023-12-22T20:45:45Z, resulting in the desired behavior.
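For illustration, a minimal aeson sketch of building such a range filter from the current instant; the created_at field name is an assumption rather than Monocle's actual mapping:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (Value, object, (.=))
import Data.Time (UTCTime, getCurrentTime)
import Data.Time.Format.ISO8601 (iso8601Show)

-- Bound the query by "now" instead of a midnight-truncated date so that
-- errors recorded during the current day fall inside the range.
errorsRangeFilter :: UTCTime -> Value
errorsRangeFilter now =
  object ["range" .= object ["created_at" .= object ["lte" .= iso8601Show now]]]

main :: IO ()
main = getCurrentTime >>= print . errorsRangeFilter
```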
This change enables the web client to query the crawler API to collect any errors.
Force-pushed from a0e8ccd to eec06d1.
@morucci the implementation is now mostly complete, but we might want to check that it is working as expected, e.g. by trying to index the llvm project.
```diff
@@ -290,6 +290,34 @@ module About = {
   }
 }

+module Errors = {
+  module CrawlerError = {
```
This component could be made prettier :)
This change also replaces base64-bytestring with the more general-purpose base64 library.
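For reference, a minimal round-trip sketch assuming the base64 package's pre-1.0 API, where encodeBase64 returns Text and decodeBase64 returns Either Text ByteString (the 1.0 release wraps results in a Base64 type instead):

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString.Base64 as B64 -- from the base64 package
import Data.Text (Text)
import Data.Text.Encoding (encodeUtf8)

-- Encode to Text, then decode back; decoding failures are explicit
-- rather than best-effort.
roundTrip :: ByteString -> Either Text ByteString
roundTrip = B64.decodeBase64 . encodeUtf8 . B64.encodeBase64
```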
Force-pushed from 7fb6b8f to 0ea04b3.
Just ran this pull request locally against the LLVM project I'm working with. I wonder if it should continue on through, since it hasn't hit a "no next page" error? As it is, a clean install will get stuck on this one.
@bigmadkev Thank you for testing this change. Could you please confirm that the crawler is actually stuck on the error? I would expect the "last update" timestamp to be set past the offending PR, so the crawler should be able to resume. Though, as indicated in the comment above, this implementation presently skips all the events happening after the error; I'll propose a change to decode the pageInfo from the error when it is available.
I can see that the next run picks up all the updated items and doesn't hit the errored one again. But the issue is that anything updated before the erroring one isn't indexed at all, so rather than getting 9k items it's only getting 500-ish, up to the erroring pull request.
Thank you for confirming, so that's expected. It looks like from the crawler.log you shared that we are getting a FetchErrorProducedErrors, which actually contains the desired response, and I have updated this PR to handle that case. Make sure to delete your index if you want to try again. If you rebuild the web interface, you should see a red bell on the top right with a new errors page that should display the collected errors.

@morucci I am not sure this change is great, because other unexpected errors might skip a large amount of data (the crawler now always sets the last updated timestamp). Perhaps we should be more conservative about when that timestamp is advanced.

Anyway, I'll get back to this after the holidays. Have a good end of the year! Cheers :)
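As a rough illustration of the recovery, a minimal sketch assuming morpheus-graphql-client's FetchError shape, where the FetchErrorProducedErrors constructor bundles the reported GraphQL errors with a possibly partial decoded response:

```haskell
import Data.Morpheus.Client (FetchError (..))

-- Keep the partial response carried alongside the errors instead of
-- aborting; Nothing means there is truly nothing to recover.
recoverPartial :: FetchError a -> Maybe a
recoverPartial (FetchErrorProducedErrors _errs partial) = partial
recoverPartial _ = Nothing
```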
This change enables grouping errors by crawler/entity
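A minimal sketch of such grouping; the CrawlerError record restates the illustrative shape from the earlier sketch so the snippet stands alone:

```haskell
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import Data.Text (Text)

-- Illustrative error shape (not Monocle's actual schema).
data CrawlerError = CrawlerError
  { ceCrawler :: Text
  , ceEntity :: Text
  , ceMessage :: Text
  }

-- Group errors by (crawler, entity) so the UI can render one section per pair.
groupErrors :: [CrawlerError] -> Map (Text, Text) [CrawlerError]
groupErrors =
  Map.fromListWith (<>) . map (\e -> ((ceCrawler e, ceEntity e), [e]))
```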
Force-pushed from a8b9a14 to 4d2bf8a.
Force-pushed from e266c62 to 2ba9fdd.
Well done! Thanks! @bigmadkev I was able to try this PR and indexed llvm/llvm-project.
This PR enables collecting crawler errors so that the process does not stop when failing to process an entity.
related: #1093