
Collect error in the index #1097

Merged
merged 28 commits on Jan 3, 2024

Conversation

@TristanCacqueray (Contributor) commented Dec 20, 2023

This PR enables collecting crawler errors so that the process does not stop when it fails to process an entity.

related: #1093
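
A rough sketch of the approach (a minimal illustration with hypothetical names, not the PR's actual code): the crawler stream yields failures as Left values instead of throwing, so one bad entity no longer aborts the whole run and the Lefts can be collected into the index afterwards.

import Control.Exception (SomeException, try)
import Control.Monad.Trans.Class (lift)
import Streaming (Of, Stream)
import qualified Streaming.Prelude as S

-- Fetch every entity, yielding failures as Left so the stream keeps going;
-- the Lefts can then be collected into the index as crawler errors.
processEntities :: (entity -> IO doc) -> [entity] -> Stream (Of (Either SomeException doc)) IO ()
processEntities fetch = mapM_ step
 where
  step e = lift (try (fetch e)) >>= S.yield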

@TristanCacqueray (Contributor Author)

Note that I haven't tested this change with real data.

  -- Yield the error and stop the stream
- S.yield (Left $ GraphError e)
+ now <- lift mGetCurrentTime
+ S.yield (Left $ GraphError now e)
@TristanCacqueray (Contributor Author) commented on this diff:

I guess here we should try to extract the pageInfo from the error so that the process can continue; otherwise I'm not sure the current implementation will be able to resume after the error.
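
For reference, a minimal sketch of that idea, assuming the FetchError type from morpheus-graphql-client, whose FetchErrorProducedErrors constructor also carries the partial response when the server produced one (the helper name is hypothetical):

import Data.Morpheus.Client (FetchError (..))

-- Recover the partially decoded response (which holds the pageInfo)
-- when the server returned data alongside its errors.
recoverPartial :: FetchError resp -> Maybe resp
recoverPartial err = case err of
  FetchErrorProducedErrors _errs mResp -> mResp
  _ -> Nothing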

This change ensures that the bounded query includes errors added during
the current day. The Elasticsearch filter for the errors changed
from lte: 2023-12-22T00:00:00Z to lte: 2023-12-22T20:45:45Z, resulting
in the desired behavior.
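
In other words (a hedged sketch with hypothetical names, not the actual Monocle code), the upper bound of the query moves from the start of the current day to the current instant:

import Data.Time (UTCTime (..), getCurrentTime)

-- Before: the bound was truncated to midnight, e.g. 2023-12-22T00:00:00Z,
-- so errors indexed later the same day were excluded.
upperBoundBefore :: IO UTCTime
upperBoundBefore = do
  now <- getCurrentTime
  pure now {utctDayTime = 0}

-- After: the bound is the actual current time, e.g. 2023-12-22T20:45:45Z.
upperBoundAfter :: IO UTCTime
upperBoundAfter = getCurrentTime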
This change enables the web client to query the crawler API to collect
any errors.
@TristanCacqueray (Contributor Author) left a comment:


@morucci the implementation is now mostly complete, but we might want to check that it works as expected, e.g. by trying to index the llvm project.

src/Macroscope/Worker.hs — outdated review thread (resolved)
@@ -290,6 +290,34 @@ module About = {
}
}

module Errors = {
  module CrawlerError = {
@TristanCacqueray (Contributor Author) commented on this diff:

This component could be made prettier :)

This change also replaces bytestring-base64 with the
more general-purpose base64 library.
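
For illustration, a hedged sketch of what the swap looks like at a call site, assuming the pre-1.0 API of the base64 package (the helper name is hypothetical): unlike the older ByteString -> ByteString interface, base64 can encode directly to Text.

import Data.ByteString (ByteString)
import Data.Text (Text)
import qualified Data.ByteString.Base64 as B64

-- Encode an error payload for storage in the index.
encodeErrorBody :: ByteString -> Text
encodeErrorBody = B64.encodeBase64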
@bigmadkev

Just ran this pull request locally against the LLVM project I'm working with.
I can confirm that it posted 464 entries when it got stuck again.

I wonder if it should continue on through, as it hasn't hit a "no next page" error? As it stands, a clean install will get stuck on this one.

@TristanCacqueray (Contributor Author)

@bigmadkev Thank you for testing this change. Could you please confirm that the crawler is actually stuck on the error? I would expect the "last update" timestamp to be set past the offending PR so that the crawler should be able to resume.

Though, as indicated in the comment above, this implementation presently skips all the events happening after the error. I'll propose a change to decode the pageInfo from the error when it is available.

@bigmadkev

I can see that the next run picks up all the updated items and doesn't hit the errored one again.

But the issue is that anything updated before the erroring one isn't indexed at all, so rather than getting 9k items it's only getting ~500, up to the erroring pull request.

@TristanCacqueray (Contributor Author)

Thank you for confirming, so that's expected. From the crawler.log you shared, it looks like we are getting a FetchErrorProducedErrors, which actually contains the desired response, and I have updated this PR to handle that case. Make sure to delete your index if you want to try again. If you rebuild the web interface, you should see a red bell on the top right with a new errors page that should display "The additions count for this commit is unavailable".

@morucci I am not sure this change is great, because other unexpected errors might skip a large amount of data (the crawler now always sets the last-updated timestamp). Perhaps we should be more conservative in Worker.processStream and only skip the PartialResult variant. In any case, I have refactored LentilleError and GraphQLError to help with further improvements.
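
A hedged sketch of that more conservative behaviour (the types below are hypothetical stand-ins, not Monocle's actual definitions): only an error that still carries a page cursor advances past the failure; anything else stops the stream so data is not silently skipped.

-- Hypothetical stand-ins for the real error and result types:
newtype PageInfo = PageInfo {cursor :: String} deriving (Show)

data LentilleError
  = PartialResult PageInfo  -- the response decoded despite errors
  | UnexpectedError String  -- anything else
  deriving (Show)

data Outcome = SkipAndResume PageInfo | StopStream LentilleError
  deriving (Show)

-- Only PartialResult advances the last-updated timestamp; any other
-- error stops the stream instead of skipping data.
handleStreamError :: LentilleError -> Outcome
handleStreamError (PartialResult pageInfo) = SkipAndResume pageInfo
handleStreamError other = StopStream other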

Anyway I'll get back to this after the holidays, have a good end of the year! Cheers :)

This change enables grouping errors by crawler/entity
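
As a rough sketch of that grouping (hypothetical types, done here with Data.Map rather than whatever query the index actually uses):

import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type Crawler = String
type Entity = String

-- Group a flat list of error rows by their (crawler, entity) pair.
groupErrors :: [(Crawler, Entity, String)] -> Map (Crawler, Entity) [String]
groupErrors rows =
  Map.fromListWith (++) [((crawler, entity), [err]) | (crawler, entity, err) <- rows]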
@morucci added the "merge me" (Trigger the merge process) label on Jan 3, 2024
@morucci (Collaborator) commented Jan 3, 2024

Well done! Thanks!

@bigmadkev I was able to try this PR and indexed llvm/llvm-project.

@mergify bot merged commit 3a6276d into change-metrics:master on Jan 3, 2024
6 checks passed