
Collect error in the index #1097

Merged
merged 28 commits on Jan 3, 2024

Conversation

@TristanCacqueray (Contributor) commented Dec 20, 2023

This PR enables collecting crawler errors so that the process does not stop when it fails to process an entity.

related: #1093
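
A rough sketch of the approach (a minimal illustration with hypothetical names, not the PR's actual code): the crawler stream yields failures as Left values instead of throwing, so one bad entity no longer aborts the whole run and the Lefts can be collected into the index afterwards.

import Control.Exception (SomeException, try)
import Control.Monad.Trans.Class (lift)
import Streaming (Of, Stream)
import qualified Streaming.Prelude as S

-- Fetch every entity, yielding failures as Left so the stream keeps going;
-- the Lefts can then be collected into the index as crawler errors.
processEntities :: (entity -> IO doc) -> [entity] -> Stream (Of (Either SomeException doc)) IO ()
processEntities fetch = mapM_ step
 where
  step e = lift (try (fetch e)) >>= S.yield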

@TristanCacqueray (Contributor Author)

Note that I haven't tested this change with real data.

  -- Yield the error and stop the stream
- S.yield (Left $ GraphError e)
+ now <- lift mGetCurrentTime
+ S.yield (Left $ GraphError now e)
@TristanCacqueray (Contributor Author) commented on this diff:

I guess here we should try to extract the pageInfo from the error so that the process can continue; otherwise I'm not sure the current implementation will be able to resume after the error.
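
For reference, a minimal sketch of that idea, assuming the FetchError type from morpheus-graphql-client, whose FetchErrorProducedErrors constructor also carries the partial response when the server produced one (the helper name is hypothetical):

import Data.Morpheus.Client (FetchError (..))

-- Recover the partially decoded response (which holds the pageInfo)
-- when the server returned data alongside its errors.
recoverPartial :: FetchError resp -> Maybe resp
recoverPartial err = case err of
  FetchErrorProducedErrors _errs mResp -> mResp
  _ -> Nothing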

This change ensures that the bounded query includes errors added during
the current day. The Elasticsearch filter for the errors changed
from lte: 2023-12-22T00:00:00Z to lte: 2023-12-22T20:45:45Z, resulting
in the desired behavior.
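
In other words (a hedged sketch with hypothetical names, not the actual Monocle code), the upper bound of the query moves from the start of the current day to the current instant:

import Data.Time (UTCTime (..), getCurrentTime)

-- Before: the bound was truncated to midnight, e.g. 2023-12-22T00:00:00Z,
-- so errors indexed later the same day were excluded.
upperBoundBefore :: IO UTCTime
upperBoundBefore = do
  now <- getCurrentTime
  pure now {utctDayTime = 0}

-- After: the bound is the actual current time, e.g. 2023-12-22T20:45:45Z.
upperBoundAfter :: IO UTCTime
upperBoundAfter = getCurrentTime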
This change enables the web client to query the crawler API to collect
any errors.
@TristanCacqueray (Contributor Author) left a comment:


@morucci the implementation is now mostly complete, but we might want to check that it works as expected, e.g. by trying to index the llvm project.

src/Macroscope/Worker.hs — outdated review thread (resolved)
@@ -290,6 +290,34 @@ module About = {
}
}

module Errors = {
  module CrawlerError = {
@TristanCacqueray (Contributor Author) commented on this diff:

This component could be made prettier :)

This change also replaces bytestring-base64 with the
more general-purpose base64 library.
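
For illustration, a hedged sketch of what the swap looks like at a call site, assuming the pre-1.0 API of the base64 package (the helper name is hypothetical): unlike the older ByteString -> ByteString interface, base64 can encode directly to Text.

import Data.ByteString (ByteString)
import Data.Text (Text)
import qualified Data.ByteString.Base64 as B64

-- Encode an error payload for storage in the index.
encodeErrorBody :: ByteString -> Text
encodeErrorBody = B64.encodeBase64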
@bigmadkev

Just ran this pull request locally against the LLVM project I'm working with.
I can confirm that it posted 464 entries when it got stuck again.

I wonder if it should continue on through, as it hasn't hit a "no next page" error? As it stands, a clean install will get stuck on this one.

@TristanCacqueray (Contributor Author)

@bigmadkev Thank you for testing this change. Could you please confirm that the crawler is actually stuck on the error? I would expect the "last update" timestamp to be set past the offending PR so that the crawler should be able to resume.

Though, as indicated in the comment above, this implementation presently skips all the events happening after the error. I'll propose a change to decode the pageInfo from the error when it is available.

@bigmadkev

I can see that the next run picks up all the updated items and doesn't hit the errored one again.

But the issue is that anything updated before the erroring one isn't indexed at all, so rather than getting 9k items it's only getting ~500, up to the erroring pull request.

@TristanCacqueray (Contributor Author)

Thank you for confirming, so that's expected. From the crawler.log you shared, it looks like we are getting a FetchErrorProducedErrors, which actually contains the desired response, and I have updated this PR to handle that case. Make sure to delete your index if you want to try again. If you rebuild the web interface, you should see a red bell on the top right with a new errors page that should display "The additions count for this commit is unavailable".

@morucci I am not sure this change is great, because other unexpected errors might skip a large amount of data (the crawler now always sets the last-updated timestamp). Perhaps we should be more conservative in Worker.processStream and only skip the PartialResult variant. In any case, I have refactored LentilleError and GraphQLError to help with further improvements.
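
A hedged sketch of that more conservative behaviour (the types below are hypothetical stand-ins, not Monocle's actual definitions): only an error that still carries a page cursor advances past the failure; anything else stops the stream so data is not silently skipped.

-- Hypothetical stand-ins for the real error and result types:
newtype PageInfo = PageInfo {cursor :: String} deriving (Show)

data LentilleError
  = PartialResult PageInfo  -- the response decoded despite errors
  | UnexpectedError String  -- anything else
  deriving (Show)

data Outcome = SkipAndResume PageInfo | StopStream LentilleError
  deriving (Show)

-- Only PartialResult advances the last-updated timestamp; any other
-- error stops the stream instead of skipping data.
handleStreamError :: LentilleError -> Outcome
handleStreamError (PartialResult pageInfo) = SkipAndResume pageInfo
handleStreamError other = StopStream other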

Anyway I'll get back to this after the holidays, have a good end of the year! Cheers :)

This change enables grouping errors by crawler/entity
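
As a rough sketch of that grouping (hypothetical types, done here with Data.Map rather than whatever query the index actually uses):

import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type Crawler = String
type Entity = String

-- Group a flat list of error rows by their (crawler, entity) pair.
groupErrors :: [(Crawler, Entity, String)] -> Map (Crawler, Entity) [String]
groupErrors rows =
  Map.fromListWith (++) [((crawler, entity), [err]) | (crawler, entity, err) <- rows]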
@morucci added the "merge me" (Trigger the merge process) label on Jan 3, 2024
@morucci (Collaborator) commented Jan 3, 2024

Well done! Thanks!

@bigmadkev I was able to try this PR and indexed llvm/llvm-project.

@mergify bot merged commit 3a6276d into change-metrics:master on Jan 3, 2024
6 checks passed