Processing server extension (#1046) #1069
Merged · +546 −93

20 commits:
6c0958a extend db methods for DBWorkspace (MehmedGIT)
a74e7be make db workspace method flexible (MehmedGIT)
669437d db lock/unlock workspaces (MehmedGIT)
ce98957 refactoring to match new requirements (MehmedGIT)
a273439 implement page-wise locking and internal callback (MehmedGIT)
c7b6046 refactor comments (MehmedGIT)
efc9ec7 Fix worker/processor server access to DB (MehmedGIT)
82302d7 Limit pydantic to max version 1 (MehmedGIT)
e765bed Merge branch 'master' into processing_server_ext_1046 (kba)
347c80c Merge branch 'master' into processing_server_ext_1046 (kba)
becfe23 implement depends on feature (MehmedGIT)
2b0c4ea remove unnecessary import (MehmedGIT)
70dfcec Fix depends on and internal callback mechanism (MehmedGIT)
23591ec Add possibility to overwrite internal_callback_url (joschrew)
b6d7baf Adjust some log levels (joschrew)
afd97c8 Fix not available job as depends_on (joschrew)
5e5da5c Use list instead of set for locked pages (joschrew)
c17a596 fix page locking (MehmedGIT)
efbb628 multiple requests consumed from internal queue (MehmedGIT)
79016bf improve code: locking/unlocking (MehmedGIT)
Suggestion: change `being_processed` in the DBWorkspace from boolean to string. Serialize the request `data` without `page_id` in a canonical way, and set `being_processed` to that when locking. When a new request comes in, check whether its `data` is the same as `being_processed`. If so, allow the request, even though the workspace_db is `being_processed`.
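A minimal sketch of that comparison, assuming a plain request dict and hypothetical helper names (not the actual implementation):

```python
import json
from typing import Optional

def canonical_data(request: dict) -> str:
    # Serialize the request without its page_id in a canonical way:
    # sorted keys and fixed separators, so equal requests always
    # produce the same string.
    data = {k: v for k, v in request.items() if k != "page_id"}
    return json.dumps(data, sort_keys=True, separators=(",", ":"))

def may_run(request: dict, being_processed: Optional[str]) -> bool:
    # Allow the request if the workspace is free, or if the running
    # job is the "same" request (identical canonical data).
    return being_processed is None or canonical_data(request) == being_processed
```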
Objection by @MehmedGIT in today's call: Processing Server clients (like the Workflow Server) could decide to split a job into several page ranges (especially if they know that multiple worker instances are running in the backend) and issue them concurrently. So instead of a boolean or string, we would need to synchronise over the actual set of pages of each workspace.
This does not achieve proper workspace locking when the same request is created with overlapping page ranges for the same workspace.

This is the way to achieve proper page range locking to prevent collisions on the processor server/worker level; however, it is also complex to implement. The more I think about it, the more problematic it seems when considering error handling.

The `page_id` field will be in the form `start_page_id..end_page_id`. The `being_processed` boolean field will be replaced with a str field `locked_pages` in the form `start_page_id1..end_page_id1,...,start_page_id4..end_page_id4`. Then detecting overlapping single pages or page ranges is achievable, so errors can be raised for subsequent requests. If the ranges don't overlap and match for the same processor, the workspace would not be considered locked. For other processors, the workspace will still be considered locked. The unlocking of the workspace pages will then happen based on the internal callbacks to the Processing Server from the Processor Server or Processing Worker when the execution finishes or fails.

Problem: still, in case either of the two fails to use a callback, say due to a crash, the workspace will be locked indefinitely. And choosing a proper timeout for a server or worker is hard.
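Illustratively, the overlap test on such range strings might look like the following sketch, assuming page IDs compare in document order when sorted as strings (true for zero-padded IDs like PHYS0001) and ignoring the comma-list form for the moment:

```python
def parse_range(r: str):
    # "PHYS0009..PHYS0015" -> ("PHYS0009", "PHYS0015"); a single page
    # "PHYS0001" is treated as the range ("PHYS0001", "PHYS0001").
    start, sep, end = r.partition("..")
    return (start, end) if sep else (start, start)

def ranges_overlap(a: str, b: str) -> bool:
    # Two closed intervals [a0, a1] and [b0, b1] overlap
    # iff a0 <= b1 and b0 <= a1.
    a0, a1 = parse_range(a)
    b0, b1 = parse_range(b)
    return a0 <= b1 and b0 <= a1

def is_locked(page_id: str, locked_pages: str) -> bool:
    # locked_pages in the proposed comma-separated form, e.g.
    # "PHYS0001..PHYS0003,PHYS0009..PHYS0015"
    return any(ranges_overlap(page_id, r) for r in locked_pages.split(",") if r)
```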
Considering that `page_id` could be:
- `PHYS0001`
- `PHYS0003,PHYS0004,PHYS0005`
- `PHYS0009..PHYS0015`

Now assume a potential scenario where 3 processing requests with the different `page_id` formats above are passed: the value of `locked_pages` for the 3 requests will be `PHYS0001_PHYS0003,PHYS0004,PHYS0005_PHYS0009..PHYS0015`, assuming that the separator is an underscore. Or just a string list with values `[PHYS0001][PHYS0003,PHYS0004,PHYS0005][PHYS0009..PHYS0015]`.

Potential problem: `locked_pages` gets filled with a lot of values in case single-page requests are passed for the entire workspace. Another thing to consider is collision detection: the incoming `page_id` can have 3 different forms, hence lots of extra overhead for handling the data.
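A sketch of resolving all three forms into a single set of pages, assuming the ordered list of physical pages can be read from the workspace METS (the helper name and signature are made up for illustration):

```python
from typing import List, Set

def resolve_page_ids(page_id: str, all_pages: List[str]) -> Set[str]:
    # all_pages: ordered physical pages of the workspace, e.g.
    # ["PHYS0001", "PHYS0002", ..., "PHYS0015"], as read from the METS.
    pages: Set[str] = set()
    for part in page_id.split(","):
        if ".." in part:
            start, _, end = part.partition("..")
            # raises ValueError if a page ID is unknown to the workspace
            i, j = all_pages.index(start), all_pages.index(end)
            pages.update(all_pages[i:j + 1])   # expand PHYS0009..PHYS0015
        else:
            pages.add(part)                    # single page like PHYS0001
    return pages
```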
Why raise errors? IMO, locking is simply about blocking requests until the associated resources become free again.
Why make this contingent on the identity of the processor? IMO, if the other request concerns a disjoint set of pages, then there simply cannot be a dependency relation to the currently running processor(s). It could be because the other request is prior (lagging behind) or posterior (more advanced) in the overall workflow across pages. Or it could simply be an independent request. (Conversely, independent requests which do affect the same page range will gain an artificial dependency relation here, but that's not much of a problem.)
If anything, if we want to look more closely into dependencies, we should consider pairs `set(pages) · fileGrp`. Thus, even requests on the same pages but regarding distinct fileGrps would stay independent.
Not a problem: simply use the actual resolved list (or rather, set) of pages! Set operations (union for adding, intersection for testing, difference for clearing) are clear and efficient.
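Combining this with the `set(pages) · fileGrp` pairing suggested above, a per-workspace lock table could be sketched like so (names are illustrative, not the actual implementation):

```python
from collections import defaultdict
from typing import Dict, Set

class WorkspacePageLocks:
    def __init__(self):
        # fileGrp -> set of currently locked page IDs
        self.locked: Dict[str, Set[str]] = defaultdict(set)

    def is_blocked(self, file_grp: str, pages: Set[str]) -> bool:
        # intersection for testing: any shared page means a dependency
        return bool(self.locked[file_grp] & pages)

    def lock(self, file_grp: str, pages: Set[str]) -> None:
        # union for adding
        self.locked[file_grp] |= pages

    def unlock(self, file_grp: str, pages: Set[str]) -> None:
        # difference for clearing (on the internal callback)
        self.locked[file_grp] -= pages
```

Requests on the same pages but in distinct fileGrps never intersect here, so they stay independent, exactly as argued above.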
Right. I was just considering the cases where double executions of the same processor with overlapping pages are submitted. That would mostly end in error, wouldn't it? Unless overwrite is already set for the next requests.

Because the workspace will be unlocked for requests of the same processor, which can potentially have different page ranges. The collisions for the same processor but different pages are handled by the METS server. However, if we get a different processor but the same pages, then the workspace is locked and that request goes into the waiting internal queue (CACHED). Consider cases where running processors cannot be done independently, i.e., the output of the first processor has to be used by the second processor.

Right, that means we must track `processor · set(pages) · fileGrp` (considering the info from the next paragraph). This makes it even harder to track (and potentially debug) what to submit directly to RabbitMQ and what to queue internally (to cache).

Oh, right. We should also consider fileGrp in the mix of complexity...

This is of course simpler to manage, although there will be more stress on the DB itself with reads/writes - which I guess is okay for now.

Yes, we have some dummy timeout right now based on the submitted amount of pages (set to 200 by default, multiplied by a timeout value per page), which could potentially lead to timeout errors even when the output is correctly produced, say due to slower processing. The amount of pages is still not part of the processing request.

Yes, there will be a race condition on the internal queue resource.
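Roughly, the submit-or-cache decision and the page-based dummy timeout described above could fit together as in this sketch (constants and names are assumptions, not the real code):

```python
DEFAULT_NUM_PAGES = 200      # assumed default when the request carries no page count
TIMEOUT_PER_PAGE = 30.0      # seconds per page; illustrative value only

def job_timeout(num_pages: int = DEFAULT_NUM_PAGES) -> float:
    # Dummy timeout: submitted number of pages times a per-page value.
    # Slow processing can exceed this even when the output is correct.
    return num_pages * TIMEOUT_PER_PAGE

def route_request(locks, file_grp: str, pages: set) -> str:
    # locks: a WorkspacePageLocks-like object as sketched further above.
    # Submit directly to the RabbitMQ queue when no locked page overlaps;
    # otherwise the request waits in the internal queue (CACHED).
    if locks.is_blocked(file_grp, pages):
        return "CACHED"
    locks.lock(file_grp, pages)
    return "QUEUED"
```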
Sorry to add to the complexity, but we also need the parameterization of the processor in the mix, so `processor · parameters · set(pages) · fileGrp` 😬
Oh that. Yes, that's right. (If the METS already contains results for a page-fileGrp.) But for that we don't have to do anything (detecting overlaps etc) – this error will be coming from the processor itself (it already contains logic looking for conflicts in the output fileGrp / page list).
This sounds confusing to me. My above analysis still stands IMHO – I don't see any reason why we should look at the identity of the processor. (But I agree with @kba that if we do, then it would really have to be the combination of processor and parameters.) It's just the pages and the fileGrps that cause a dependency.
Regarding timeouts (using the actual number of pages in the Processor Server model, adding a timeout mechanism in the Processing Worker) – should we track that in a separate issue?
Agree, we should not try to do early prevention of errors since it makes the implementation more complex on the Processing Server side.
For the sake of keeping it simple - it should be just pages and fileGrp of a workspace then.
#1074
Yes. If we want to do anticipation of workflow conflicts, then it would be what is currently done in `ocrd process` via `ocrd.task_sequence.validate_tasks()` – checking that `--overwrite` is set wherever output fileGrps already exist. That could be done in the Workflow Server statically – before sending requests.
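As a toy sketch of such a static check in the Workflow Server (this is not the real `validate_tasks()` signature, just the idea):

```python
def validate_workflow(tasks, existing_file_grps, overwrite=False):
    # tasks: list of (executable, input_file_grps, output_file_grps) tuples.
    available = set(existing_file_grps)
    for executable, in_grps, out_grps in tasks:
        missing = set(in_grps) - available
        if missing:
            raise ValueError(f"{executable}: input fileGrps {missing} not produced by any earlier task")
        clobbered = set(out_grps) & available
        if clobbered and not overwrite:
            raise ValueError(f"{executable}: output fileGrps {clobbered} already exist; pass --overwrite")
        available |= set(out_grps)
```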