Processing-Server #974
We decided to use only single quotes for strings, for consistency. I kept docstrings in triple double quotes because of PEP 257.
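For illustration, a minimal made-up snippet of that convention (the function and string are invented, not from this PR):

```python
def fetch_greeting():
    """Docstrings keep triple double quotes, as recommended by PEP 257."""
    # All other string literals use single quotes
    return 'hello from the processing server'
```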
They were added to give additional information, but they are no longer needed in this place.
…-processing-worker add bashlib processing worker, require Python 3.7
Should be good to be merged now!
try:
    # Only checks if the process queue exists, if not raises ValueError
    self.rmq_publisher.create_queue(processor_name, passive=True)
except ChannelClosedByBroker as error:
I was expecting this to raise just a `ValueError` if the queue doesn't exist, according to the doc; however, that turns out not to be the case: the broker closes the channel instead.
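For reference, a minimal standalone pika sketch that reproduces the behaviour described above (broker address and queue name are made up):

```python
import pika
from pika.exceptions import ChannelClosedByBroker

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
try:
    # passive=True only checks whether the queue exists; it never creates it
    channel.queue_declare(queue='nonexistent-queue', passive=True)
except ChannelClosedByBroker as error:
    # The broker answers 404 NOT_FOUND and closes the channel,
    # leaving the channel object unusable afterwards
    print(f'queue check failed: {error}')
```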
finally:
    # Reconnect publisher - not efficient, but works
    # TODO: Revisit when reconnection strategy is implemented
    self.connect_publisher(enable_acks=True)
Hence, a reconnection is needed here. I know that it's more efficient to just open a new channel, but I don't want to deal with that now; it will be more appropriate to invest the time in implementing the reconnection scheme correctly inside rabbitmq_utils.
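A rough sketch of that cheaper alternative, assuming plain pika objects rather than the rabbitmq_utils wrappers (the helper itself is hypothetical):

```python
import pika

def recover_channel(connection):
    """Reopen just the channel if the connection survived, else reconnect.

    Hypothetical helper, not part of rabbitmq_utils.
    """
    if connection.is_open:
        # ChannelClosedByBroker closes only the channel; the connection
        # usually survives, so a fresh channel is enough
        return connection, connection.channel()
    # Fallback: full reconnect (what the code above effectively does)
    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    return connection, connection.channel()
```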
Fix the silly mistake I made.
>
> ```shell
> # 1.
> $ ocrd processing-worker --queue=<queue-address> --database=<database-address>
>
> # 2.
> $ <processor-name> --queue=<queue-address> --database=<database-address>
> ```
>
> We (@joschrew and I) are aware that the [spec](https://github.com/OCR-D/spec/blob/master/web_api.md) contains a change that is not adapted here yet:
>
> ```shell
> # 1. Use ocrd CLI bundled with OCR-D/core
> $ ocrd server --type=worker --queue=<queue-address> --database=<database-address>
>
> # 2. Use processor name
> $ <processor-name> --server --type=worker --queue=<queue-address> --database=<database-address>
> ```
>
> Where `--type` can be either `worker` (processing worker):
>
> ```shell
> ocrd server --type=worker --queue=<queue-address> --database=<database-address>
> ```
>
> or `server` (the REST API wrapper based on #884).
>
> ```shell
> ocrd server --type=server --database=<database-address>
> ```
>
> However, it's a bad idea to extend the processing worker code to support the REST API processing server (aka a standalone processing worker that has nothing to do with the bigger Processing Server in the spec and does not need the queues).
>
> Now, try to imagine a good way to explain the following to the OCR-D community without confusing them:
>
> 1. why the `<processor-name>` is a server, but is not a server when `--type=worker`, is referred to with `processing-worker`, and no direct requests can be sent to it
> 2. why the `<processor-name>` is a server, an actual server, when `--type=server` and is referred to with `processing-server`
> 3. why both are grouped together under `ocrd server ...` or `... --server ...` but potentially implemented together under `processing_worker.py`
> 4. why the Processing Server (PS-big) and processing servers (PS-small), i.e. standalone ocrd processors, are different concepts but both referred to with `processing server`.
>
> Sounds confusing? It gets even more confusing when it has to be implemented in a clean way. So there are 3 main reasons not to adapt that yet:
>
> * The priority has changed a bit, and the higher priority now is to deploy a working Processing Server, Workflow Server, and Workspace Server together on a live VM instance, so that KITODO can use that to continue their development.
> * Identify and fix problems that arise when combining the 3 servers above from 2 different repositories ([Processing-Server #974](https://github.com/OCR-D/core/pull/974) and the reference WebAPI implementation).
> * To not complicate the current implementation before first thinking through how to separate the concepts properly and avoid potential problems. Ideally:
>
> 1. the standalone processing servers (aka server processing workers) should be implemented as a separate class (name suggestions?), a sibling of `ProcessingWorker`, and both should share the common methods.
> 2. the CLI for both should be separated, with improved naming conventions.
continued in #1032
Co-authored-by: Robert Sachunsky <[email protected]>
This pull request implements the processing server (OCR-D/spec#222). The processing server starts a RabbitMQ instance, a MongoDB instance, and processing workers via SSH and Docker. It receives calls to run OCR-D processors; the jobs are enqueued, and the workers read the jobs from the queue and execute them. The jobs are stored in MongoDB, where status changes of the jobs are reflected and can be queried through the processing server's API.
To start a worker, it must be defined in the configuration file, where the host of the worker has to be set. It is expected that the worker executables (e.g. ocrd-dummy) are available in the PATH; the PATH can be modified via ~/.bash_profile or ~/.profile.
Example configuration file:
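A minimal sketch of what such a file could look like, assuming a single host running one ocrd-dummy worker; all key names and values here are illustrative assumptions, not the exact schema shipped with this pull request:

```yaml
# Hypothetical sketch - key names and structure are assumptions for illustration
process_queue:
  address: localhost
  port: 5672
database:
  address: localhost
  port: 27017
hosts:
  - address: localhost
    username: testuser
    workers:
      - name: ocrd-dummy
        number_of_instance: 1
```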
Example calls for the processing-server endpoints:
Run a processor:
curl 'http://localhost:8080/processor/ocrd-dummy' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "path": "/home/testuser/.local/share/ocrd-workspaces/data/mets.xml", "input_file_grps": ["OCR-D-IMG"], "output_file_grps": ["IMG-COPY-1"], "parameters": { "copy_files": true } }'
Request processor status:
curl 'http://localhost:8080/processor/ocrd-dummy/<insert-job-id-here>'
List available processors:
curl 'http://localhost:8080/processor'
Get information about a single processor:
curl 'http://localhost:8080/processor/ocrd-dummy'
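For completeness, the submit-then-poll flow from the curl calls above as a small Python sketch using `requests`; treating `job_id` as the response key is an assumption about the response schema, not a confirmed detail:

```python
import requests

BASE = 'http://localhost:8080'

# Submit a job to ocrd-dummy (payload mirrors the curl example above)
response = requests.post(f'{BASE}/processor/ocrd-dummy', json={
    'path': '/home/testuser/.local/share/ocrd-workspaces/data/mets.xml',
    'input_file_grps': ['OCR-D-IMG'],
    'output_file_grps': ['IMG-COPY-1'],
    'parameters': {'copy_files': True},
})
job = response.json()

# Poll the job status; 'job_id' as the response key is an assumption
job_id = job['job_id']
status = requests.get(f'{BASE}/processor/ocrd-dummy/{job_id}').json()
print(status)
```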