
Processing-Server #974

Merged
merged 230 commits on Apr 2, 2023

Conversation

@joschrew (Contributor) commented Jan 24, 2023

This pull request implements the processing server (OCR-D/spec#222). The processing server starts a RabbitMQ instance, a MongoDB instance, and processing workers via SSH and Docker. It receives requests to run OCR-D processors; the jobs are enqueued, and the workers read them from the queue and execute them. The jobs are stored in MongoDB, so status changes of the jobs are reflected in the database and can be queried through the processing server's API.

To start a worker, it must be defined in the configuration file, where the worker's host has to be set. The worker executables (e.g. ocrd-dummy) are expected to be available in PATH on that host; PATH can be modified via ~/.bash_profile or ~/.profile.

Example configuration file:

process_queue:
  address: localhost
  port: 5672
  credentials:
    username: testuser
    path_to_privkey: /home/testuser/.ssh/testuser.key
  ssh:
    username: testuser
    path_to_privkey: /home/testuser/.ssh/testuser.key
database:
  address: localhost
  port: 27018
  credentials:
    username: admin
    password: admin
  ssh:
    username: testuser
    path_to_privkey: /home/testuser/.ssh/testuser.key
hosts:
  - address: localhost
    username: testuser
    path_to_privkey: /home/testuser/.ssh/testuser.key
    workers:
      - name: ocrd-dummy
        number_of_instance: 1
        deploy_type: native
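
For orientation, here is a small sketch (not part of this PR) of how such a configuration could be inspected programmatically. The file name is hypothetical and the keys simply mirror the example above; validation in core itself may differ.

```python
# Minimal sketch: load the example configuration with PyYAML and list the
# queue, database and the workers that would be deployed per host.
# 'processing-server-config.yml' is a hypothetical file name.
import yaml

with open('processing-server-config.yml') as f:
    config = yaml.safe_load(f)

print('queue:', config['process_queue']['address'], config['process_queue']['port'])
print('database:', config['database']['address'], config['database']['port'])

for host in config.get('hosts', []):
    for worker in host.get('workers', []):
        print(f"{host['address']}: {worker['name']} "
              f"x{worker.get('number_of_instance', 1)} ({worker['deploy_type']})")
```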

Example calls for the processing-server endpoints:

Run a processor:
curl 'http://localhost:8080/processor/ocrd-dummy' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "path": "/home/testuser/.local/share/ocrd-workspaces/data/mets.xml", "input_file_grps": ["OCR-D-IMG"], "output_file_grps": ["IMG-COPY-1"], "parameters": { "copy_files": true } }'

Request processor status:
curl 'http://localhost:8080/processor/ocrd-dummy/<insert-job-id-here>'

List available processors:
curl 'http://localhost:8080/processor'

Get information about a single processor:
curl 'http://localhost:8080/processor/ocrd-dummy'
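
The same endpoints can also be driven from Python. The following is a minimal sketch assuming the `requests` library, the server and workspace path from the examples above, and that the job response exposes the job id and state under keys named `job_id` and `state`; the exact response model may differ.

```python
# Minimal sketch (not part of this PR): submit a job to the processing server
# and poll its status. Assumes the server from the examples above is running on
# localhost:8080; the response field names 'job_id' and 'state' are assumptions.
import time
import requests

base = 'http://localhost:8080'

job_request = {
    'path': '/home/testuser/.local/share/ocrd-workspaces/data/mets.xml',
    'input_file_grps': ['OCR-D-IMG'],
    'output_file_grps': ['IMG-COPY-1'],
    'parameters': {'copy_files': True},
}

response = requests.post(f'{base}/processor/ocrd-dummy', json=job_request)
response.raise_for_status()
job = response.json()
job_id = job.get('job_id', job.get('_id'))  # key name is an assumption

# Poll the job status until it is no longer queued or running
# (state values are assumptions as well)
while True:
    status = requests.get(f'{base}/processor/ocrd-dummy/{job_id}').json()
    print(status)
    if status.get('state') not in ('QUEUED', 'RUNNING'):
        break
    time.sleep(2)
```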

joschrew and others added 30 commits December 7, 2022 16:25
We decided to use only single quotes for strings to make it consistent.
I kept docstrings in triple double quotes because of PEP 257.
They were added to give additional information but are no longer needed in this place
…-processing-worker

add bashlib processing worker, require Python 3.7
@MehmedGIT self-requested a review March 27, 2023 15:41
@MehmedGIT (Contributor) left a comment


Should be good to be merged now!

Comment on lines 230 to 233
try:
    # Only checks if the process queue exists, if not raises ValueError
    self.rmq_publisher.create_queue(processor_name, passive=True)
except ChannelClosedByBroker as error:

I was expecting this to raise just a ValueError according to the doc if the queue doesn't exist; however, that does not seem to be the case. The channel is closed.

Comment on lines 239 to 242
finally:
    # Reconnect publisher - not efficient, but works
    # TODO: Revisit when reconnection strategy is implemented
    self.connect_publisher(enable_acks=True)

Hence, a reconnection is needed here. I know that it's more efficient to just open a new channel, but I don't want to deal with that now; it will be more appropriate to invest the time in implementing the reconnection scheme correctly inside rabbitmq_utils.
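
To illustrate the behaviour discussed in these two threads, here is a minimal, self-contained sketch using pika directly (not the PR's rmq_publisher wrapper): a passive declaration of a missing queue makes the broker close the channel with ChannelClosedByBroker, so a fresh channel or connection is needed afterwards, which is why the publisher is reconnected in the finally block above.

```python
# Minimal sketch (not the PR's implementation): check whether a queue exists by
# declaring it passively. If the queue is missing, RabbitMQ closes the channel
# with ChannelClosedByBroker (404), so a new channel/connection is required
# before publishing anything else.
import pika
from pika.exceptions import ChannelClosedByBroker


def queue_exists(params: pika.ConnectionParameters, queue_name: str) -> bool:
    connection = pika.BlockingConnection(params)
    try:
        channel = connection.channel()
        # passive=True only checks for existence, it never creates the queue
        channel.queue_declare(queue=queue_name, passive=True)
        return True
    except ChannelClosedByBroker:
        return False
    finally:
        connection.close()


if __name__ == '__main__':
    params = pika.ConnectionParameters(host='localhost', port=5672)
    print(queue_exists(params, 'ocrd-dummy'))
```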

Fix the silly mistake I made.
MehmedGIT added a commit that referenced this pull request Mar 28, 2023
@MehmedGIT requested a review from bertsky March 28, 2023 10:23
@kba (Member) commented Mar 29, 2023

Okay, another note that will potentially be raised soon. Currently, the supported processing-worker CLI is still the old one:

> > ```shell
> > # 1.
> > $ ocrd processing-worker --queue= --database=
> >
> > # 2.
> > $ --queue= --database=
> > ```
>
> We (me and @joschrew) are aware that in the [spec](https://github.com/OCR-D/spec/blob/master/web_api.md) there is a change that is not adapted here yet:
>
> ```shell
> # 1. Use ocrd CLI bundled with OCR-D/core
> $ ocrd server --type=worker --queue= --database=
>
> # 2. Use processor name
> $ --server --type=worker --queue= --database=
> ```
>
> Where the `--type` can be either `worker` (processing worker)
>
> ```shell
> ocrd server --type=worker --queue= --database=
> ```
>
> or `server` (the REST API wrapper based on #884):
>
> ```shell
> ocrd server --type=server --database=
> ```
>
> However, it's a bad idea to extend the processing worker code to support the REST API processing server (aka a standalone processing worker that has nothing to do with the bigger Processing Server in the spec and does not need the queues).
>
> Now, try to imagine a good way to explain to the OCR-D community without confusing them:
>
> 1. why the `` is a server, but is not a server when `--type=worker`, is referred to with `processing-worker`, and no direct requests can be sent to it
> 2. why the `` is a server, an actual server, when `--type=server` and is referred to with `processing-server`
> 3. why both are grouped together under `ocrd server ...` or `... --server ...` but potentially implemented together under `processing_worker.py`
> 4. why the Processing Server (PS-big) and processing servers (PS-small), standalone ocrd processors, are different concepts but both referred to with `processing server`.
>
> Sounds confusing? It's even more so when it has to be implemented in a clean way. So there are 3 main reasons not to adapt that yet:
>
> * The priority changed a bit, and the higher priority now is to deploy a working Processing Server, Workflow Server, and Workspace Server together on a live VM instance, so KITODO can use that to continue their development.
> * Identify and fix problems that arise when combining the 3 servers above from 2 different repositories ([Processing-Server #974](https://github.com//pull/974) and the reference WebAPI impl).
> * To not complicate the current implementation without first trying to think how to separate the concepts properly and avoid potential problems. Ideally:
>   1. the standalone processing servers (aka server processing workers) should be implemented as a separate class (name suggestions?), a sibling of `ProcessingWorker`, and both will share the common methods.
>   2. the CLI for both should be separated with improved naming conventions.

continued in #1032

Co-authored-by: Robert Sachunsky <[email protected]>