Draft for slim containers #386

Closed · wants to merge 19 commits

Changes from 1 commit
41 changes: 41 additions & 0 deletions Makefile-slim
@@ -0,0 +1,41 @@
export PYTHON ?= python3
VIRTUAL_ENV = $(CURDIR)/venv2
BIN = $(VIRTUAL_ENV)/bin
ACTIVATE_VENV = $(BIN)/activate
OCRD_MODULES = OCRD_CIS OCRD_TESSEROCR
OCRD_CIS = ocrd-cis-ocropy-binarize ocrd-cis-ocropy-dewarp
OCRD_TESSEROCR = ocrd-tesserocr-recognize ocrd-tesserocr-segment-region
PROCESSORS = $(foreach mod,$(OCRD_MODULES),$(foreach proc,$($(mod)), $(proc) ))
DELEGATORS = $(foreach proc,$(PROCESSORS),$(BIN)/$(proc))

slim-venv: docker-compose.yaml .env $(DELEGATORS) | $(VIRTUAL_ENV)

Collaborator:

None of this would be needed if you added to the existing Makefile directly. We already have (sensible+configurable) definitions for VIRTUAL_ENV, OCRD_MODULES (not a variable but the true submodule name) and OCRD_EXECUTABLES (your PROCESSORS). There is even an existing delegator mechanism (used for sub-venvs on some modules).


# create a delegator to the processing server for the processor
$(BIN)/ocrd-%: | $(VIRTUAL_ENV)
	@sed "s/{{\s*processor_name\s*}}/$(subst $(BIN)/,,$@)/" slim-containers-files/delegator_template.py > $@;
	@chmod u+x $@


$(VIRTUAL_ENV): $(ACTIVATE_VENV)
	. $(ACTIVATE_VENV) && $(MAKE) -C core install

%/bin/activate:
	$(PYTHON) -m venv $(subst /bin/activate,,$@)
	. $@ && pip install --upgrade pip setuptools wheel

# append the service to docker-compose for a processor
add_proc = sed -e "s/{{\s*processor_name\s*}}/$1/" -e "s/{{\s*processor_group_name\s*}}/\L$2/" \
	slim-containers-files/docker-compose.processor.template.yaml >> docker-compose.yaml;

docker-compose.yaml:
	@cat slim-containers-files/docker-compose.template.yaml > docker-compose.yaml
	@$(foreach mod,$(OCRD_MODULES),$(foreach proc,$($(mod)),$(call add_proc,$(proc),$(mod))))

.env:
	@rm -rf .env
	@echo OCRD_PS_PORT=8000 >> .env
	@echo OCRD_PS_MTU=1300 >> .env
	@echo MONGODB_URL=mongodb://ocrd-mongodb:27017 >> .env
	@echo RABBITMQ_URL=amqp://admin:admin@ocrd-rabbitmq:5672 >> .env

13 changes: 13 additions & 0 deletions slim-containers-files/Dummy-Core-Dockerfile
@@ -0,0 +1,13 @@
# I need this because I need the network-for-slim branch, and this file contains a comment on how to
# make pudb run.
FROM ocrd/core:latest AS base
WORKDIR /build-ocrd
RUN apt install vim-tiny -y
RUN git clone https://github.com/ocr-d/core.git && \
cd core && \
git checkout network-for-slim-prep && \
#sed -i "290 i \ from pudb.remote import set_trace; set_trace(term_size=(160, 40), host='0.0.0.0', port=6900)" ocrd_network/ocrd_network/processing_server.py && \
make install-dev && \
pip install pudb
EXPOSE 6900
WORKDIR /data
24 changes: 24 additions & 0 deletions slim-containers-files/delegator_template.py
@@ -0,0 +1,24 @@
#!/usr/bin/env python
import sys
from pathlib import Path
import subprocess

# Later the address (or rather the port) should be dynamic
processing_server_address = "http://localhost:8000"
Collaborator:

Why not the exposed OCRD_PS_PORT here? Or even better, instead of the host side (localhost), we should be fine with the Docker network's DNS:

Suggested change:
-processing_server_address = "http://localhost:8000"
+processing_server_address = "http://ocrd-processing-server:8000"

Author:

How does that work (I mean the hostname part; I think I know how to set the port via env)? When I am on the host, it cannot resolve ocrd-processing-server out of the box. What do I have to change additionally so that I can query the container from the host via its service/host name?

Collaborator:

Not from the host (network), but from the virtual Docker network (i.e. from inside another container). See Compose documentation. (The port then is the container internal port BTW.)

Author:

After trying some things I think I cannot solve this properly. First, the hostname cannot be the service name because the delegator is run from the host. And I cannot read the port from .env because I don't know the working dir the delegator is executed from. So I decided to set the port dynamically in the Makefile, like the processor name. The proposed suggestion, at least, does not work. Commit: 7736969

Collaborator:

> First the hostname cannot be the service name because the delegator is run from the host.

Oh, right! Sorry, weak thinking on my side.

> And I cannot read the port from .env because I don't know the working dir where the delegator is executed from.

Ok, good point. However, we could have make write all the .env values into the venv's activate dynamically (i.e. setting the exact values actually used for the servers):

%/bin/activate:
	$(PYTHON) -m venv $(subst /bin/activate,,$@)
	. $@ && pip install --upgrade pip setuptools wheel
	@echo ". $(CURDIR)/.env" >> $@

IMO the .env should be the central source for configuration, so you should be able to modify it to your needs after it was generated.

Author (@joschrew, Jul 27, 2023):

I see a problem with that approach: it only works when the venv is activated in bash (not csh or fish), and invoking executables directly (e.g. venv/bin/ocrd-some-processor ...) would not work with it either. Btw, in the example code @echo ". $(CURDIR)/.env" >> $@ does not work as intended: the variables need to be exported to be available to a started Python process, e.g. the processor.
But I think it is a good idea to use the .env for the config. For now I have implemented another approach: in the Makefile the .env path is substituted into the delegator, which then reads the port from it. I know it is not ideal, so I think we should talk about this again and agree on a proper solution.
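For concreteness, the interim approach could look roughly like this in the generated delegator. This is a minimal sketch, assuming the Makefile substitutes the .env path into a placeholder; the placeholder name and the helper function are made up for illustration:

    # filled in by the Makefile via sed, analogous to {{ processor_name }}
    env_file = "{{ env_file_path }}"

    def read_ps_port(path, default="8000"):
        """Return OCRD_PS_PORT from the generated .env file, falling back to a default."""
        try:
            with open(path) as f:
                for line in f:
                    key, _, value = line.strip().partition("=")
                    if key == "OCRD_PS_PORT":
                        return value
        except OSError:
            pass
        return default

    processing_server_address = f"http://localhost:{read_ps_port(env_file)}"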

Author:

I have reverted the last commit regarding this and decided to use the proposed approach.
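With the .env values exported through the venv's activate script (the proposed approach), the delegator side would reduce to something like this sketch:

    import os

    # assumes the generated activate script exports the values from .env
    port = os.environ.get("OCRD_PS_PORT", "8000")
    processing_server_address = f"http://localhost:{port}"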

processor_name = "{{ processor_name }}"

args = list(sys.argv)
if "-m" in args:
    # rewrite the METS path to the path mounted inside the containers (/data)
    idx = args.index("-m")
    metspath = args[idx + 1]
    if Path(metspath).is_absolute():
        print("absolute path is not supported")
        exit(1)
    args[idx + 1] = f"/data/{metspath}"


cmd = [
"ocrd", "network", "client", "processing", "processor",
processor_name, "--address", processing_server_address
]
subprocess.run(cmd + args[1:])
Collaborator:

The Processing Server API is asynchronous, and so is its network client. So this CLI will not give the same user experience as the native CLI (for example, you cannot script these calls).

Thus, either we add some callback mechanism here and block until the job is done, or we switch to the Processor Server API which is synchronous (but has no client CLI yet).

Collaborator:

It could be as simple as passing "--callback-url", "http://localhost:" + PORT + "/" here. The PORT should be determined beforehand by an internal background task, like so:

PORT = None
async def serve_callback():
    nonlocal PORT  # valid once this lives inside the enclosing async function
    # set up web server
    def get(path):
        raise Exception("done")
    ...
    # run web server on arbitrary available port
    PORT = ...

server = asyncio.create_task(serve_callback())

Then, after the subprocess.run, you can simply block by doing await server. The whole thing must be inside a function that you run with asyncio.run(...).
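Put together, the asynchronous variant could look roughly like the following self-contained sketch (not the actual implementation; it assumes the client accepts the --callback-url option mentioned above):

    import asyncio
    import subprocess

    async def main(cmd):
        done = asyncio.Event()

        async def handle(reader, writer):
            # acknowledge the single job-done callback (the request body is ignored here)
            await reader.readuntil(b"\r\n\r\n")
            writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
            await writer.drain()
            writer.close()
            done.set()

        # run a tiny web server on an arbitrary free port
        server = await asyncio.start_server(handle, "0.0.0.0", 0)
        port = server.sockets[0].getsockname()[1]

        subprocess.run(cmd + ["--callback-url", f"http://localhost:{port}/"])
        await done.wait()   # block until the callback arrives
        server.close()
        await server.wait_closed()

    asyncio.run(main(["ocrd", "network", "client", "processing", "processor",
                      "ocrd-cis-ocropy-binarize", "--address", "http://localhost:8000"]))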

Author (@joschrew, Jul 19, 2023):

I have added a waiting mechanism with Python's built-in HTTPServer. I tried your proposed async approach but discarded it before I could completely finish. Async is probably possible, but it is considerably more complicated (which is why I opted out in the end), and I think a non-async HTTP server has the same effect; the waiting cost of running a "normal" HTTP server vs. an async one is negligible.
I cannot claim much more than a rough idea of how this works internally, but I think a Python process with an async server gets scheduled by the operating system just like any other process. The async machinery with polling and so on is "only" Python-internal. It would only save time if we had another async task running alongside that could execute while the server is waiting.
So I don't think async is helpful here anyway, which is why I use the synchronous server for waiting.
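For reference, the blocking wait described here can be sketched with the standard library alone. Again a minimal sketch under the same --callback-url assumption, not the exact code of this PR:

    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CallbackHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # drain the body and acknowledge the job-done callback
            length = int(self.headers.get("Content-Length", 0))
            self.rfile.read(length)
            self.send_response(200)
            self.end_headers()

    server = HTTPServer(("0.0.0.0", 0), CallbackHandler)  # port 0 = any free port
    callback_url = f"http://localhost:{server.server_port}/"

    cmd = ["ocrd", "network", "client", "processing", "processor",
           "ocrd-cis-ocropy-binarize", "--address", "http://localhost:8000",
           "--callback-url", callback_url]
    subprocess.run(cmd)
    server.handle_request()  # block until the single callback arrives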

14 changes: 14 additions & 0 deletions slim-containers-files/docker-compose.processor.template.yaml
@@ -0,0 +1,14 @@

  {{ processor_name }}:
    extends:
      file: slim-containers-files/{{ processor_group_name}}/docker-compose.yaml
Collaborator:

So IIUC in the final setup, when we have correct Dockerfiles and compose files in all modules, this will simply become {{ module_name }}/docker-compose.yaml?

    service: {{ processor_name }}
    command: ocrd network processing-worker --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue {{ processor_name }}
    depends_on:
      - ocrd-processing-server
      - ocrd-mongodb
      - ocrd-rabbitmq
    # restart: The worker creates its queue but rabbitmq needs a few seconds to be available
Comment on lines +7 to +11
Collaborator:

If timing is an issue, I suggest changing the dependency type:

Suggested change:
-    depends_on:
-      - ocrd-processing-server
-      - ocrd-mongodb
-      - ocrd-rabbitmq
-    # restart: The worker creates its queue but rabbitmq needs a few seconds to be available
+    depends_on:
+      - ocrd-processing-server
+      ocrd-mongodb:
+        condition: service_started
+      ocrd-rabbitmq:
+        condition: service_started

Author:

The underlying problem is the following: the container for the queue is started and running, but it needs 1-3 seconds before queue creation is possible, and the processing worker tries to create its queue right away.

This suggestion (service_started) does not work because docker-compose considers the queue service started, but in reality it is not ready to be used right away. I have a similar issue in another project and have already tried a few things to solve this. I think the only solution from the Docker side would be to implement a (manual) health check for the RabbitMQ container, but for that I'd have to extend the rabbitmq image, which I do not want.

For this PR to function, some extension to core is needed anyway. There I want to add an optional queue-creation timeout to the worker startup, so that it waits a few seconds before adding its queue or retries a few times. This restart fix was just the fastest way to get there, which is why it is here, and I agree that it should eventually be removed. I will mark this as resolved as soon as the needed changes to core are made (which require one change to this PR as well).
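The queue-creation retry described for core could look roughly like this hypothetical sketch (it assumes the worker talks to RabbitMQ via pika, which this PR does not show):

    import time
    import pika  # assumption: AMQP client used by the worker

    def create_queue_with_retry(rabbitmq_url, queue_name, attempts=5, delay=3.0):
        """Declare the worker's queue, retrying while RabbitMQ is still starting up."""
        for attempt in range(1, attempts + 1):
            try:
                connection = pika.BlockingConnection(pika.URLParameters(rabbitmq_url))
                connection.channel().queue_declare(queue=queue_name, durable=True)
                connection.close()
                return
            except pika.exceptions.AMQPConnectionError:
                if attempt == attempts:
                    raise
                time.sleep(delay)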

    restart: on-failure:3
    volumes:
      - "$PWD/data:/data"
34 changes: 34 additions & 0 deletions slim-containers-files/docker-compose.template.yaml
@@ -0,0 +1,34 @@
networks:
  default:
    driver: bridge
    driver_opts:
      com.docker.network.driver.mtu: ${OCRD_PS_MTU}

services:
  ocrd-processing-server:
    build:
      # later real ocrd-core image should be referenced here
      dockerfile: slim-containers-files/Dummy-Core-Dockerfile
      args:
        BASE_IMAGE: ubuntu:20.04
    ports:
      - ${OCRD_PS_PORT}:8000
    volumes:
      - "./slim-containers-files/ps-config.yaml:/ocrd-processing-server-config.yaml"
    command: ocrd network processing-server -a 0.0.0.0:8000 /ocrd-processing-server-config.yaml

  ocrd-mongodb:
    image: mongo
    # Ports are only needed during the implementation phase to test. To be removed later
    ports:
      - "27018:27017"

  ocrd-rabbitmq:
    image: rabbitmq:3-management
    # Ports are only needed during the implementation phase to test. To be removed later
    ports:
      - "5672:5672"
      - "15672:15672"
    environment:
      - "RABBITMQ_DEFAULT_USER=admin"
      - "RABBITMQ_DEFAULT_PASS=admin"
15 changes: 15 additions & 0 deletions slim-containers-files/ocrd_cis/Dockerfile
@@ -0,0 +1,15 @@
FROM ocrd/core:latest AS base
WORKDIR /build-ocrd
# Remove the next RUN, this is only to checkout my branch while the changes are not in core yet
RUN git clone https://github.com/ocr-d/core.git && \
cd core && \
git checkout network-for-slim-prep && \
make install

# Not based on ocrd_cis's "original" Dockerfile. That one seems out of date, and in ocrd_all ocrd_cis is
# simply installed with pip, so I do the same here
COPY ocrd_cis/ ./ocrd_cis/
COPY setup.py README.md LICENSE ocrd-tool.json Manifest.in ./
RUN pip install . && rm -rf /build-ocrd
# TODO: install models for ocrd-cis
WORKDIR /data
14 changes: 14 additions & 0 deletions slim-containers-files/ocrd_cis/docker-compose.yaml
@@ -0,0 +1,14 @@
services:
  ocrd-cis-ocropy-binarize:
    build:
      context: ../../ocrd_cis
      dockerfile: ../slim-containers-files/ocrd_cis/Dockerfile
    command:
      ocrd network processing-worker ocrd-cis-ocropy-binarize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue

  ocrd-cis-ocropy-dewarp:
    build:
      context: ../../ocrd_cis
      dockerfile: ../slim-containers-files/ocrd_cis/Dockerfile
    command:
      ocrd network processing-worker ocrd-cis-ocropy-dewarp --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
44 changes: 44 additions & 0 deletions slim-containers-files/ocrd_tesserocr/Dockerfile
Collaborator:

Why is this necessary at all? ocrd_tesserocr already contains a suitable Dockerfile...

Member:

I think it's just for proof-of-concept, so @joschrew does not need to keep multiple PRs in sync.

@@ -0,0 +1,44 @@
FROM ocrd/core:latest AS base
WORKDIR /build-ocrd-core
# Remove the next RUN, this is only to checkout my branch while the changes are not in core yet
RUN git clone https://github.com/ocr-d/core.git && \
cd core && \
git checkout network-for-slim-prep && \
make install

# copied from https://github.com/OCR-D/ocrd_tesserocr/blob/master/Dockerfile and modified
ARG VCS_REF
ARG BUILD_DATE
LABEL \
maintainer="https://ocr-d.de/kontakt" \
org.label-schema.vcs-ref=$VCS_REF \
org.label-schema.vcs-url="https://github.com/OCR-D/ocrd_tesserocr" \
org.label-schema.build-date=$BUILD_DATE

ENV DEBIAN_FRONTEND noninteractive
ENV PYTHONIOENCODING utf8

# avoid HOME/.local/share (hard to predict USER here)
# so let XDG_DATA_HOME coincide with fixed system location
# (can still be overridden by derived stages)
ENV XDG_DATA_HOME /usr/local/share

WORKDIR /build-ocrd
COPY setup.py .
COPY ocrd_tesserocr/ocrd-tool.json .
COPY README.md .
COPY requirements.txt .
COPY requirements_test.txt .
COPY ocrd_tesserocr ./ocrd_tesserocr
COPY Makefile .
RUN make deps-ubuntu && \
apt-get install -y --no-install-recommends \
g++ \
&& make deps install \
&& rm -rf /build-ocrd \
&& apt-get -y remove --auto-remove g++ libtesseract-dev make
RUN ocrd resmgr download ocrd-tesserocr-recognize Fraktur.traineddata
RUN ocrd resmgr download ocrd-tesserocr-recognize deu.traineddata

WORKDIR /data
VOLUME /data
14 changes: 14 additions & 0 deletions slim-containers-files/ocrd_tesserocr/docker-compose.yaml
@@ -0,0 +1,14 @@
services:
  ocrd-tesserocr-recognize:
    build:
      context: ../../ocrd_tesserocr
      dockerfile: ../slim-containers-files/ocrd_tesserocr/Dockerfile
    command:
      ocrd network processing-worker ocrd-tesseroc-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
Member:

Suggested change:
-      ocrd network processing-worker ocrd-tesseroc-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
+      ocrd network processing-worker ocrd-tesserocr-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue


  ocrd-tesserocr-segment-region:
    build:
      context: ../../ocrd_tesserocr
      dockerfile: ../slim-containers-files/ocrd_tesserocr/Dockerfile
    command:
      ocrd network processing-worker ocrd-tesserocr-segment-region --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
Member:

We should now use the ocrd-tesserocr-segment-region worker syntax instead to get instance caching, right?

12 changes: 12 additions & 0 deletions slim-containers-files/ps-config.yaml
@@ -0,0 +1,12 @@
process_queue:
  address: ocrd-rabbitmq
  port: 5672
  skip_deployment: true
  credentials:
    username: admin
    password: admin
database:
  address: ocrd-mongodb
  port: 27017
  skip_deployment: true
hosts: []