rewrite: Pythonic, ocrd v3, utilise page-level annotation #28

Open
wants to merge 29 commits into
base: master
ce689f7
Pythonic (and ocrd>=3.0) rewrite
bertsky Feb 17, 2025
c2f874d
re-add ocrd-tool.json as symlink
bertsky Feb 17, 2025
1971737
add params image_feature_filter/selector
bertsky Feb 18, 2025
6a82bfb
fix outline coordinates (by updating ' Page/@image_*')
bertsky Feb 18, 2025
f161ef5
multipage: raise instead of log when gs fails
bertsky Feb 18, 2025
ae8d86e
multipage metadata: utilise more DOCINFO (Document Information Dictio…
bertsky Feb 18, 2025
14b2f9f
check if any text exists on textequiv_level, warn if not
bertsky Feb 18, 2025
6076a96
add parameter 'multipage_only', removing single-page files finally
bertsky Feb 18, 2025
9d3e30b
title metadata: avoid relatedItem
bertsky Feb 19, 2025
69786ce
producer metadata: use pkg name and version
bertsky Feb 19, 2025
7d8e141
pagelabel parameter: add pagelabel value (using @ORDER/LABEL)
bertsky Feb 19, 2025
dda1768
add ALTO2PDF processor (converting ALTO→PAGE first, using 2. input fi…
bertsky Feb 20, 2025
a779985
multipage: escape/encode strings properly
bertsky Feb 21, 2025
ef46e63
multipage: add MODS as extra XMP file, add logical structMap as bookm…
bertsky Feb 21, 2025
7c89da8
finalize processor docstrings
bertsky Feb 21, 2025
4cb800c
update makefile/dockerfile
bertsky Feb 21, 2025
1918a0e
multipage: do not string-format MODS XMP stream, but do avoid non-ASC…
bertsky Feb 21, 2025
4f6e2e3
multipage: fix MODS author retrieval
bertsky Feb 21, 2025
a1cba43
multipage: make logical structMap parser more robust
bertsky Feb 21, 2025
0591030
add some fonts as downloadable resources
bertsky Feb 21, 2025
d04fb69
refactor to avoid get_physical_pages on ClientSideOcrdMets
bertsky Feb 21, 2025
dfaa249
add basic tests
bertsky Feb 21, 2025
181f9b3
altotopdf: improve logging
bertsky Feb 22, 2025
e649b8f
tests: work around core#1149 by downloading remotely
bertsky Feb 22, 2025
5a0c3ce
tests: add some METS URLs, test all config / workspace combinations, …
bertsky Feb 22, 2025
df5bfa4
update readme, add CI+CD
bertsky Mar 4, 2025
057be92
setuptools: adapt pkg discovery to repo subdirectory
bertsky Mar 4, 2025
7f0d04c
fix+improve dockerfile
bertsky Mar 4, 2025
7021614
prepare for Github transfer UB-Mannheim→OCR-D
bertsky Mar 5, 2025
47 changes: 47 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,47 @@
# Continuous integration for ocrd_pagetopdf

name: Python CI

on:
  push:
    branches: [ "master" ]
  pull_request:
  workflow_dispatch:
    inputs:
      upterm-session:
        description: 'Run SSH login server for debugging'
        default: False
        type: boolean

jobs:
  ci_test:
    name: CI build and test
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ['3.8', '3.9', '3.10', '3.11', '3.12']

    steps:
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v4
      with:
        python-version: ${{ matrix.python-version }}
    - name: Setup upterm session
      # interactive SSH logins for debugging
      if: github.event.inputs.upterm-session == 'true'
      uses: lhotari/action-upterm@v1
    - name: Install dependencies
      run: |
        sudo make deps-ubuntu
        make deps
    - name: Install package
      run: |
        python3 --version
        make install
        pip check
    - name: Run tests
      run: |
        pip install pytest
        make test
47 changes: 47 additions & 0 deletions .github/workflows/docker.yml
@@ -0,0 +1,47 @@
name: Dockerhub CD

on:
  push:
    branches: [ "master" ]
  workflow_dispatch:
    inputs:
      docker-tagname:
        description: Tag name of the Docker image
        default: 'ocrd/pagetopdf'

env:
  DOCKER_TAGNAME: ${{ github.event.inputs.docker-tagname || 'ocrd/pagetopdf' }}

jobs:

  build:

    runs-on: ubuntu-latest
    permissions:
      packages: write
      contents: read

    steps:
    - uses: actions/checkout@v4
    - # Activate cache export feature to reduce build time of image
      name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v2
    - name: Build the Docker image
      run: make docker DOCKER_TAG=${{ env.DOCKER_TAGNAME }}
    - name: Login to Dockerhub
      uses: docker/login-action@v2
      with:
        username: ${{ secrets.DOCKERHUB_USERNAME }}
        password: ${{ secrets.DOCKERHUB_PASSWORD }}
    - name: Push image to Dockerhub
      run: docker push ${{ env.DOCKER_TAGNAME }}
    - name: Alias the Docker image for GHCR
      run: docker tag ${{ env.DOCKER_TAGNAME }} ghcr.io/${{ github.repository }}
    - name: Login to GitHub Container Registry
      uses: docker/login-action@v2
      with:
        registry: ghcr.io
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    - name: Push image to Github Container Registry
      run: docker push ghcr.io/${{ github.repository }}
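
Note on the DOCKER_TAGNAME expression in docker.yml: the `||` operator is meant to fall back to 'ocrd/pagetopdf' whenever the workflow is not started via workflow_dispatch (for example on a push to master, where no inputs exist). The following is only a rough Python sketch of that fallback behaviour; the function name `resolve_docker_tag` is made up for illustration and is not part of the PR.

```python
from typing import Mapping, Optional

# Default taken from the workflow file above.
DEFAULT_TAG = "ocrd/pagetopdf"


def resolve_docker_tag(inputs: Optional[Mapping[str, str]]) -> str:
    """Mimic `inputs.docker-tagname || 'ocrd/pagetopdf'` (illustrative only)."""
    # On a push event there are no workflow_dispatch inputs at all.
    value = (inputs or {}).get("docker-tagname")
    # A missing or empty value is falsy, so the default is used instead.
    return value or DEFAULT_TAG
```

With this reading, a manual run may override the tag, while every automatic run on master pushes to the default tag.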
29 changes: 29 additions & 0 deletions .github/workflows/pypi.yml
@@ -0,0 +1,29 @@
# This workflow builds the Python package and uploads it to PyPI when a release is published
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: PyPI CD

on:
  release:
    types: [published]
  workflow_dispatch:

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.8'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel build twine
        pip install -r requirements.txt
    - name: Build and publish
      env:
        TWINE_USERNAME: __token__
        TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
      run: |
        # build sdist and wheel into dist/ first, so there is something to upload
        python -m build .
        twine upload --verbose dist/ocrd*${{ github.ref_name }}*{tar.gz,whl}
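
Note on the upload pattern in pypi.yml: the shell expands `{tar.gz,whl}` into two glob patterns, and both contain the release ref name literally. The sketch below imitates that matching in Python to show which files would be selected. The file names and version numbers are hypothetical, and `upload_patterns` / `matching_files` are illustrative helpers, not code from the PR.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def upload_patterns(ref_name: str) -> List[str]:
    """Expand dist/ocrd*<ref>*{tar.gz,whl} the way brace expansion would."""
    return [f"dist/ocrd*{ref_name}*{suffix}" for suffix in ("tar.gz", "whl")]


def matching_files(ref_name: str, candidates: Iterable[str]) -> List[str]:
    """Return the candidate paths that at least one expanded pattern matches."""
    patterns = upload_patterns(ref_name)
    return [
        path
        for path in candidates
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


# Hypothetical build output, for illustration only.
example_dist = [
    "dist/ocrd_pagetopdf-1.2.0.tar.gz",
    "dist/ocrd_pagetopdf-1.2.0-py3-none-any.whl",
    "dist/notes.txt",
]
```

Under these assumptions, a release named `1.2.0` selects both distribution files, whereas a release named `v1.2.0` selects nothing, because the built file names would not contain the leading `v`.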
55 changes: 55 additions & 0 deletions .gitignore
@@ -0,0 +1,55 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
/lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# vim tmp
*.swp
*.swo

# emacs bkup
*~

# temporary clone of assets
tests/assets/
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "repo/assets"]
	path = repo/assets
	url = https://github.com/OCR-D/assets
30 changes: 21 additions & 9 deletions Dockerfile
@@ -1,26 +1,38 @@
FROM ocrd/core
ARG DOCKER_BASE_IMAGE
FROM $DOCKER_BASE_IMAGE

ARG VCS_REF
ARG BUILD_DATE
LABEL \
maintainer="https://ocr-d.de/kontakt" \
org.label-schema.vcs-ref=$VCS_REF \
org.label-schema.vcs-url="https://github.com/UB-Mannheim/ocrd_pagetopdf" \
org.label-schema.vcs-url="https://github.com/OCR-D/ocrd_pagetopdf" \
org.label-schema.build-date=$BUILD_DATE

ENV DEBIAN_FRONTEND noninteractive
ENV PREFIX=/usr/local

RUN apt-get update && apt-get install -y openjdk-8-jdk-headless wget git gcc unzip
# avoid HOME/.local/share (hard to predict USER here)
# so let XDG_DATA_HOME coincide with fixed system location
# (can still be overridden by derived stages)
ENV XDG_DATA_HOME /usr/local/share
# avoid the need for an extra volume for persistent resource user db
# (i.e. XDG_CONFIG_HOME/ocrd/resources.yml)
ENV XDG_CONFIG_HOME /usr/local/share/ocrd-resources

WORKDIR /build
COPY ptp ./ptp
COPY ocrd-pagetopdf .
WORKDIR /build/ocrd_pagetopdf
COPY pyproject.toml .
COPY ocrd_pagetopdf ./ocrd_pagetopdf
COPY ocrd-tool.json .
COPY requirements.txt .
COPY README.md .
COPY Makefile .
RUN make install PREFIX=/usr/local SHELL="bash -x"
# prepackage ocrd-tool.json as ocrd-all-tool.json
RUN python -c "import json; print(json.dumps(json.load(open('ocrd-tool.json'))['tools'], indent=2))" > $(dirname $(ocrd bashlib filename))/ocrd-all-tool.json
# install everything and reduce image size
RUN make deps-ubuntu deps install \
&& rm -fr /build/ocrd_pagetopdf

WORKDIR /data
ENV DEBIAN_FRONTEND teletype
CMD ["/usr/local/bin/ocrd-pagetopdf", "--help"]
VOLUME ["/data"]

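
Note on the `python -c` one-liner in the Dockerfile: it reads ocrd-tool.json, keeps only the value under the "tools" key, and writes that as ocrd-all-tool.json next to the bashlib file. A more readable sketch of the same transformation follows. The function name `tools_section` and the sample JSON are assumptions for illustration; the real ocrd-tool.json is not shown in this diff.

```python
import json


def tools_section(ocrd_tool_text: str) -> str:
    """Return the "tools" mapping of an ocrd-tool.json document as indented JSON."""
    tools = json.loads(ocrd_tool_text)["tools"]
    # Same formatting as in the Dockerfile one-liner: two-space indentation.
    return json.dumps(tools, indent=2)


# Hypothetical, heavily shortened ocrd-tool.json content.
sample = '{"version": "0.0.0", "tools": {"ocrd-pagetopdf": {"executable": "ocrd-pagetopdf"}}}'
print(tools_section(sample))
```

Top-level keys other than "tools" (such as the version) are dropped by this step.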
92 changes: 50 additions & 42 deletions Makefile
@@ -1,64 +1,70 @@
PROJECT_NAME := ocrd_pagetopdf
SCRIPTS = ocrd-pagetopdf
DOCKER_TAG = ocrd/pagetopdf

PIP ?= $(shell which pip)

# Directory to install to ('$(PREFIX)')
PREFIX ?= $(if $(VIRTUAL_ENV),$(VIRTUAL_ENV),/usr/local)
PYTHON ?= python3
PIP ?= pip3

BINDIR = $(PREFIX)/bin
SHAREDIR = $(PREFIX)/share/$(PROJECT_NAME)

# BEGIN-EVAL makefile-parser --make-help Makefile
DOCKER_BASE_IMAGE = docker.io/ocrd/core:v3.0.4
DOCKER_TAG = ocrd/pagetopdf

help:
@echo ""
@echo " Targets"
@echo ""
@echo " deps-ubuntu Install system packages (on Debian/Ubuntu)"
@echo " deps Install python packages"
@echo " install Install the executable in $(PREFIX)/bin and the ocrd-tool.json to $(SHAREDIR)"
@echo " uninstall Uninstall scripts and $(SHAREDIR)"
@echo " deps-ubuntu Install system dependencies (on Debian/Ubuntu)"
@echo " deps Install Python dependencies via $(PIP)"
@echo " install Install the Python package via $(PIP)"
@echo " install-dev Install in editable mode"
@echo " build Build source and binary distribution"
@echo " docker Build Docker image"
@echo " test Run tests via Pytest"
@echo " repo/assets Clone OCR-D/assets to ./repo/assets"
@echo " tests/assets Setup test assets"
@echo ""
@echo " Variables"
@echo ""
@echo " PREFIX Directory to install to ('$(PREFIX)')"

# END-EVAL
@echo " DOCKER_TAG Docker container tag ($(DOCKER_TAG))"
@echo " PYTEST_ARGS Additional runtime options for pytest ($(PYTEST_ARGS))"
@echo " (See --help, esp. custom option --workspace)"

# Install system packages (on Debian/Ubuntu)
deps-ubuntu:
apt-get install -y python3 python3-venv default-jre-headless ghostscript
apt-get install -y python3 python3-venv default-jre-headless ghostscript git

# Install python packages
deps:
$(PIP) install ocrd # needed for ocrd CLI (and bashlib)
$(PIP) install -r requirements.txt

# Install the executable in $(PREFIX)/bin and the ocrd-tool.json to $(SHAREDIR)
install:
mkdir -p $(BINDIR)
for script in $(SCRIPTS);do \
sed 's,^SHAREDIR.*,SHAREDIR="$(SHAREDIR)",' $$script > $(BINDIR)/$$script ;\
chmod a+x $(BINDIR)/$$script ;\
done
mkdir -p $(SHAREDIR)
cp ocrd-tool.json $(SHAREDIR)
cp -r ptp $(SHAREDIR)
ifeq ($(findstring $(BINDIR),$(subst :, ,$(PATH))),)
@echo "you need to add '$(BINDIR)' to your PATH"
else
@echo "you already have '$(BINDIR)' in your PATH. good job."
endif
$(PIP) install .

# Uninstall scripts and $(SHAREDIR)
uninstall:
for script in $(SCRIPTS);do \
rm --verbose --force "$(BINDIR)/$$script";\
done
rm -rfv $(SHAREDIR)
make -C ocr-pagetopdf PREFIX=$(PREFIX) uninstall
install-dev:
$(PIP) install -e .

build:
$(PIP) install build wheel
$(PYTHON) -m build .

# TODO: once core#1149 is fixed, remove this line (so the local copy can be used)
test: export OCRD_BASEURL=https://github.com/OCR-D/assets/raw/refs/heads/master/data/
# Run test
test: tests/assets
$(PYTHON) -m pytest tests --durations=0 $(PYTEST_ARGS)

#
# Assets
#

# Update OCR-D/assets submodule
.PHONY: repos always-update tests/assets
repo/assets: always-update
git submodule sync --recursive $@
if git submodule status --recursive $@ | grep -qv '^ '; then \
git submodule update --init --recursive $@ && \
touch $@; \
fi

# Setup test assets
tests/assets: repo/assets
mkdir -p tests/assets
cp -a repo/assets/data/* tests/assets

# Build Docker image
docker:
@@ -67,3 +73,5 @@ docker:
--build-arg VCS_REF=$$(git rev-parse --short HEAD) \
--build-arg BUILD_DATE=$$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
-t $(DOCKER_TAG) .

.PHONY: help deps-ubuntu deps install install-dev docker
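
Note on the repo/assets rule in the Makefile: `git submodule status` prefixes each line with a space when the submodule is checked out at the recorded commit, with `-` when it is not initialized, with `+` when the checked-out commit differs, and with `U` on merge conflicts. The `grep -qv '^ '` test therefore succeeds as soon as any line does not start with a space, and only then is `git submodule update` run. A minimal Python sketch of that check, with a made-up function name and made-up commit hashes:

```python
def needs_update(status_output: str) -> bool:
    """Mirror `grep -qv '^ '`: true if any status line lacks the leading space."""
    return any(not line.startswith(" ") for line in status_output.splitlines())


# Hypothetical status lines, for illustration only.
in_sync = " 1a2b3c4 repo/assets (heads/master)\n"
uninitialized = "-1a2b3c4 repo/assets\n"
out_of_date = "+5d6e7f8 repo/assets (heads/master)\n"
```

Because the rule also depends on `always-update`, this check runs on every invocation, but the submodule is only touched when it is actually out of sync.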