RFC: Debian/Ubuntu packaging of ocrd_all components and OCR models #130

Open
kba opened this issue Jul 23, 2020 · 4 comments

@kba
Member

kba commented Jul 23, 2020

Now that a solution to the conflicting dependency problem is imminent, we should discuss how we can reduce build times and simplify management of OCR models by supporting OS package management.

I see three areas where package management can improve ocrd_all:

  1. Providing packages for processors with full dependencies, e.g. with AppImage as @stweil proposed.
  2. Providing packages for compile-intensive components, i.e. tesseract and olena
  3. Packaging models, like the GT4HistOCR-based ones, for tesseract, calamari, ocropy and kraken

Ad 1.: The only way this can work without creating system-wide dependency conflicts would basically be a repackaging of the maximum Docker image. This is also of interest, and AppImage is probably a good solution.

Ad 2.: The scope is limited (tesseract and olena), @mikegerber has already built Debian/Ubuntu packages for olena, and @AlexanderP builds tesseract for a Launchpad PPA, so this would be relatively straightforward.

Ad 3.: For tesseract models we can take the official tesseract-ocr-* packages as a blueprint. ocropy and kraken models can also be packaged relatively easily. For calamari models, we should probably agree on a convention for where and how models should be stored (ping @maxnth @andbue @chreul if you already have ideas/plans in that regard).
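To make the tesseract-ocr-* blueprint a bit more concrete, here is a minimal sketch that stages a model-only package tree and writes its control file. The package name, version, maintainer, model file name, and the tessdata install path are all assumptions made up for the example, not agreed conventions. The staged directory could afterwards be handed to `dpkg-deb --build`.

```python
import tempfile
from pathlib import Path

# Assumed install location; the real path depends on how tesseract is packaged.
TESSDATA_DIR = "usr/share/tesseract-ocr/4.00/tessdata"


def control_text(package, version, depends, description):
    """Return the text of a minimal binary-package control file."""
    fields = [
        ("Package", package),
        ("Version", version),
        ("Architecture", "all"),  # models are data files, not compiled code
        ("Maintainer", "Packaging sketch <packaging@example.org>"),  # placeholder
        ("Depends", depends),
        ("Description", description),
    ]
    return "".join(f"{key}: {value}\n" for key, value in fields)


def stage_model_package(root, package, version, model_name, model_bytes):
    """Lay out DEBIAN/control plus the model file under root."""
    root = Path(root)
    (root / "DEBIAN").mkdir(parents=True)
    (root / "DEBIAN" / "control").write_text(
        control_text(package, version, "tesseract-ocr", f"Tesseract model {model_name}")
    )
    target = root / TESSDATA_DIR / f"{model_name}.traineddata"
    target.parent.mkdir(parents=True)
    target.write_bytes(model_bytes)
    return target


with tempfile.TemporaryDirectory() as tmp:
    pkg_root = Path(tmp) / "pkg"
    staged = stage_model_package(
        pkg_root,
        package="tesseract-ocr-gt4histocr",  # hypothetical package name
        version="0.1",                       # hypothetical version
        model_name="GT4HistOCR",             # hypothetical model file name
        model_bytes=b"placeholder, not a real model",
    )
    print(staged.relative_to(pkg_root))
```

The same layout would work for ocropy or kraken models by swapping the target directory and the `Depends` field, once a storage convention exists.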

The model packaging in particular would also be of benefit outside the OCR-D "ecosphere".

My questions for the ocrd_all users/developers:

  1. Which of the three approaches are worth exploring in your opinion?
  2. Who has experience in Debian/Ubuntu packaging and can help with setting up the tooling necessary?
  3. How should we distribute the models? A PPA seems like a straightforward choice but only supports Ubuntu (?), not Debian. Another proposal was https://packagecloud.io. Or could we build a repository as a GitHub Pages static site, or use GitHub releases as a pseudo-repository?
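For the static-site idea in question 3, the core of an APT repository is a `Packages` index with one stanza per .deb: the package's control fields plus `Filename`, `Size`, and checksums. The sketch below builds such a stanza by hand to show what a static host would need to serve. The package name and path are invented for illustration, and a usable repository would additionally need a `Release` file and a signature, which tools like `dpkg-scanpackages` or `reprepro` normally take care of.

```python
import hashlib


def packages_stanza(control_fields, filename, deb_bytes):
    """Build one stanza of an APT Packages index for a single .deb file.

    control_fields: dict of fields copied from the package's control file.
    filename: path of the .deb relative to the repository root.
    deb_bytes: raw content of the .deb, used for size and checksum.
    """
    lines = [f"{key}: {value}" for key, value in control_fields.items()]
    lines.append(f"Filename: {filename}")
    lines.append(f"Size: {len(deb_bytes)}")
    lines.append(f"SHA256: {hashlib.sha256(deb_bytes).hexdigest()}")
    return "\n".join(lines) + "\n"


# Hypothetical package; the bytes stand in for a real .deb file.
stanza = packages_stanza(
    {"Package": "tesseract-ocr-gt4histocr", "Version": "0.1", "Architecture": "all"},
    "pool/tesseract-ocr-gt4histocr_0.1_all.deb",
    b"not a real deb",
)
print(stanza)
```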

Feedback and pointers to solutions are very welcome.

@mikegerber
Contributor

mikegerber commented Jul 23, 2020

Quick-and-dirty recipe for an ocrd AppImage, to be built with pkg2appimage:

# Based on https://github.com/AppImage/pkg2appimage/blob/9249a99e653272416c8ee8f42cecdde12573ba3e/recipes/ProcDump.yml


app: ocrd

ingredients:
  dist: bionic
  sources:
    - deb http://us.archive.ubuntu.com/ubuntu/ bionic bionic-updates bionic-security main universe
    - deb http://us.archive.ubuntu.com/ubuntu/ bionic-updates main universe
    - deb http://us.archive.ubuntu.com/ubuntu/ bionic-security main universe
  packages:
    - python3.6-venv
  script:

script:
  - virtualenv --python=python3 usr
  - ./usr/bin/pip3 install ocrd
  - ./usr/bin/pip3 freeze | grep "^ocrd==" | cut -d "=" -f 3 > ../VERSION

  # XXX at least pkg2appimage needs a desktop file and an icon, might want to use something
  # else to build, but this is a POC, so...
  - mkdir -p usr/share/applications/
  - cat > usr/share/applications/ocrd.desktop <<\EOF
  - [Desktop Entry]
  - Name=ocrd
  - Exec=ocrd
  - Icon=ocrd
  - Comment=OCR-D core
  - Categories=Office;
  - Type=Application
  - Terminal=true
  - EOF
  - mkdir -p usr/share/icons/hicolor/512x512/apps/
  - touch usr/share/icons/hicolor/512x512/apps/ocrd.png # FIXME: empty placeholder icon
  - cp usr/share/icons/hicolor/512x512/apps/ocrd.png .
  - cp usr/share/applications/ocrd.desktop .

This has some quirks, such as the .desktop file, the icon, and the handling of the working directory, but it was pleasingly easy to build:

% ~/devel/app-image-ocrd/out/ocrd-2.12.2.glibc2.3.3-x86_64.AppImage workspace -d /tmp/actevedef_718448162 get-id 
http://resolver.staatsbibliothek-berlin.de/SBB00008F1000000000

(ugly bagit.py error message removed)
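The version in the AppImage file name comes from the `pip3 freeze | grep "^ocrd==" | cut -d "=" -f 3` step of the recipe. As a small illustration of why field 3 is the right one, here is the same extraction sketched in Python. The function name and the extra freeze lines are made up for the example.

```python
def ocrd_version(freeze_lines, package="ocrd"):
    """Return the pinned version of `package` from `pip freeze` output lines."""
    prefix = f"{package}=="
    for line in freeze_lines:
        # Equivalent of grep "^ocrd==": only the exact package name matches.
        if line.startswith(prefix):
            # Splitting "ocrd==2.12.2" on "=" yields ["ocrd", "", "2.12.2"];
            # index 2 here corresponds to field 3 in cut.
            return line.split("=")[2]
    return None


# Hypothetical freeze output; only the ocrd line mirrors the build shown above.
sample = ["click==7.1.2", "ocrd-utils==2.12.2", "ocrd==2.12.2"]
print(ocrd_version(sample))  # prints 2.12.2
```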

@mikegerber
Contributor

mikegerber commented Jul 29, 2020

My opinion(!) on this:

If OCR-D has everything either

  1. pip installable (for Python source)
  2. apt installable on Ubuntu LTS (everything else)
    a. OCR-D things not covered by pip
    b. binary dependencies like Olena or Tesseract

then - with a little experience - it is easy to build and maintain dependency-isolated AppImages or Docker containers. I would aim for this situation.

This way it's possible to:

  1. Just put an AppImage into /usr/local/bin and have a working processor
  2. If you so choose, you can still go wild and install everything "by hand"
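Point 1 amounts to copying one file and marking it executable. A minimal sketch, assuming a hypothetical helper name and using a temporary directory in place of /usr/local/bin so it runs without root:

```python
import shutil
import stat
import tempfile
from pathlib import Path


def install_appimage(appimage, bin_dir, command_name):
    """Copy an AppImage into bin_dir under command_name and make it executable."""
    bin_dir = Path(bin_dir)
    bin_dir.mkdir(parents=True, exist_ok=True)
    target = bin_dir / command_name
    shutil.copyfile(appimage, target)
    # Add the execute bits for user, group, and others, keeping existing bits.
    mode = target.stat().st_mode
    target.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder file standing in for a built AppImage.
    fake_appimage = Path(tmp) / "ocrd-x86_64.AppImage"
    fake_appimage.write_bytes(b"placeholder")
    installed = install_appimage(fake_appimage, Path(tmp) / "bin", "ocrd")
    print(installed.name, oct(installed.stat().st_mode & 0o111))
```

For a real installation, `bin_dir` would be /usr/local/bin and the call would need sufficient privileges.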

Packaging everything into classical Ubuntu packages will produce the same Gordian knot of dependency problems as the original ocrd_all concept. (I call it Gordian knot because I am currently upgrading ocrd_calamari to TF2 and now need TF2.3 to solve some issues → I am sure some other processor will have issues with that.)
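To illustrate the knot: as soon as two processors in one shared environment pin the same library to different versions, no single system-wide install can satisfy both. The sketch below finds such clashes in a table of pins. Only the ocrd_calamari/TF 2.3 entry reflects the situation described above; the second processor and its pin are hypothetical.

```python
from collections import defaultdict


def find_pin_conflicts(requirements):
    """Return packages pinned to more than one version.

    requirements: {processor: {package: version}}
    result: {package: {version: [processors]}} for conflicting packages only.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for processor, pins in requirements.items():
        for package, version in pins.items():
            seen[package][version].append(processor)
    return {
        package: dict(versions)
        for package, versions in seen.items()
        if len(versions) > 1
    }


pins = {
    "ocrd_calamari": {"tensorflow": "2.3"},
    "some_other_processor": {"tensorflow": "1.15"},  # hypothetical processor and pin
}
print(find_pin_conflicts(pins))
```

With one AppImage or container per processor, each entry in `pins` lives in its own environment and the conflict never has to be resolved.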

(There are some quirks with AppImage we should have a look at, but it looks really good.)

@mikegerber
Contributor

mikegerber commented Jul 29, 2020

(My fat container approach https://travis-ci.org/github/mikegerber/my_ocrd_workflow has the same Gordian knot, I just include fewer processors.)

@mikegerber
Contributor

And you can then still stick an AppImage into an Ubuntu package. It's a bit perverse but easy to do.

(Needs a bit more work if you have e.g. a classical ocrd_olena package and then another one that includes everything as an AppImage.)
