New processor API #1240

Open
wants to merge 245 commits into
base: master
245 commits
833dac7
deprecate Processor.process()
bertsky Jun 10, 2024
3f4c7f9
fix #274: no default -I / -O
bertsky Jun 10, 2024
d2b5df3
workspace.download: fix typo in exception
bertsky Jun 18, 2024
9827c4d
Processor: factor-out show_resource(), delegate to resolve_resource()
bertsky Jun 24, 2024
38fd4aa
Processor: add setup(), run once in get_processor()
bertsky Jun 24, 2024
580988a
ocrd_cli_wrap_processor: fix workspace arg (not a kwarg)
bertsky Jun 24, 2024
224dfc5
Processor: refactor processing API…
bertsky Jun 24, 2024
9714aab
DummyProcessor: re-implement via new process_page_*
bertsky Jun 24, 2024
e5d4736
run_processor: adapt to process→process_workspace
bertsky Jun 24, 2024
809a01b
test DummyProcessor: adapt to new `download` default by setting `down…
bertsky Jun 24, 2024
dfe7f8e
test DummyProcessor: override process_workspace() by delegating to pr…
bertsky Jun 24, 2024
1550668
test builtin ocrd-dummy: adapt to consistent filename
bertsky Jun 24, 2024
75809b1
test processor: adapt to `input_file_grp` required
bertsky Jun 24, 2024
c429da5
test processor: adapt to `self.workspace` only during run_processor
bertsky Jun 24, 2024
295cdb6
Workspace.save_image_file: add kwarg file_path for predetermined loca…
bertsky Jun 26, 2024
e2cbcb9
Processor.process_page_pcgts: add kwargs and allow returning derived …
bertsky Jun 26, 2024
20a6a1c
Workspace.save_image_file: save DPI metadata, too
bertsky Jun 26, 2024
679ad85
Workspace.image_from_*: annotate 'DPI' in result dict and ensure it's…
bertsky Jun 26, 2024
565a3d9
test_workspace: adapt to image_from_* DPI and add assertions
bertsky Jun 26, 2024
46f81aa
autoload ocrd-tool.json and version from dist, executable name from e…
bertsky Jul 6, 2024
4dd83aa
adapt to new Processor init (override metadata/version/executable name)
bertsky Jul 6, 2024
4cafbcc
tests: adapt to new Processor init (override metadata/version/executa…
bertsky Jul 6, 2024
9c9a4c9
generate_processor_help: include process_workspace docstring, too
bertsky Jul 29, 2024
aa0bd68
get_processor: also run setup if instance_caching
bertsky Aug 8, 2024
99d1628
ocrd-tool CLI: pass class in context
bertsky Aug 12, 2024
12231b8
use more specific exception if parameters are invalid
bertsky Aug 12, 2024
d112f8f
run_processor w/ mem_usage: pass as args tuple
bertsky Aug 12, 2024
319ceaa
Processor.process_workspace: add fileGrp assertions
bertsky Aug 12, 2024
80590a9
process_page_pcgts: add (variadic) type checks
bertsky Aug 12, 2024
68ae8ff
run_processor: fix typo
bertsky Aug 12, 2024
2a18883
Processor init: deprecate passing workspace
bertsky Aug 12, 2024
b9338b4
docs: fix relative VERSION path
bertsky Aug 12, 2024
6ca6a40
docs: do/not exclude tests/src
bertsky Aug 12, 2024
bc9ec05
docs: add ocrd_network module
bertsky Aug 12, 2024
54f1d88
docs:regenerated rST
bertsky Aug 12, 2024
67633f5
test_mets_server: fix arg vs kwarg
bertsky Aug 13, 2024
751a1fe
mets_server: ClientSideOcrdMets needs OcrdMets-like kwargs (without d…
bertsky Aug 13, 2024
86d9569
Processor/CLI decorator: :fire: separate kwargs and constructor…
bertsky Aug 13, 2024
1f6f0c8
Processor / ocrd-tool.json: :fire: fileGrp cardinality checks…
bertsky Aug 13, 2024
9b417d6
test_processor: adapt to Processor init changes
bertsky Aug 13, 2024
fbe83c9
adapt to ocrd-tool.json cardinality changes
bertsky Aug 13, 2024
09dd54b
use up-to-date kwargs (avoiding old deprecations)
bertsky Aug 13, 2024
af880e4
hide/test expected deprecation warnings
bertsky Aug 13, 2024
e381a0f
improve output in case of assertion failures
bertsky Aug 13, 2024
874b506
Set VERSION to upcoming 3.0.0a1
kba Aug 14, 2024
5ffe3cb
CircleCI: use version 2.1
bertsky Aug 14, 2024
dd3046e
Merge branch 'master' into new-processor-api
kba Aug 14, 2024
93a742e
test_bashlib: use version verbatim
bertsky Aug 14, 2024
5117684
.
kba Aug 14, 2024
456cc6d
fix make spec
kba Aug 14, 2024
e03a906
Merge branch 'new-processor-api' of https://github.com/bertsky/core i…
kba Aug 14, 2024
7a9fc27
adapt lib.bash to handle prerelease suffixes like a1, b2, rc3
kba Aug 14, 2024
90afb8a
process_page_pcgts must return OcrdProcessResult
kba Aug 14, 2024
70ad191
bashlib ocrd__minversion: compare prerelease suffix alphabetically
kba Aug 15, 2024
678158a
Merge pull request #7 from OCR-D/bashlib-version-yak-shaving
bertsky Aug 15, 2024
228272b
fix ocrd_tool.schema.yml cardinality oneOf syntax, update spec
bertsky Aug 15, 2024
5aba83b
bashlib: fix ocrd__minversion test syntax
bertsky Aug 15, 2024
3d094d6
reimplement OcrdPageResult
kba Aug 15, 2024
f8b6896
update spec (with new ocrd_tool.schema)
bertsky Aug 15, 2024
72eb75b
update spec to v3.25.0, ocrd_tool.schema.yml
kba Aug 15, 2024
75cb20c
process_page_file: fix handling of images
kba Aug 15, 2024
9a1c7ad
process_page_pcgts: remove output_file_id, replace OcrdPageResult.fil…
kba Aug 15, 2024
60ad424
OcrdPageResultImage requires passing alternative_image w/o filename set
kba Aug 15, 2024
50dfdd6
Processor.verify: handle -1 case
kba Aug 15, 2024
53f2634
processor.base: remove obsolete FIXME
kba Aug 15, 2024
d210afa
Processor.process_page_pcgts: update docstring for file_path/alternat…
kba Aug 15, 2024
5718cf9
export OcrdPageResult{Image} from ocrd.processor
kba Aug 15, 2024
f5f3145
Processor.process.page_pcgts: simplify references in docstring
bertsky Aug 15, 2024
db68bb5
Merge branch 'processor-result-object' of https://github.com/OCR-D/co…
kba Aug 15, 2024
7045318
allow "from ocrd_models import OcrdPage
kba Aug 15, 2024
a9dba73
Merge branch 'processor-result-object' into new-processor-api
kba Aug 15, 2024
3220e3f
:memo: v3.0.0a1
kba Aug 15, 2024
e1f5744
Update CHANGELOG.md
kba Aug 16, 2024
80d42f1
ocrd: more convenience imports
bertsky Aug 16, 2024
0e57b4b
ocrd.cli: more fix module import order, export help cmd
bertsky Aug 16, 2024
9cfd70c
fix imports
bertsky Aug 16, 2024
95212b5
fix type assertion
bertsky Aug 16, 2024
4aa288a
ocrd_utils: forgot to export scale_coordinates at toplvl
bertsky Aug 16, 2024
8044e60
fix 9cfd70cffcc
bertsky Aug 16, 2024
21ff810
fix 9cfd70cffcc (revert to wrong import order to avoid circle)
bertsky Aug 16, 2024
4077e8d
s,PcGtsType,OcrdPage,
kba Aug 16, 2024
cd4c96c
add config.OCRD_DOWNLOAD_INPUT
kba Aug 19, 2024
3125255
define self.logger in processor base constructor
kba Aug 19, 2024
0adb9fb
Merge branch 'master' into new-processor-api
kba Aug 19, 2024
dcf7c52
OcrdPage proxy object for PcGtsType, including etree and mappings
kba Aug 19, 2024
cf45d8b
Processor.base: have a (hopefully) thread-safe logger for the base class
kba Aug 19, 2024
785d607
Processor.zip_input_files: warning instead of exception for missing i…
bertsky Aug 20, 2024
b12849d
Processor.zip_input_files: introduce NonUniqueInputFile exception
bertsky Aug 20, 2024
95d3658
Processor.process_workspace: zip_input_files w/o require_first
bertsky Aug 20, 2024
2e2bda6
Merge remote-tracking branch 'origin/download-files-config-var' into …
bertsky Aug 20, 2024
c729841
Processor.zip_input_files: introduce MissingInputFile exception and c…
bertsky Aug 20, 2024
7df81af
OcrdPage: clearer docstring
kba Aug 20, 2024
0ab6942
jsonschema: switch from draft6 to draft2019-09
kba Aug 20, 2024
66c50b3
require jsonschema>4 for draft 2019-09
kba Aug 20, 2024
94e2e60
OcrdToolValidator: set defaults, handle deprecated
kba Aug 20, 2024
2e7bdc2
processor.base: validate/setdefault ocrd-tool.json on first access
kba Aug 20, 2024
346f166
update spec and ocrd_tool.schema.yml
kba Aug 20, 2024
577baa5
processor parameter decorator: no '{}' default (unnecessary)
bertsky Aug 20, 2024
f00ecda
Processor: add error handling…
bertsky Aug 20, 2024
fdd5d16
ocrd_utils.config: add variables to module docstring
bertsky Aug 21, 2024
6d87f9e
improve docstrings, re-generate docs
bertsky Aug 21, 2024
9942bbe
Processor.zip_input_files: more verbose log msg
bertsky Aug 21, 2024
8a584e9
test_processor: test for specific exception
bertsky Aug 21, 2024
8077d45
test_processor: fix missing import
bertsky Aug 21, 2024
6b68f7a
Merge pull request #12 from bertsky/new-processor-api-input-file-errors
bertsky Aug 21, 2024
1b4cd3c
Merge branch 'new-processor-api' into processor-logger
bertsky Aug 21, 2024
7f3bfa2
Merge pull request #10 from OCR-D/processor-logger
bertsky Aug 21, 2024
d4d40e3
Merge pull request #11 from OCR-D/ocrd-page-with-etree
bertsky Aug 21, 2024
111a52e
Merge pull request #13 from OCR-D/validate-ocrd-tool-runtime
bertsky Aug 21, 2024
cf7b193
OcrdPage: fix typeing typo
bertsky Aug 21, 2024
9af8670
dummy_processor: fix typos from logging
bertsky Aug 21, 2024
c6d9736
tests report.is_valid: improve output on failure
bertsky Aug 21, 2024
161cf0c
JsonValidator: fix deprecation warning (by actually checking instance)
bertsky Aug 21, 2024
b2e6485
predefine union types OcrdFileType and OcrdPageType
bertsky Aug 21, 2024
822d731
processor CLI --debug: set all to ABORT (not just MISSING_OUTPUT)
bertsky Aug 21, 2024
3a7a771
:memo: changelog
bertsky Aug 21, 2024
2bdb6c4
:package: v3.0.0a2
kba Aug 22, 2024
00bd6fe
remove make *-workaround, we will not do that for v3+
kba Aug 22, 2024
d777527
Processor.parameter: only validate when set…
bertsky Aug 22, 2024
7998aae
get_processor: ensure passing non-empty parameter, rely on `_setup` t…
bertsky Aug 22, 2024
cc8592b
test_processor: adapt, check required parameters
bertsky Aug 22, 2024
45e556d
improve _setup docstring
bertsky Aug 22, 2024
d4c802b
Processor._setup: raise with full ParameterValidator report
bertsky Aug 22, 2024
b28fefb
get_processor: parameter only as kwarg
bertsky Aug 22, 2024
642938b
tests: adapt for get_processor parameter only as kwarg
bertsky Aug 22, 2024
f5e5c54
Processor.parameter: make the bound dict read-only
bertsky Aug 22, 2024
f2d53a6
Processor.parameter: move ParameterValidator back to setter, convert …
bertsky Aug 22, 2024
7297ca2
Processor.parameter: frozendict instead of mappingproxy, add test
bertsky Aug 22, 2024
6cd4a34
introduce Processor.shutdown to be overridden (called at deinit or pa…
bertsky Aug 22, 2024
407bff8
Processor: introduce `max_instances` class attribute
bertsky Aug 23, 2024
c9fbb2c
get_cached_processor: set lru_cache maxsize from min(cfg,class) at ru…
bertsky Aug 23, 2024
9c212a9
test get_processor instance_caching w/ max_instances
bertsky Aug 23, 2024
a413f04
test get_processor instance_caching w/ clear_cache
bertsky Aug 23, 2024
870523c
:package: v3.0.0a2
kba Aug 22, 2024
20bb6d1
remove make *-workaround, we will not do that for v3+
kba Aug 22, 2024
faa59a8
Processor.metadata_location property to specify where in the package …
kba Aug 23, 2024
5819c81
Processor.verify: always check cardinality (as we now have the defaul…
bertsky Aug 23, 2024
4f88f1d
fix --log-filename (6fc606027a): apply in ocrd_cli_wrap_processor
bertsky Aug 24, 2024
d621f36
fix exception
bertsky Aug 24, 2024
4868fb1
adapt to PIL.Image moved constants
bertsky Aug 24, 2024
da72c0a
ocrd_utils: add parse_json_file_with_comments
bertsky Aug 24, 2024
ca78b94
cli.workspace: pass fileGrp as well, improve description
bertsky Aug 24, 2024
cf41745
OcrdMets.add_agent: does not have positional args
bertsky Aug 24, 2024
cadc6e6
remove misplaced kwargs from run_processor
bertsky Aug 24, 2024
7966057
Processor.metadata: refactor…
bertsky Aug 24, 2024
bba142d
bashlib input-files: adapt, allow passing ocrd-tool.json path and exe…
bertsky Aug 24, 2024
32cdc5a
add to pylint karma
bertsky Aug 24, 2024
a95f269
update pylintrc
bertsky Aug 24, 2024
50c088e
processor.metadata_location: use self.__module__ not __package__
kba Aug 24, 2024
ad8c76e
Merge pull request #17 from OCR-D/new-processor-api-parameter-setup
bertsky Aug 24, 2024
8211237
pylint: try ignoring generateds (again)
bertsky Aug 25, 2024
b53724e
Merge pull request #14 from bertsky/new-processor-api-parameter-setup
bertsky Aug 25, 2024
3e2700c
:memo: update changelog
bertsky Aug 25, 2024
342df58
test_bashlib: allow testing prereleases successfully
bertsky Aug 25, 2024
11ed8c5
Processor.process_page_file / OcrdPageResultImage: allow PageType ins…
bertsky Aug 25, 2024
69571fe
Merge branch 'master' into new-processor-api
kba Aug 26, 2024
77e31f2
:package: v3.0.0b1
kba Aug 26, 2024
d3ee57c
:fire: bad no good terrible hack to fix integration_test
kba Aug 26, 2024
0245f4b
generate_processor_help: avoid repeating docstrings from superclass
bertsky Aug 27, 2024
efe4201
Processor.process_workspace: abort anyway if too many failures (OCRD_…
bertsky Aug 27, 2024
fce7627
adapt tests for OCRD_MAX_MISSING_OUTPUTS
bertsky Aug 27, 2024
a50d0bb
Merge pull request #19 from OCR-D/new-processor-api-fix-editable
bertsky Aug 27, 2024
c08166e
Processor: add per-page timeouts and parallelism…
bertsky Aug 27, 2024
c3a8380
add tests for processor per-page timeout and parallelism
bertsky Aug 27, 2024
b1b7a49
:memo: update changelog
bertsky Aug 27, 2024
9b80ae1
ClientSideOcrdMets: use same logger name prefix as server
bertsky Aug 28, 2024
be6b59d
Processor: fix ignore (negative/zero) cases for max_workers / max_pag…
bertsky Aug 28, 2024
0b5286f
test_mets_server: use tmpdir to avoid side effects between suites
bertsky Aug 28, 2024
61e1042
test processor timeout/parallel: avoid side effects to dummy tool json
bertsky Aug 28, 2024
e395b56
tess: adapt to wording of exceptions
bertsky Aug 28, 2024
a59ba6a
ClientSideOcrdMets: partial revert of 9b80ae17ef
bertsky Aug 28, 2024
554a67d
disableLogging: re-instate root logger, to
bertsky Aug 28, 2024
1114cd9
test-logging: also remove ocrd.log from tempdir
bertsky Aug 28, 2024
ce6d239
Processor: fix 7966057f (deprecated passing of ocrd_tool or version v…
bertsky Aug 28, 2024
df99160
Processor.generate_processor_help: forgot to include --log-filename
bertsky Aug 28, 2024
eb74fab
bashlib: re-add --log-filename, implement as stderr redirect
bertsky Aug 28, 2024
8565a8f
test_processor: add legacy (v2-style) dummy case
bertsky Aug 28, 2024
abe069a
:memo: update changelog
bertsky Aug 28, 2024
11f9264
:memo: update readmes (esp. new config variables)
bertsky Aug 28, 2024
ca88122
:package: v3.0.0b2
kba Aug 30, 2024
837aba7
ocrd_utils.config: add reset_defaults()
bertsky Aug 29, 2024
85e96ff
add test for OcrdEnvConfig.reset_defaults()
bertsky Aug 29, 2024
8911c3b
Processor: improve processing log messages
bertsky Aug 30, 2024
98d97fc
ocrd.cli doc: don't rewrap description lists
bertsky Aug 30, 2024
cb758e8
:package: v3.0.0b3
kba Aug 30, 2024
1ed38a6
Processor.metadata_location: find location package prefix (necessary …
bertsky Aug 30, 2024
7d98c27
Processor: log when max_workers / max_page_seconds are in effect
bertsky Sep 1, 2024
6b23b65
Workspace.reload_mets: fix for METS server case
bertsky Sep 1, 2024
cac05cd
:memo: changelog
kba Sep 2, 2024
0b0d419
:package: v3.0.0b4
kba Sep 2, 2024
a34beb8
OcrdMetsServer.add_file: pass on 'force' kwarg, too
bertsky Sep 2, 2024
dfa715d
test_mets_server: add test for force (overwrite)
bertsky Sep 2, 2024
9a8c41d
test_processor: add test for force (overwrite) w/ METS Server
bertsky Sep 2, 2024
65ab63c
add typing, extend docs
kba Aug 26, 2024
73a395e
Processor.verify: revert 5819c816 (we still have no defaults in json …
bertsky Sep 5, 2024
3382ad9
Processor.process_page_file / OcrdPageResultImage: allow None instead…
bertsky Sep 5, 2024
cad4777
PcGts.Page.id / make_xml_id: replace '/' with '_'
bertsky Sep 13, 2024
10b2abc
ocrd.cli.ocrd-tool resolve-resource: fix (forgot to print result)
bertsky Sep 12, 2024
bd64444
processor CLI: delegate --resolve-resource, too
bertsky Sep 13, 2024
71e9841
METS Server: also export+delegate physical_pages
bertsky Sep 15, 2024
01ccdf1
ocrd.cli.workspace: consistently pass on --mets-server-url and --back…
bertsky Sep 13, 2024
3301f9c
ocrd.cli.workspace server: add 'reload' and 'save'
bertsky Sep 13, 2024
dc2c758
ocrd.cli.bashlib input-files: pass on --mets-server-url, too
bertsky Sep 12, 2024
42af6a3
ocrd.cli.validate tasks: pass on --mets-server-url, too
bertsky Sep 12, 2024
7ea8d57
Processor.process_workspace(): do not show NotImplementedError contex…
bertsky Sep 12, 2024
9751256
Processor.verify: check output fileGrps as well (or OCRD_EXISTING_OUT…
bertsky Sep 12, 2024
f66753a
run_processor: be robust if ocrd_tool is missing steps
bertsky Sep 12, 2024
eb12a80
lib.bash: fix errexit
bertsky Sep 12, 2024
3355ea4
lib.bash input-files: pass on --mets-server-url, --overwrite, and par…
bertsky Sep 12, 2024
f05f840
lib.bash input-files: do not try to validate tasks here (impossible t…
bertsky Sep 12, 2024
b5c1191
Processor / Workspace.add_file: always force if config.OCRD_EXISTING_…
bertsky Sep 12, 2024
cbe465a
test processors: no need for 'force' kwarg anymore
bertsky Sep 13, 2024
3e214ca
tests: make sure ocrd_utils.config gets reset whenever changing it gl…
bertsky Sep 13, 2024
c549c42
OcrdPage: add PageType.get_ReadingOrderGroups()
bertsky Sep 7, 2024
53b880f
update OcrdPage from generateds
bertsky Sep 7, 2024
687b06f
:package: v3.0.0b5
kba Sep 16, 2024
a43098e
:memo: improve b5 changelog
bertsky Sep 16, 2024
d2cb0fb
ocrd.cli.workspace: assert non-server in cmds mutating METS
bertsky Sep 16, 2024
f678dca
OcrdMets.get_physical_pages: cover return_divs w/o for_fileIds for_pa…
bertsky Sep 27, 2024
9064db0
ocrd.cli.workspace: use physical_pages if possible, fix default outpu…
bertsky Sep 27, 2024
9530fcd
Processor.process_page_file: avoid process_page_pcgts() if OCRD_EXIST…
bertsky Sep 27, 2024
31a8474
ocrd_utils.initLogging: also add handler to root logger (to be consis…
bertsky Oct 9, 2024
d7049b1
CLI decorator: only import ocrd_network when needed
bertsky Oct 10, 2024
a9d49c1
Processor w/ OCRD_MAX_PARALLEL_PAGES: ThreadPoolExecutor→ProcessPoolE…
bertsky Oct 10, 2024
588c91d
Processor.process_workspace: apply timeout on process_page_file worke…
bertsky Oct 17, 2024
d126bdc
Processor w/ OCRD_MAX_PARALLEL_PAGES: concurrent.futures→loky
bertsky Oct 17, 2024
afa7f30
Processor w/o OCRD_MAX_PARALLEL_PAGES: dummy instead of executor
bertsky Oct 19, 2024
5821701
ocrd.process.profile logger: account for subprocess CPU time, too
bertsky Oct 19, 2024
53b1854
Processor.process_workspace: improve reporting, raise early if too ma…
bertsky Oct 21, 2024
4d66e37
Processor: refactor process_workspace into overridable subfuncs
bertsky Oct 23, 2024
71d6d49
Processor.process_workspace_handle_page_task: do not handler sigint
bertsky Oct 30, 2024
d2d5290
Processor.process_workspace_handle_tasks: log nr of ignored exception…
bertsky Oct 30, 2024
7932a6a
Merge pull request #23 from bertsky/new-processor-api-process-worker
bertsky Oct 30, 2024
7d1503e
:package: v3.0.0b6
bertsky Oct 30, 2024
08a631c
tests: prevent side effects from ocrd_logging
bertsky Nov 7, 2024
f3e423a
initLogging: do not remove any previous handlers/levels
bertsky Nov 7, 2024
3143518
initLogging: only add root handler instead of multiple redundant hand…
bertsky Nov 7, 2024
27323c6
disableLogging: remove all handlers, reset all levels
bertsky Nov 7, 2024
eb3120d
setOverrideLogLevel: override all currently active loggers' level
bertsky Nov 7, 2024
0186c53
logging: increase default root (not ocrd) level from INFO to WARNING
bertsky Nov 7, 2024
5ba2720
Processor: update max_workers docstring
bertsky Nov 7, 2024
f8f71d8
initLogging: call disableLogging if already initialized and force_reinit
bertsky Nov 11, 2024
5f2f602
Processor: replace weakref with __del__ to trigger shutdown
bertsky Nov 11, 2024
0446b82
Processor parallel pages: log via QueueHandler in subprocess, QueueLi…
bertsky Nov 11, 2024
53c4c18
:package: v3.0.0b7
bertsky Nov 12, 2024
19 changes: 9 additions & 10 deletions .pylintrc
@@ -1,19 +1,22 @@
[MASTER]
extension-pkg-whitelist=lxml
ignored-modules=cv2,tesserocr,ocrd.model
extension-pkg-whitelist=lxml,pydantic
ignored-modules=cv2,tesserocr,ocrd_models.ocrd_page_generateds
ignore-paths=ocrd_page_generateds.py
ignore-patterns=.*generateds.*

[MESSAGES CONTROL]
ignore-patterns='.*generateds.*'
disable =
fixme,
E501,
line-too-long,
consider-using-f-string,
logging-fstring-interpolation,
trailing-whitespace,
logging-not-lazy,
inconsistent-return-statements,
disallowed-name,
invalid-name,
line-too-long,
missing-docstring,
no-self-use,
wrong-import-order,
too-many-nested-blocks,
superfluous-parens,
@@ -25,13 +28,9 @@ disable =
ungrouped-imports,
useless-object-inheritance,
useless-import-alias,
bad-continuation,
no-else-return,
logging-not-lazy

[FORMAT]
no-space-check=empty-line

[DESIGN]
# Maximum number of arguments for function / method
max-args=12
@@ -40,7 +39,7 @@ max-locals=30
# Maximum number of return / yield for function / method body
max-returns=12
# Maximum number of branch for function / method body
max-branchs=30
max-branches=30
# Maximum number of statements in function / method body
max-statements=60
# Maximum number of parents for a class (see R0901).
166 changes: 166 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,164 @@ Versioned according to [Semantic Versioning](http://semver.org/).

## Unreleased

## [3.0.0b7] - 2024-11-12

Fixed:
- `initLogging`: only add root handler instead of multiple redundant handlers with `propagate=false`
- `setOverrideLogLevel`: override all currently active loggers' level

Changed:
- :fire: logging: increase default root (not `ocrd`) level from `INFO` to `WARNING`
- :fire: `initLogging`: do not remove any previous handlers/levels, unless `force_reinit`
- :fire: `disableLogging`: remove all handlers, reset all levels - instead of being selective
- :fire: Processor: replace `weakref` with `__del__` to trigger `shutdown`
- :fire: `OCRD_MAX_PARALLEL_PAGES>1`: log via `QueueHandler` in subprocess, `QueueListener` in main
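
The `QueueHandler` / `QueueListener` split named in the last item can be illustrated with the standard library alone. This is a minimal sketch and not the code from this PR: worker threads stand in for the page subprocesses and `queue.Queue` stands in for an inter-process queue, and the names `ListHandler`, `worker` and `run_demo` are hypothetical.

```python
import logging
import logging.handlers
import queue
import threading


class ListHandler(logging.Handler):
    """Hypothetical sink that collects log messages in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def worker(log_queue, page_id):
    # Worker side: the only handler is a QueueHandler, so records are
    # merely enqueued here and never written out directly.
    logger = logging.getLogger(f"sketch.worker.{page_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.handlers.QueueHandler(log_queue)
    logger.addHandler(handler)
    try:
        logger.info("processing page %s", page_id)
    finally:
        logger.removeHandler(handler)


def run_demo(page_ids):
    log_queue = queue.Queue()
    sink = ListHandler()
    # Main side: the listener drains the queue and passes every record
    # on to the real handler(s).
    listener = logging.handlers.QueueListener(log_queue, sink)
    listener.start()
    threads = [threading.Thread(target=worker, args=(log_queue, page_id))
               for page_id in page_ids]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    listener.stop()  # handles everything still enqueued, then stops
    return sorted(sink.messages)
```

With real subprocesses, the queue would have to be one that can cross process boundaries; the division of labour between the two sides stays the same.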

## [3.0.0b6] - 2024-10-30

Fixed:
- `OcrdMets.get_physical_pages`: cover `return_divs` w/o `for_fileIds` and `for_pageIds`

Changed:
- :fire: `ocrd_utils.initLogging`: also add handler to root logger (as in file config),
but disable message propagation to avoid duplication
- only import `ocrd_network` in `src/ocrd/decorators/__init__.py` once needed
- `Processor.process_page_file`: skip computing `process_page_pcgts` if output already exists
and `OCRD_EXISTING_OUTPUT!=OVERWRITE`
- :fire: `OCRD_MAX_PARALLEL_PAGES>1`: switch from multithreading to multiprocessing, depend on
`loky` instead of stdlib `concurrent.futures`
- `OCRD_PROCESSING_PAGE_TIMEOUT>0`: actually enforce timeout within worker
- `OCRD_MAX_MISSING_OUTPUTS>0`: abort early (prospectively) as soon as too many pages have already failed
- `Processor.process_workspace`: split up into overridable sub-methods:
- `process_workspace_submit_tasks` (iterate input file group and schedule page tasks)
- `process_workspace_submit_page_task` (download input files and submit single page task)
- `process_workspace_handle_tasks` (monitor page tasks and aggregate results)
- `process_workspace_handle_page_task` (await single page task and handle errors)
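
The division into a submit phase and a handle phase can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: it uses the stdlib `ThreadPoolExecutor` (whereas the entries above describe a move to `loky`-based multiprocessing), and `process_page`, `submit_tasks`, `handle_tasks` and `run` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def process_page(page_id, seconds):
    """Hypothetical page task: pretend to work, then return an output ID."""
    time.sleep(seconds)
    return f"OUT_{page_id}"


def submit_tasks(executor, pages):
    # Schedule one task per page and keep the futures keyed by page ID.
    return {page_id: executor.submit(process_page, page_id, seconds)
            for page_id, seconds in pages.items()}


def handle_tasks(tasks, page_timeout):
    # Await each page task; sort outcomes into results and failures.
    done, failed = {}, []
    for page_id, future in tasks.items():
        try:
            done[page_id] = future.result(timeout=page_timeout)
        except FutureTimeout:
            failed.append(page_id)
    return done, failed


def run(pages, page_timeout, max_workers=2):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        tasks = submit_tasks(executor, pages)
        return handle_tasks(tasks, page_timeout)
```

Note that `future.result(timeout=...)` only stops waiting; the thread itself keeps running until it finishes. Actually interrupting a slow page requires enforcing the timeout inside the worker, which is what the `OCRD_PROCESSING_PAGE_TIMEOUT` entry above refers to.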


## [3.0.0b5] - 2024-09-16

Fixed:
- tests: ensure `ocrd_utils.config` gets reset whenever changing it globally
- `OcrdMetsServer.add_file`: pass on `force` kwarg
- `ocrd.cli.workspace`: consistently pass on `--mets-server-url` and `--backup`
- `ocrd.cli.validate "tasks"`: pass on `--mets-server-url`
- `ocrd.cli.bashlib "input-files"`: pass on `--mets-server-url`
- `lib.bash input-files`: pass on `--mets-server-url`, `--overwrite`, and parameters
- `lib.bash`: fix `errexit` handling
- `ocrd.cli.ocrd-tool "resolve-resource"`: forgot to actually print result

Changed:
- :fire: `Processor` / `Workspace.add_file`: always `force` if `OCRD_EXISTING_OUTPUT==OVERWRITE`
- :fire: `Processor.verify`: revert 3.0.0b1 enforcing cardinality checks (stay backwards compatible)
- :fire: `Processor.verify`: check output fileGrps, too
(must not exist unless `OCRD_EXISTING_OUTPUT=OVERWRITE|SKIP` or disjoint `--page-id` range)
- lib.bash `input-files`: do not try to validate tasks here (now covered by `Processor.verify()`)
- `run_processor`: be robust if `ocrd_tool` is missing `steps`
- `PcGtsType.PageType.id` via `make_xml_id`: replace `/` with `_`

Added:
- `OcrdPage`: new `PageType.get_ReadingOrderGroups()` to retrieve recursive RO as dict
- ocrd.cli.workspace `server`: add subcommands `reload` and `save`
- METS Server: export and delegate `physical_pages`
- processor CLI: delegate `--resolve-resource`, too
- `Processor.process_page_file` / `OcrdPageResultImage`: allow `None` besides `AlternativeImageType`

## [3.0.0b4] - 2024-09-02

Fixed:

* `Processor.metadata_location`: `src` workaround respects namespace packages, qurator-spk/eynollah#134
* `Workspace.reload_mets`: handle ClientSideOcrdMets as well

## [3.0.0b3] - 2024-08-30

Added:

* `OcrdConfig.reset_defaults` to reset config variables to their defaults

## [3.0.0b2] - 2024-08-30

Added:
- `Processor.max_workers`: class attribute to control per-page parallelism of this implementation
- `Processor.max_page_seconds`: class attribute to control per-page timeout of this implementation
- `OCRD_MAX_PARALLEL_PAGES` for whether and how many workers should process pages in parallel
- `OCRD_PROCESSING_PAGE_TIMEOUT` for whether and how long processors should wait for single pages
- `OCRD_MAX_MISSING_OUTPUTS` for the maximum rate (fraction) of failed pages tolerated before aborting as if `OCRD_MISSING_OUTPUT=ABORT`
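
The failure-rate threshold in the last item amounts to a simple ratio check. A minimal sketch, assuming a strict comparison and assuming that a non-positive threshold disables the check; the function name `too_many_failures` is hypothetical:

```python
def too_many_failures(n_failed, n_total, max_missing_outputs):
    """Return True if the share of failed pages exceeds the tolerated fraction.

    Assumption of this sketch: a non-positive threshold disables the check.
    """
    if max_missing_outputs <= 0 or n_total <= 0:
        return False
    return n_failed / n_total > max_missing_outputs
```

For example, with a threshold of `0.2`, three failures out of ten pages would trigger an abort, while two out of ten would not.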

Fixed:
- `disableLogging`: also re-instate root logger to Python defaults

## [3.0.0b1] - 2024-08-26

Fixed:
- actually apply CLI `--log-filename`, and show in `--help`
- adapt to Pillow changes
- `ocrd workspace clone`: do pass on `--file-grp` (for download filtering)

Changed:
- :fire: `ocrd_utils`, `ocrd_models`, `ocrd_modelfactory`, `ocrd_validators` and `ocrd_network` are not published
as separate packages anymore, everything is contained in `ocrd` - you should adapt your `requirements.txt` accordingly
- :fire: `Processor.parameter` now a property (attribute always exists, but `None` for non-processing contexts)
- :fire: `Processor.parameter` is now a `frozendict` (contents immutable)
- :fire: `Processor.parameter` is now validated whenever it is set, not just in the constructor
- setting `Processor.parameter` will also trigger (`Processor.shutdown()` and) `Processor.setup()`
- `get_processor(... instance_caching=True)`: use `min(max_instances, OCRD_MAX_PROCESSOR_CACHE)`
- :fire: `Processor.verify` always validates fileGrp cardinalities (because we have `ocrd-tool.json` defaults now)
- :fire: `OcrdMets.add_agent` without positional arguments
- `ocrd bashlib input-files` now uses normal Processor decorator, and gets passed actual `ocrd-tool.json` and tool name
from bashlib's `ocrd__wrap`
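
The `Processor.parameter` items above (property, immutable contents, validation on assignment, re-running `shutdown` and `setup`) can be sketched with a stand-in class. This is not the real `ocrd.Processor`: `SketchProcessor`, its `required` tuple and its `events` list are hypothetical, and `types.MappingProxyType` is used as a stdlib substitute for `frozendict`.

```python
from types import MappingProxyType


class SketchProcessor:
    """Hypothetical stand-in, not the real `ocrd.Processor`."""

    required = ("model",)  # assumed names of required parameters

    def __init__(self):
        self._parameter = None  # None outside of processing contexts
        self.events = []  # records lifecycle calls for illustration

    def setup(self):
        self.events.append("setup")

    def shutdown(self):
        self.events.append("shutdown")

    @property
    def parameter(self):
        return self._parameter

    @parameter.setter
    def parameter(self, value):
        # Validate on every assignment, not only at construction time.
        missing = [key for key in self.required if key not in value]
        if missing:
            raise ValueError(f"missing required parameters: {missing}")
        if self._parameter is not None:
            self.shutdown()  # tear down state built for the old parameters
        # Copy, then wrap read-only so that later mutation attempts fail.
        self._parameter = MappingProxyType(dict(value))
        self.setup()
```

Assigning a new dict replaces the whole mapping and re-runs the lifecycle hooks, whereas `processor.parameter["model"] = ...` raises a `TypeError`.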

Added:
- `Processor.metadata_filename`: expose to make local path of `ocrd-tool.json` in Python distribution reusable+overridable
- `Processor.metadata_location`: expose to make absolute path of `ocrd-tool.json` reusable+overridable
- `Processor.metadata_rawdict`: expose to make in-memory contents of `ocrd-tool.json` reusable+overridable
- `Processor.metadata`: expose to make validated and default-expanded contents of `ocrd-tool.json` reusable+overridable
- `Processor.shutdown`: to shut down processor after processing, optional
- `Processor.max_instances`: class attribute to control instance caching of this implementation
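
How a class-level `max_instances` could combine with a global cache limit into `min(max_instances, OCRD_MAX_PROCESSOR_CACHE)` is sketched below with `functools.lru_cache`. The class, the factory `make_cached_factory`, the fallback value of 128 and the convention that a negative `max_instances` means "no class-level limit" are all assumptions of this sketch, not confirmed details of the library.

```python
import os
from functools import lru_cache


class SketchProcessor:
    """Hypothetical processor class, not the real `ocrd.Processor`."""

    max_instances = 2  # assumed: negative would mean "no class-level limit"

    def __init__(self, model):
        self.model = model


def make_cached_factory(cls, env_var="OCRD_MAX_PROCESSOR_CACHE", default=128):
    # Read the global limit at call time, so the cache size is decided
    # at runtime rather than at import time.
    global_limit = int(os.environ.get(env_var, default))
    class_limit = cls.max_instances
    maxsize = global_limit if class_limit < 0 else min(class_limit, global_limit)

    @lru_cache(maxsize=maxsize)
    def cached(model):
        # Arguments must be hashable to serve as cache keys.
        return cls(model)

    return cached
```

Requesting the same `model` twice returns the same instance; once more distinct models have been requested than the cache holds, the least recently used instance is dropped and rebuilt on demand.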

## [3.0.0a2] - 2024-08-22

Changed:
- :fire: `OcrdPage` as proxy of `PcGtsType` instead of alias; also contains `etree` and `mapping` now
- :fire: `page_from_file`: removed kwarg `with_tree` - use `OcrdPage.etree` and `OcrdPage.mapping` instead
- :fire: `Processor.zip_input_files` now can throw `ocrd.NonUniqueInputFile` and `ocrd.MissingInputFile`
(the latter only if `OCRD_MISSING_INPUT=ABORT`)
- :fire: `Processor.zip_input_files` does not by default use `require_first` anymore
(so the first file in any input file tuple per page can be `None` as well)
- :fire: no more `Workspace.overwrite_mode`, merely delegate to `OCRD_EXISTING_OUTPUT=OVERWRITE`
- :art: improve on docs result for `ocrd_utils.config`

Added:
- :point_right: `OCRD_DOWNLOAD_INPUT` for whether input files should be downloaded before processing
- :point_right: `OCRD_MISSING_INPUT` for how to handle missing input files (**`SKIP`** or `ABORT`)
- :point_right: `OCRD_MISSING_OUTPUT` for how to handle processing failures (**`SKIP`** or `ABORT` or `COPY`);
`COPY` behaves like `ocrd-dummy` for the failed page(s)
- :point_right: `OCRD_EXISTING_OUTPUT` for how to handle existing output files (**`SKIP`** or `ABORT` or `OVERWRITE`)
- new CLI option `--debug` as short-hand for `ABORT` choices above
- `Processor.logger` set up by constructor already (for re-use by processor implementors)
- `default`-expand and validate `ocrd_tool.json` in `Processor` constructor, log invalidities
- handle JSON `deprecation` in `ocrd_tool.json` by reporting warnings
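
The policy variables above can be pictured as a small decision function. The following sketch covers only `OCRD_EXISTING_OUTPUT`; the helper names are hypothetical, and the default is passed in by the caller because this sketch does not assume which value the library falls back to.

```python
CHOICES = ("SKIP", "ABORT", "OVERWRITE")


def existing_output_policy(environ, default, debug=False):
    """Resolve how to treat an output file that already exists."""
    if debug:
        return "ABORT"  # --debug is described as short-hand for ABORT
    policy = environ.get("OCRD_EXISTING_OUTPUT", default).upper()
    if policy not in CHOICES:
        raise ValueError(f"unsupported OCRD_EXISTING_OUTPUT: {policy}")
    return policy


def handle_existing(output_exists, policy):
    """Return the action for one page: 'write' or 'skip', or raise on ABORT."""
    if not output_exists or policy == "OVERWRITE":
        return "write"
    if policy == "SKIP":
        return "skip"
    raise FileExistsError("output exists and policy is ABORT")
```

In practice `environ` would be `os.environ`; passing a plain dict keeps the sketch easy to try out.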

## [3.0.0a1] - 2024-08-15

Changed:
- :fire: Deprecate `Processor.process`
- update spec to v3.25.0, which requires annotating fileGrp cardinality in `ocrd-tool.json`
- :fire: Remove passing non-processing kwargs to `Processor` constructor, add as members
(i.e. `show_help`, `dump_json`, `dump_module_dir`, `list_resources`, `show_resource`, `resolve_resource`)
- :fire: Deprecate passing processing arg / kwargs to `Processor` constructor
(i.e. `workspace`, `page_id`, `input_file_grp`, `output_file_grp`; now all set by `run_processor`)
- :fire: Deprecate passing `ocrd-tool.json` metadata to `Processor` constructor
- `ocrd.processor`: Handle loading of bundled `ocrd-tool.json` generically

Added:
- `Processor.process_workspace`: process a complete workspace, with default implementation
- `Processor.process_page_file`: process an OcrdFile, with default implementation
- `Processor.process_page_pcgts`: process a single OcrdPage, produce a single OcrdPage, required to implement
- `Processor.verify`: handle fileGrp cardinality verification, with default implementation
- `Processor.setup`: to set up processor before processing, optional
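
The layering of the new methods (default implementations that delegate down to one required per-page hook) might look like this in miniature. The sketch mirrors the method names from the list above but does not import `ocrd`: pages are plain strings, the signatures are simplified assumptions, and `run_processor` here is a hypothetical stand-in for the library function of the same name.

```python
class SketchProcessor:
    """Hypothetical base class; signatures are simplified assumptions."""

    def setup(self):
        """Optional one-time initialisation (e.g. loading a model)."""

    def process_workspace(self, pages):
        # Default implementation: iterate over all pages.
        return {page_id: self.process_page_file(page_id, content)
                for page_id, content in pages.items()}

    def process_page_file(self, page_id, content):
        # Default implementation: delegate to the per-page hook.
        return self.process_page_pcgts(content, page_id=page_id)

    def process_page_pcgts(self, content, page_id=None):
        # The only method a subclass is required to implement.
        raise NotImplementedError


class UppercaseProcessor(SketchProcessor):
    def setup(self):
        self.suffix = "!"

    def process_page_pcgts(self, content, page_id=None):
        return content.upper() + self.suffix


def run_processor(processor, pages):
    processor.setup()  # run once before any page is processed
    return processor.process_workspace(pages)
```

A subclass that needs different iteration or I/O behaviour could override the outer methods instead, leaving the per-page hook untouched.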

## [2.68.0] - 2024-08-23

Changed:
@@ -2164,6 +2322,14 @@ Fixed
Initial Release

<!-- link-labels -->
[3.0.0b6]: ../../compare/v3.0.0b6..v3.0.0b5
[3.0.0b5]: ../../compare/v3.0.0b5..v3.0.0b4
[3.0.0b4]: ../../compare/v3.0.0b4..v3.0.0b3
[3.0.0b3]: ../../compare/v3.0.0b3..v3.0.0b2
[3.0.0b2]: ../../compare/v3.0.0b2..v3.0.0b1
[3.0.0b1]: ../../compare/v3.0.0b1..v3.0.0a2
[3.0.0a2]: ../../compare/v3.0.0a2..v3.0.0a1
[3.0.0a1]: ../../compare/v3.0.0a1..v2.67.2
[2.68.0]: ../../compare/v2.68.0..v2.67.2
[2.67.2]: ../../compare/v2.67.2..v2.67.1
[2.67.1]: ../../compare/v2.67.1..v2.67.0
2 changes: 1 addition & 1 deletion Dockerfile
@@ -50,9 +50,9 @@ FROM ocrd_core_base as ocrd_core_test
ARG SKIP_ASSETS
WORKDIR /build/core
COPY Makefile .
COPY .gitmodules .
RUN if test -z "$SKIP_ASSETS" || test $SKIP_ASSETS -eq 0 ; then make assets ; fi
COPY tests ./tests
COPY .gitmodules .
COPY requirements_test.txt .
RUN pip install -r requirements_test.txt
RUN mkdir /ocrd-data && chmod 777 /ocrd-data
46 changes: 4 additions & 42 deletions Makefile
@@ -238,9 +238,9 @@ repo/assets repo/spec: always-update

.PHONY: spec
# Copy JSON Schema, OpenAPI from OCR-D/spec
spec: repo/spec
cp repo/spec/ocrd_tool.schema.yml ocrd_validators/ocrd_validators/ocrd_tool.schema.yml
cp repo/spec/bagit-profile.yml ocrd_validators/ocrd_validators/bagit-profile.yml
spec: # repo/spec
cp repo/spec/ocrd_tool.schema.yml src/ocrd_validators/ocrd_tool.schema.yml
cp repo/spec/bagit-profile.yml src/ocrd_validators/bagit-profile.yml

#
# Assets
@@ -273,7 +273,7 @@ test-logging: assets
cp src/ocrd_utils/ocrd_logging.conf $$tempdir; \
cd $$tempdir; \
$(PYTHON) -m pytest --continue-on-collection-errors -k TestLogging -k TestDecorators $(TESTDIR); \
rm -r $$tempdir/ocrd_logging.conf $$tempdir/.benchmarks; \
rm -r $$tempdir/ocrd_logging.conf $$tempdir/ocrd.log $$tempdir/.benchmarks; \
rm -rf $$tempdir/.coverage; \
rmdir $$tempdir

@@ -401,41 +401,3 @@ docker docker-cuda docker-cuda-tf1 docker-cuda-tf2 docker-cuda-torch:
# Build wheels and source dist and twine upload them
pypi: build
twine upload --verbose dist/ocrd-$(VERSION)*{tar.gz,whl}

pypi-workaround: build-workaround
for dist in $(BUILD_ORDER);do twine upload dist/$$dist-$(VERSION)*{tar.gz,whl};done

# Only in place until v3 so we don't break existing installations
build-workaround: pyclean
cp pyproject.toml pyproject.toml.BAK
cp src/ocrd_utils/constants.py src/ocrd_utils/constants.py.BAK
cp src/ocrd/cli/__init__.py src/ocrd/cli/__init__.py.BAK
for dist in $(BUILD_ORDER);do \
cat pyproject.toml.BAK | sed "s,^name =.*,name = \"$$dist\"," > pyproject.toml; \
cat src/ocrd_utils/constants.py.BAK | sed "s,dist_version('ocrd'),dist_version('$$dist')," > src/ocrd_utils/constants.py; \
cat src/ocrd/cli/__init__.py.BAK | sed "s,package_name='ocrd',package_name='$$dist'," > src/ocrd/cli/__init__.py; \
$(MAKE) build; \
done
rm pyproject.toml.BAK
rm src/ocrd_utils/constants.py.BAK
rm src/ocrd/cli/__init__.py.BAK

# test that the aliased packages work in isolation and combined
test-workaround: build-workaround
$(MAKE) uninstall-workaround
for dist in $(BUILD_ORDER);do \
pip install dist/$$dist-*.whl ;\
ocrd --version ;\
make test ;\
pip uninstall --yes $$dist ;\
done
for dist in $(BUILD_ORDER);do \
pip install dist/$$dist-*.whl ;\
done
ocrd --version ;\
make test ;\
for dist in $(BUILD_ORDER);do pip uninstall --yes $$dist;done

uninstall-workaround:
for dist in $(BUILD_ORDER);do $(PIP) uninstall --yes $$dist;done

40 changes: 31 additions & 9 deletions README.md
@@ -47,17 +47,12 @@ complete stack of OCR-D-related software.

The easiest way to install is via `pip`:

```sh
pip install ocrd
pip install ocrd

# or just the functionality you need, e.g.

pip install ocrd_modelfactory
```

All Python software released by [OCR-D](https://github.com/OCR-D) requires Python 3.8 or higher.

**NOTE** Some OCR-D-Tools (or even test cases) _might_ reveal an unintended behavior if you have specific environment modifications, like:
> **NOTE** Some OCR-D tools (or even test cases) _might_ exhibit unintended behavior if your environment has specific modifications, such as:
* using a custom build of [ImageMagick](https://github.com/ImageMagick/ImageMagick), whose format delegates are different from what OCR-D supposes
* custom Python logging configurations in your personal account

@@ -82,7 +77,6 @@ Almost all behaviour of the OCR-D/core software is configured via CLI options an

Some parts of the software are configured via environment variables:

* `OCRD_METS_CACHING`: If set to `true`, access to the METS file is cached, speeding up in-memory search and modification.
* `OCRD_PROFILE`: This variable configures the built-in CPU and memory profiling. If empty, no profiling is done. Otherwise expected to contain any of the following tokens:
* `CPU`: Enable CPU profiling of processor runs
* `RSS`: Enable RSS memory profiling
@@ -95,18 +89,46 @@ Some parts of the software are configured via environment variables:
* `XDG_CONFIG_HOME`: Directory to look for `./ocrd/resources.yml` (i.e. `ocrd resmgr` user database) – defaults to `$HOME/.config`.
* `XDG_DATA_HOME`: Directory to look for `./ocrd-resources/*` (i.e. `ocrd resmgr` data location) – defaults to `$HOME/.local/share`.

* `OCRD_DOWNLOAD_RETRIES`: Number of times to retry failed attempts for downloads of workspace files.
* `OCRD_DOWNLOAD_RETRIES`: Number of times to retry failed attempts for downloads of resources or workspace files.
* `OCRD_DOWNLOAD_TIMEOUT`: Timeout in seconds for connecting or reading (comma-separated) when downloading.

* `OCRD_MISSING_INPUT`: How to deal with missing input files (for some fileGrp/pageId) during processing:
* `SKIP`: ignore and proceed with next page's input
* `ABORT`: throw `MissingInputFile` exception

* `OCRD_MISSING_OUTPUT`: How to deal with missing output files (for some fileGrp/pageId) during processing:
* `SKIP`: ignore and proceed processing next page
* `COPY`: fall back to copying input PAGE to output fileGrp for page
* `ABORT`: re-throw whatever caused processing to fail

* `OCRD_MAX_MISSING_OUTPUTS`: Maximum rate of skipped/fallback pages among all processed pages before aborting (decimal fraction, ignored if negative).

* `OCRD_EXISTING_OUTPUT`: How to deal with already existing output files (for some fileGrp/pageId) during processing:
* `SKIP`: ignore and proceed processing next page
* `OVERWRITE`: force writing result to output fileGrp for page
* `ABORT`: re-throw `FileExistsError` exception


* `OCRD_METS_CACHING`: Whether to enable in-memory storage of OcrdMets data structures for speedup during processing or workspace operations.

* `OCRD_MAX_PROCESSOR_CACHE`: Maximum number of processor instances (for each set of parameters) to be kept in memory (including loaded models) for processing workers or processor servers.

* `OCRD_MAX_PARALLEL_PAGES`: Maximum number of processor threads for page-parallel processing (within each Processor's selected page range, independent of the number of Processing Workers or Processor Servers). If set to a value `>1`, a METS Server must be used for METS synchronisation.

* `OCRD_PROCESSING_PAGE_TIMEOUT`: Timeout in seconds for processing a single page. If set to a value `>0` and the timeout is exceeded, the page is handled as configured by `OCRD_MISSING_OUTPUT`.

* `OCRD_NETWORK_SERVER_ADDR_PROCESSING`: Default address of Processing Server to connect to (for `ocrd network client processing`).
* `OCRD_NETWORK_SERVER_ADDR_WORKFLOW`: Default address of Workflow Server to connect to (for `ocrd network client workflow`).
* `OCRD_NETWORK_SERVER_ADDR_WORKSPACE`: Default address of Workspace Server to connect to (for `ocrd network client workspace`).
* `OCRD_NETWORK_RABBITMQ_CLIENT_CONNECT_ATTEMPTS`: Number of attempts for a worker to create its queue. Helpful if the rabbitmq-server needs time to be fully started.

* `OCRD_NETWORK_CLIENT_POLLING_SLEEP`: How many seconds to sleep before trying `ocrd network client` again.
* `OCRD_NETWORK_CLIENT_POLLING_TIMEOUT`: Timeout for a blocking `ocrd network client` (in seconds).

* `OCRD_NETWORK_SOCKETS_ROOT_DIR`: The root directory where all METS Server-related socket files are created.
* `OCRD_NETWORK_LOGS_ROOT_DIR`: The root directory where all `ocrd_network`-related log files are stored.
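
As a rough illustration of how the error-handling variables above might be combined, the following sketch configures a fault-tolerant batch run. The concrete values (retry count, timeouts, the 10% threshold) and the fileGrp names in the commented-out invocation are assumptions chosen for the example, not recommended defaults.

```sh
# Retry failed downloads a few times; timeout is given as "connect,read" seconds
# (values are illustrative assumptions)
export OCRD_DOWNLOAD_RETRIES=3
export OCRD_DOWNLOAD_TIMEOUT=10,20

# Skip pages whose input is missing instead of raising MissingInputFile
export OCRD_MISSING_INPUT=SKIP

# If a page fails (or exceeds the page timeout), copy the input PAGE to the output fileGrp
export OCRD_MISSING_OUTPUT=COPY
export OCRD_PROCESSING_PAGE_TIMEOUT=300

# Abort the whole run once more than 10% of the pages were skipped or copied
export OCRD_MAX_MISSING_OUTPUTS=0.1

# Overwrite results from an earlier run instead of failing with FileExistsError
export OCRD_EXISTING_OUTPUT=OVERWRITE

# Any processor started from this shell picks up the settings, for example
# (fileGrp names are hypothetical):
# ocrd-dummy -I OCR-D-IMG -O OCR-D-DUMMY
```

With `COPY` as the fallback, every page still has an entry in the output fileGrp, so downstream steps can proceed, while `OCRD_MAX_MISSING_OUTPUTS` keeps a systematically failing run from silently producing mostly unprocessed output.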



## Packages
