Processor parameters validator overwrites #1025

Open
MehmedGIT opened this issue Mar 26, 2023 · 3 comments

Comments

@MehmedGIT
Contributor

The processor parameter validator overwrites the parameters of the passed object with defaults when they are missing. Judging from the written tests, this is the expected and desired behavior, since it simplifies the CLI. However, it is not ideal when only validation of the passed arguments is required. The simplest solution (though not an obvious one to me) would be to have a way to disable the default overwriting.

Context:
I was experimenting with adding more validation to the OtoN converter (the version in the repo itself is currently very outdated), based on the core validators:

from ocrd_validators import ParameterValidator

# ProcessorCallArguments comes from the OtoN converter; as used here it exposes
# the parameters and the processor's ocrd-tool JSON
def validate_processor_params_with_core(processor_args: ProcessorCallArguments):
    print(f"processor args before: {processor_args}")
    print(f"params before: {processor_args.parameters}")
    report = ParameterValidator(processor_args.ocrd_tool_json).validate(processor_args.parameters)
    print(f"params after: {processor_args.parameters}")
    print(f"processor args after: {processor_args}")
    if not report.is_valid:
        raise Exception(report.errors)
    return report

I got:

processor args before: ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN
params before: {}
params after: {'dpi': 0, 'grayscale': False, 'level-of-operation': 'page', 'maxskew': 0.0, 'method': 'ocropy', 'noise_maxsize': 0, 'threshold': 0.5}
processor args after: ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN -p '{"dpi": 0, "grayscale": false, "level-of-operation": "page", "maxskew": 0.0, "method": "ocropy", "noise_maxsize": 0, "threshold": 0.5}'

Since this would populate the produced Nextflow files with undesired parameter values, I have to back up the parameters with a deep copy and restore them after validation (which is doable and not a big deal).
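
A minimal sketch of that workaround (the function name is just illustrative; it assumes the same ProcessorCallArguments object as above):

from copy import deepcopy
from ocrd_validators import ParameterValidator

def validate_params_without_defaults(processor_args):
    # keep a deep copy so the defaults injected by the validator
    # do not leak into the generated Nextflow file
    backup = deepcopy(processor_args.parameters)
    report = ParameterValidator(processor_args.ocrd_tool_json).validate(processor_args.parameters)
    processor_args.parameters = backup
    return report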

Processing Server #974 could also benefit from a pure validation step of the received input against the ocrd-tool schema, without increasing the size of the OcrdProcessingMessage. Moreover, the Processing Server could then refuse right away to forward processing messages that would fail inside the Processing Worker due to wrong processor parameters.

Another point: in the OtoN converter I am also using a simpler approach to validate processor parameters against their ocrd-tool JSON without having to run the processors or even have them installed in the environment, simply by having the ocrd_all_tool.json file. IMO, this could also be useful for Processing Server #974! The drawback, however, is that ocrd_all_tool.json may not always reflect the latest released version of every processor and has to be updated with every ocrd_all release. IMO, this is still preferable to executing processors to obtain their tool JSON, or to requiring them to be installed in the environment.
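
For illustration, a rough sketch of such offline validation (not the actual OtoN code; the file layout, executable name mapped to its ocrd-tool section, is an assumption here):

import json

from ocrd_validators import ParameterValidator

def validate_offline(executable, parameters, tool_file="ocrd_all_tool.json"):
    # look up the processor's ocrd-tool section in the cached file
    with open(tool_file, encoding="utf-8") as f:
        all_tools = json.load(f)
    if executable not in all_tools:
        raise ValueError(f"no ocrd-tool entry for {executable}")
    # validate a copy so the caller's parameters are not filled with defaults
    return ParameterValidator(all_tools[executable]).validate(dict(parameters))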

@bertsky
Collaborator

bertsky commented Mar 26, 2023

Since this would populate the produced Nextflow files with undesired parameter values, I have to back up the parameters with a deep copy and restore them after validation (which is doable and not a big deal).

Or you could just make a shallow copy, dict(processor_args.parameters), and pass that to the validator.
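
For example (just a sketch of the same call as above):

report = ParameterValidator(processor_args.ocrd_tool_json).validate(dict(processor_args.parameters))
# processor_args.parameters itself stays untouched; only the copy receives the defaults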

BTW, for workflow validation also have a look at TaskSequence.

Processing Server #974 could also benefit from a pure validation step of the received input against the ocrd-tool schema, without increasing the size of the OcrdProcessingMessage. Moreover, the Processing Server could then refuse right away to forward processing messages that would fail inside the Processing Worker due to wrong processor parameters.

I agree – the error should be reported as early as possible.

But keep in mind that sometimes only the processor instance itself really knows whether parameters do actually work. For example, there might be semantic constraints between parameters that are not backed by the schema, or resource files may have to be loaded (or at least attempted). So early validation is good, but not sufficient – errors can still arise.

I don't see the problem with querying the tool JSON from the processors at runtime though. Caching such tool JSONs can indeed cause inconsistency when processors get re-installed during the server runtime. But that's nothing that should keep us from doing it that way IMO.

@MehmedGIT
Contributor Author

Or you could just make a shallow copy, dict(processor_args.parameters), and pass that to the validator.

Yes, that's also possible and even more efficient.

BTW, for workflow validation also have a look at TaskSequence.

This is exactly where I looked, as an entry point from the CLI of ocrd process, before starting to think about how to validate with core. However, I have one main issue with TaskSequence: it validates workflow data at run-time (which also requires access to the workspace METS). The idea is to validate as many things as possible without having the workspace METS and the executables at hand.

That said, there are of course drawbacks, such as not being able to validate an input file group if it was not mentioned as an output group in a previous step. This is problematic for GT file groups or, in general, for partial workflows.

But keep in mind that sometimes only the processor instance itself really knows whether parameters do actually work ... So early validation is good, but not sufficient – errors can still arise.

Sure, the idea is to cover as much as possible and to fail early where possible, which may never reach 100%. At run-time the processor validates the parameters against its tool JSON anyway.

I don't see the problem with querying the tool JSON from the processors at runtime though. Caching such tool JSONs can indeed cause inconsistency when processors get re-installed during the server runtime. But that's nothing that should keep us from doing it that way IMO.

Since we're not trying to cover 100%, this could still be a potential solution for the Processing Server.

@bertsky
Collaborator

bertsky commented Mar 27, 2023

However, I have one main issue with TaskSequence: it validates workflow data at run-time (which also requires access to the workspace METS). The idea is to validate as many things as possible without having the workspace METS and the executables at hand.

Indeed, we should probably rewrite this class to support both modes: online (against a concrete METS) and offline.

That said, there are of course drawbacks, such as not being able to validate an input file group if it was not mentioned as an output group in a previous step. This is problematic for GT file groups or, in general, for partial workflows.

True. But still, in offline mode you can check for things like multiple steps trying to write to the same fileGrp, or the overall workflow reading from multiple input fileGrps (which should not be allowed, except in partial workflows).
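
A rough sketch of what such offline checks could look like (assuming task objects that expose input_file_grps and output_file_grps lists, as the parsed ocrd process task strings do):

def check_filegrps_offline(tasks):
    errors = []
    produced = set()
    external_inputs = set()
    for task in tasks:
        # input fileGrps not produced by a previous step must come from outside
        for grp in task.input_file_grps:
            if grp not in produced:
                external_inputs.add(grp)
        # no two steps may write to the same output fileGrp
        for grp in task.output_file_grps:
            if grp in produced:
                errors.append(f"fileGrp {grp} is written by more than one step")
            produced.add(grp)
    if len(external_inputs) > 1:
        errors.append(f"workflow reads from multiple external fileGrps: {sorted(external_inputs)}")
    return errors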

BTW, in the future we might add more granular checks of workflow dependencies. For example, processors currently have requirements on the kind of image features they consume. The same could be done for structural features (e.g. "needs TextLine", "needs @orientation"). Of course, we would first need a spec for how processors must describe which features they consume and produce.
