Add robustness towards pipeline.config change of an inited pipeline. #46

Open
varisd opened this issue Aug 12, 2024 · 2 comments
Labels
feature New feature or request help wanted Extra attention is needed question Further information is requested

Comments


varisd commented Aug 12, 2024

Right now, adjusting pipeline.config (generated during pipeline init) has no effect on the existing pipeline; the file serves only an informational purpose.

A feasible feature would be support for editing pipeline.config. This would require:

  • identifying steps whose step.parameters differ from the pipeline.config definition,
  • identifying whether the current directory structure (and the individual step.dependencies) reflects the pipeline graph defined in pipeline.config,
  • checking the edited pipeline.config for correctness.
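
A minimal sketch of the first two checks, assuming each step can be reduced to a dict of parameters plus a list of dependency labels. The `StepSpec` type and the `diff_pipeline` function are hypothetical names used for illustration only; they are not part of the current codebase, and the real step objects may store their dependencies differently.

```python
from dataclasses import dataclass, field


@dataclass
class StepSpec:
    """Hypothetical, simplified view of a step: parameters and dependency labels."""
    parameters: dict
    dependencies: list = field(default_factory=list)


def diff_pipeline(config_steps, existing_steps):
    """Compare the steps defined in pipeline.config with the steps found on disk.

    Both arguments map a step label to a StepSpec.
    Returns a dict of sorted label lists, one list per kind of inconsistency.
    """
    config_labels = set(config_steps)
    existing_labels = set(existing_steps)
    shared = config_labels & existing_labels
    return {
        # Steps that appear only in the edited config.
        "added": sorted(config_labels - existing_labels),
        # Steps that exist on disk but were dropped from the config.
        "removed": sorted(existing_labels - config_labels),
        # Steps present on both sides whose parameters no longer match.
        "changed_parameters": sorted(
            label for label in shared
            if config_steps[label].parameters != existing_steps[label].parameters
        ),
        # Steps present on both sides whose dependency sets no longer match.
        "changed_dependencies": sorted(
            label for label in shared
            if sorted(config_steps[label].dependencies)
            != sorted(existing_steps[label].dependencies)
        ),
    }
```

Whatever the default response ends up being, a report of this shape could serve both "only report" and any kind of reinitialization.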

What should the default response be when inconsistencies are detected? Do we:

  • only report the inconsistencies (and leave the new initialization to the user)?
  • reinitialize "in a dumb way", i.e. delete the step subdirectories in the pipeline directory and reinitialize everything?
  • implement some smart reinitialization of only the parts that were changed?

Personally, I suggest either the first or the second option, since the initialization process is fairly quick and does not have to be super optimized.

bhaddow commented Aug 12, 2024

The third option is the behaviour I would expect of a tool like this (e.g. make, snakemake). If a step has changed, then it should be re-run. If other steps depend on it (according to the DAG), then they should be re-run too. If I fix the evaluation script, I really don't want the pipeline to re-run training.

Options 1 and 2 may even make the situation slightly worse. At the moment, if I make a change to the pipeline config, I have to delete and re-run. If I make a trivial change that I know does not require a re-run, then I can skip the delete. If we implement option 1 or 2, the system may force me to delete and re-run even though I know the delete is not needed.

Connected with this issue: why do we have init and run as separate commands? I would think that run and dry-run would make more sense, where dry-run is the same as run except that it does not execute anything.


varisd commented Aug 12, 2024

You make a good point about re-running modified steps. In practice, code-wise, it should not be much more complicated. The only inefficient part would be reinitializing the steps that have the changed step as a dependency, since the graph only has unidirectional pointers at the moment. Typical graphs are rather small, so that should not be a big issue.
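
A rough sketch of how the downstream steps could be collected despite the unidirectional pointers, assuming the graph is available as a plain mapping from a step label to the labels it depends on. The function name `steps_to_rerun` and the step labels in the usage note are hypothetical.

```python
from collections import deque


def steps_to_rerun(dependencies, changed):
    """Return the changed steps plus every step that transitively depends on them.

    `dependencies` maps a step label to an iterable of the labels it depends on
    (the unidirectional pointers); `changed` is an iterable of step labels.
    """
    # Invert the graph once: label -> steps that directly depend on it.
    dependents = {label: [] for label in dependencies}
    for label, deps in dependencies.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(label)

    # Breadth-first walk along the inverted edges, starting from the changed steps.
    stale = set(changed)
    queue = deque(stale)
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in stale:
                stale.add(child)
                queue.append(child)
    return stale
```

For a chain such as data, train, translate, evaluate, a change to evaluate marks only evaluate as stale, while a change to train marks train, translate and evaluate. The inversion costs one pass over all edges, which should be negligible for graphs of the usual size.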

Regarding renaming "init" to "dry-run" and having the "run" command do both initialization and running: I support this change, since right now there is really no practical reason to execute the pipeline in two steps (init and run). The CLI mainly copied the internal implementation, where separating init and run kind of makes sense.
I propose opening another issue for this change. It only requires some adjustments to the opuspocus_cli/ scripts; the rest of the codebase can and should stay the same.
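
To make the proposal concrete, here is a sketch of how the two commands could share one code path. Everything in it is an assumption for illustration: the program name, the `--pipeline-dir` option, and the `init_pipeline` / `run_pipeline` callables are hypothetical stand-ins, not the actual opuspocus_cli/ interface.

```python
import argparse


def build_parser():
    # Hypothetical program name and option; the real CLI may differ.
    parser = argparse.ArgumentParser(prog="pipeline-cli")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name in ("run", "dry-run"):
        sub = subparsers.add_parser(name)
        sub.add_argument("--pipeline-dir", required=True)
    return parser


def main(argv, init_pipeline, run_pipeline):
    """Initialize for both commands; execute the steps only for `run`."""
    args = build_parser().parse_args(argv)
    # Both commands go through initialization, so dry-run shows what run would do.
    init_pipeline(args.pipeline_dir)
    if args.command == "run":
        run_pipeline(args.pipeline_dir)
    return args.command
```

Passing the two callables in keeps the internal separation of init and run intact and confines the change to the CLI layer.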
