Test trial resumability with PBT & Hyperband #20

Open
Delaunay opened this issue Nov 9, 2022 · 15 comments


@Delaunay
Collaborator

Delaunay commented Nov 9, 2022

No description provided.

@MaximilienLC

Hey, thanks for the great package!

I was wondering if you had any update on this issue. Is it currently possible to resume trials, with the feature simply not yet properly tested?

@bouthilx
Member

bouthilx commented Dec 4, 2023

Hi! PBT is a bit tricky to use with Hydra because it relies on checkpoints being copied from one trial to another, while Hydra creates a new working dir for each trial and sets it as the working directory for the duration of the trial's execution. It should be possible to use PBT (and Hyperband with checkpointing) if you set your working dir explicitly in the Hydra config.
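
To illustrate why the directory matters, here is a minimal sketch of a training loop that resumes from a checkpoint stored in its working directory. The file name `checkpoint.json` and the helpers `load_checkpoint`, `save_checkpoint` and `train` are hypothetical; none of them is part of Hydra or Oríon.

```python
import json
from pathlib import Path

CHECKPOINT_NAME = "checkpoint.json"  # hypothetical file name


def load_checkpoint(workdir: Path) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    path = workdir / CHECKPOINT_NAME
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0, "value": 0.0}


def save_checkpoint(workdir: Path, state: dict) -> None:
    """Write the state to the checkpoint file, creating the directory if needed."""
    workdir.mkdir(parents=True, exist_ok=True)
    (workdir / CHECKPOINT_NAME).write_text(json.dumps(state))


def train(workdir: Path, x: float, max_epoch: int) -> dict:
    """Toy training loop that continues from the last saved epoch."""
    state = load_checkpoint(workdir)
    for epoch in range(state["epoch"], max_epoch):
        state = {"epoch": epoch + 1, "value": state["value"] + x * x}
        save_checkpoint(workdir, state)
    return state
```

If the same trial is rerun in the same `workdir`, only the missing epochs are computed. If the working directory changes between reruns, `load_checkpoint` finds nothing and training silently restarts from epoch 0.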

@Delaunay Is this something you tested yet?

@Delaunay
Collaborator Author

Delaunay commented Dec 5, 2023

As bouthilx pointed out, you need to control the directory so the checkpoint can be found between reruns.

Maybe something like this would work:

hydra:
  sweep:
    dir: multirun/
    subdir: ${hydra.sweeper.experiment.name}/${hydra.sweeper.experiment.paramhash}

This way all the HPO runs will end up in the same folder.
It will create one folder per experiment name and one subfolder per trial parameter config,
so each trial should be able to find its own checkpoint.
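
As a rough illustration of that layout, the sketch below derives a directory from the experiment name and a hash of the parameters, so a rerun with the same parameters lands in the same folder. The functions `param_hash` and `trial_dir` are hypothetical stand-ins, and the hashing scheme is an assumption; it is not the code that produces `paramhash` in the plugin.

```python
import hashlib
import json
from pathlib import Path


def param_hash(params: dict) -> str:
    """Stable hash of a parameter dict (hypothetical stand-in for paramhash)."""
    # Sorting the keys makes the hash independent of insertion order.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def trial_dir(sweep_dir: str, experiment_name: str, params: dict) -> Path:
    """Mirror the layout: sweep dir / experiment name / parameter hash."""
    return Path(sweep_dir) / experiment_name / param_hash(params)
```

For example, `trial_dir("multirun", "my-exp", {"x": 1.5})` returns the same path every time it is called with the same parameters, which is what lets a rerun find its earlier checkpoint.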

@bouthilx
Member

bouthilx commented Dec 7, 2023

Actually, this would work for ASHA/Hyperband but not for PBT. When using PBT, the trial working directory, which corresponds to ${trial.working_dir}, is copied from the parent trial to the child trial. @Delaunay Do we have support for trial.working_dir in this plugin?

@Delaunay
Collaborator Author

Delaunay commented Dec 7, 2023

In the case of Hydra, shouldn't trial.working_dir be the current working directory that Hydra sets?

@bouthilx
Member

bouthilx commented Dec 7, 2023

No, it's determined based on the experiment's working dir: https://github.com/Epistimio/orion/blob/develop/src/orion/core/worker/trial.py#L353

@Delaunay
Collaborator Author

Delaunay commented Dec 7, 2023

It can easily be added: #35

@Delaunay
Collaborator Author

Delaunay commented Dec 7, 2023

@bouthilx
With the latest version, I believe this should work for PBT:

hydra:
  sweep:
    dir: multirun/${hydra.sweeper.experiment.name}
    subdir: ${hydra.sweeper.experiment.trial_working_dir}/

@MaximilienLC

MaximilienLC commented Feb 10, 2024

Hey, sorry for the late reply. I tried making it work w/ this simple example:

defaults:
  - override hydra/sweeper: orion

hydra:
  sweep:
    dir: multirun/${hydra.sweeper.experiment.name}
    subdir: ${hydra.sweeper.experiment.trial_working_dir}
  sweeper:
    params:
      x: "uniform(-10, 10)"
      epoch: "fidelity(low=1, high=2, base=1)"
    algorithm:
      type: pbt
      config:
        seed: 0
        population_size: 5
        generations: 1
x: 0
epoch: 0

import hydra
from omegaconf import DictConfig


@hydra.main(config_path=".", config_name="config")
def main(cfg: DictConfig) -> float:
    result = (cfg.x * cfg.x) ** cfg.epoch
    with open(f"{cfg.x}+{int(cfg.epoch)}.txt", "w") as f:
        f.write(str(result))
    return result


if __name__ == "__main__":
    main()

However, every trial's trial_working_dir is different (and equal to trial).
Example output hydra.yaml:

...
trial: 4e2287fd0fedb2f7da85735f1599eff5
paramhash: b135edc909ac21e8304df5ca1bd363c5
uuid: 88208312c86311eeb13b0242ac110002
trial_working_dir: 4e2287fd0fedb2f7da85735f1599eff5
...

paramhash looks the same for trials that do not change parameters though.
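
One way to read this output is sketched below, under the assumption that the trial id hashes every parameter including the fidelity (`epoch` here), while paramhash leaves the fidelity out. That assumption is drawn from the behaviour described above, not from Oríon's actual hashing code, and the function names are hypothetical.

```python
import hashlib
import json


def _digest(params: dict) -> str:
    """MD5 of a canonical JSON encoding of the parameters."""
    return hashlib.md5(json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest()


def full_hash(params: dict) -> str:
    """Hash over every parameter, fidelity included (stand-in for a trial id)."""
    return _digest(params)


def hash_without_fidelity(params: dict, fidelity_key: str = "epoch") -> str:
    """Hash that skips the fidelity dimension (stand-in for paramhash)."""
    return _digest({k: v for k, v in params.items() if k != fidelity_key})
```

Under this reading, the same `x` at `epoch=1` and `epoch=2` gives two different trial ids but a single paramhash, so a paramhash-based directory is shared across fidelities. A PBT child whose `x` has been perturbed gets a new paramhash as well, and therefore a new, empty directory.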

@bouthilx
Member

This is expected. What should happen is that Oríon copies the dir over from the parent trial to the child one, so that if you have a checkpoint there, it is available in the child trial directory (this happens here: https://github.com/Epistimio/orion/blob/develop/src/orion/client/runner.py#L191). Do you see an empty directory instead?

@Delaunay Is the hydra plugin using orion's Runner? If not, then it probably does not call the function prepare_trial_working_dir, which is responsible for this copy from the parent trial dir to the child trial dir.

@MaximilienLC

Yeah, empty with ${hydra.sweeper.experiment.trial_working_dir} but not with ${hydra.sweeper.experiment.paramhash}.

@Delaunay
Collaborator Author

No, it does not call the runner, since Hydra has its own launcher that launches the workers.

@MaximilienLC

Alrighty, do y'all think it can be worked around?

@Delaunay
Collaborator Author

We would need to implement the copy for the algo here right before the experiment is launched
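
A minimal sketch of what such a copy step could look like, using only the standard library. The function `prepare_child_dir` and its skip-if-not-empty policy are assumptions made for illustration; this is not Oríon's `prepare_trial_working_dir`, and how the sweeper would obtain the parent directory is left open.

```python
import shutil
from pathlib import Path
from typing import Optional


def prepare_child_dir(parent_dir: Optional[Path], child_dir: Path) -> bool:
    """Copy the parent's trial directory into the child's before launch.

    Returns True if files were copied, False otherwise.
    """
    child_dir.mkdir(parents=True, exist_ok=True)
    if any(child_dir.iterdir()):
        # The child already has files (e.g. a resumed trial): keep them.
        return False
    if parent_dir is None or not parent_dir.is_dir():
        # No parent (first generation) or nothing to copy.
        return False
    # dirs_exist_ok requires Python 3.8+.
    shutil.copytree(parent_dir, child_dir, dirs_exist_ok=True)
    return True
```

Skipping a non-empty child directory is a design choice in this sketch: it avoids overwriting a newer checkpoint of a resumed trial with the older one from its parent.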

@MaximilienLC

Got it, based on your responses I'm guessing that's not on the timeline. I'll make a PR to add a warning on the README that PBT-like algorithms aren't functional for now.

This was referenced Feb 14, 2024