Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can we remove the "title" parameter from experiment and only keep "id"? #9

Open
Kipok opened this issue Aug 26, 2024 · 4 comments
Open
Assignees

Comments

@Kipok
Copy link
Collaborator

Kipok commented Aug 26, 2024

Right now, the experiment accepts both title and id parameters which allows to group multiple jobs under the same "title". I would suggest changing this and only keep a single "id" parameter to simplify some workflows as well as make it easier to maintain. Here are some points why I think it will be more convenient:

  • It's already possible to have multiple experiments grouped under the same folder by changing the job_dir parameter of the Tunnel.
  • Currently the title and id are not used consistently across the codebase. E.g. Experiment supports specifying a custom id in init, but then here https://github.com/NVIDIA/NeMo-Run/blob/main/src/nemo_run/run/experiment.py#L223 it assumes a specific format. Also here https://github.com/NVIDIA/NeMo-Run/blob/main/src/nemo_run/run/experiment.py#L243 the id is simply not defined. While it's of course possible to fix all of this, it looks like maintaining both parameters just leads to unnecessary complexity.
  • It's actually making it harder to reuse the experiment "workspace" to store job artifacts. E.g. if I want to keep my checkpoints inside the exp folder and then be able to restart that training, I would have to know exactly what id was used (and it's being created with some random suffix, so it's hard to keep track of that). It would be much simpler for this workflow if the "id" was the experiment name and if I specify the same id, we should assume that I want to reuse that experiment workspace for my job. Maybe we can raise a warning by default to let users know that something potentially dangerous might happen and have some env var or parameter to disable the warning if experiment reuse is fairly common in some workflows (e.g. that's what we do often in our experiments).
@ShriyaPalsamudram
Copy link
Collaborator

I'm worried that dropping the unique id from the experiment could lead to overwriting old logs or the old state. If say we allow using the same directory without id for launching 2 different runs, and each of them use different containers/code - then there's no good way to find that out later.

For checkpoint resume, just using a separate shared folder for checkpoints outside of the experiment directory can achieve that.

@Kipok
Copy link
Collaborator Author

Kipok commented Aug 26, 2024

Right, but then we wouldn't be able to track the artifacts using nemo run. E.g. if I want to add some functionality to download the results or maybe to initialize another experiment with the current one, it's going to be harder to do than if I keep everything inside the workspace folder.

Another point to add is that if I'm running 2 experiments with the same name and it's indeed an error, I'd rather know about that and not have nemo-run silently create a new folder for me. Maybe I didn't actually mean to run that experiment? If you think that default behavior should be to error out, I'm fine with that as long as we add some parameter to allow for override, so that we can use it in our workflows. Would that be a good approach?

@Kipok
Copy link
Collaborator Author

Kipok commented Aug 27, 2024

Another option to consider is to make exp name optional, so that when it's not supplied something random like the current time string is used. This way we don't put restriction on users to always create a unique name if they just want to use nemo.run to launch jobs and don't care where exactly where metadata is stored. While also reusing the folder if experiment name is specified

@marcromeyn
Copy link
Collaborator

I think I would prefer @Kipok's suggestion to make the experiment-name optional (NeMo does the same). The experiment-dir should be self-contained and therefore people should be able to move it, so we should also have a Experiment.from_path classmethod.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants