[NeMo-UX] Support load_strictness
#10612
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: ashors1 <[email protected]>
Thanks!
PL does this like this, so I feel we should adopt the same design and not add a custom arg to our strategy.
If we could propagate this flag to dist_checkpointing.load, then using it would be ideal.
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
Revisiting this PR. I don't think this is quite what we want. It looks like PTL's
Signed-off-by: ashors1 <[email protected]>
Signed-off-by: ashors1 <[email protected]>
…t-load-strictness
…ry workaround Signed-off-by: ashors1 <[email protected]>
nemo/lightning/_strategy_lib.py
Outdated
@@ -516,6 +516,19 @@ def load_model_state_dict(megatron_parallel, checkpoint: Mapping[str, Any], stri
    from megatron.core import parallel_state
    from megatron.core.dist_checkpointing.validation import StrictHandling, parse_strict_flag

    ## convert from StrictHandling to bool for PTL
    if os.environ.get("MCORE_STRICT_HANDLING") is not None:
Let's avoid such logic, it will be terrible to debug later on.
What's the reason this can't be passed by argument?
My previous approach was using PTL's strict_loading to control mcore load strictness, but that required overwriting PTL's getter and setter because we want to allow strict_loading to be a string, while PTL only allows bool. @marcromeyn was opposed to overwriting the getter and setter. He is working on a separate PR that should make it easier to control load_strictness. This PR is intended as a stopgap solution until that PR is in.
@marcromeyn do you have any comments on the current approach?
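To illustrate the constraint described in this comment, here is a minimal sketch using a stand-in class. It is not PTL's actual code: the class name BoolOnlyModule and the mode string "log_all" are assumptions made for illustration. It shows why a property whose getter only ever returns a bool cannot carry a string-valued strictness mode unless the getter and setter are overridden.

```python
class BoolOnlyModule:
    """Hypothetical stand-in for a module exposing a bool-only strict_loading property."""

    def __init__(self) -> None:
        # None means "not set by the user".
        self._strict_loading = None

    @property
    def strict_loading(self) -> bool:
        # The getter collapses whatever was stored into a bool.
        return self._strict_loading in (None, True)

    @strict_loading.setter
    def strict_loading(self, value: bool) -> None:
        self._strict_loading = value


module = BoolOnlyModule()
module.strict_loading = "log_all"  # a string-valued mode (assumed name)
print(module.strict_loading)  # only a bool comes back; the string is lost
```

With a property shaped like this, the string would have to travel through some other channel, which is what the rest of the thread discusses.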
I would prefer if it's set in a global var as opposed to an environment variable.
Discussed with @marcromeyn offline and we decided it would be best to pass the variable as an argument into MegatronStrategy for now. Please take a look at the latest changes and let me know what you think.
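A minimal sketch of the argument-based approach described in this comment, under stated assumptions: SketchStrategy, its load_strictness parameter, the Strictness enum, and the choice of which modes count as strict are all hypothetical stand-ins, not NeMo's or Megatron Core's actual classes or values. The setting is stored on the strategy and converted to a bool only where a bool-only API needs it, so nothing is read from the environment.

```python
from enum import Enum
from typing import Union


class Strictness(Enum):
    """Hypothetical stand-in for a string-valued strictness enum."""

    RAISE_ALL = "raise_all"
    LOG_ALL = "log_all"
    IGNORE_ALL = "ignore_all"


# Modes this sketch treats as "strict" when a plain bool is required (assumption).
_STRICT_MODES = {Strictness.RAISE_ALL}

StrictSetting = Union[bool, str, Strictness, None]


def to_bool_strict(strict: StrictSetting) -> bool:
    """Collapse a strictness setting into the bool that a bool-only API expects."""
    if strict is None:
        return True  # this sketch defaults to strict loading
    if isinstance(strict, bool):
        return strict
    if isinstance(strict, str):
        strict = Strictness(strict)  # raises ValueError for unknown names
    return strict in _STRICT_MODES


class SketchStrategy:
    """Hypothetical strategy that receives the setting as a constructor argument."""

    def __init__(self, load_strictness: StrictSetting = None) -> None:
        self.load_strictness = load_strictness

    def strict_flag_for_ptl(self) -> bool:
        # The full setting stays available on self; only the bool view is derived here.
        return to_bool_strict(self.load_strictness)


strategy = SketchStrategy(load_strictness="log_all")
print(strategy.strict_flag_for_ptl())
```

Compared with an environment variable, the value here is visible at the call site and in the strategy's attributes, which addresses the debugging concern raised earlier in the thread.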
Signed-off-by: ashors1 <[email protected]>
beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base. Your code was analyzed with PyLint. The following annotations have been identified:
Thank you for improving NeMo's documentation!
[🤖]: Hi @ashors1 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals. I'm just a bot, so I'll leave it to you what to do next. //cc @pablo-garay @ko3n1g
LGTM
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Collection: llm
Changelog
Usage
# Add a code snippet demonstrating how to use this