Multi-node Multi-card MPIJob #1702
base: main
Conversation
examples/kubernetes/ci/multi-node-multi-card-lora-clm-values.yaml
Please review the comments.
LGTM.
@regisss could you help review this PR?
limits:
  # -- Specify the number of Gaudi card(s)
  cpu: 16
  habana.ai/gaudi: 2
Has this been tested and validated to run on fewer than 8 cards across multiple nodes?
@ltran5991 it has been tested on 2 nodes with one card each.
How about 2 nodes with 2 cards each?
> How about 2 nodes with 2 cards each?

@sramakintel, could you test with 2 nodes / 2 cards each and confirm the code works? Thanks.
Has this been validated to run with fewer than 8 cards on multiple systems?
What does this PR do?
This PR adds support for multi-node and multi-card fine-tuning on Intel Gaudi devices using the MPI Operator and Helm charts.
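For the 2-node / 2-card-per-node scenario discussed in the review thread, a values override might look like the sketch below. The key names other than the `limits` block shown in the diff (`replicas` in particular) are illustrative assumptions, not the chart's confirmed schema; check the chart's actual `values.yaml` for the real keys.

```yaml
# Hypothetical Helm values override for a 2-node, 2-card-per-node MPIJob.
# Only the `limits` block mirrors this PR's diff; other keys are assumptions.
replicas: 2                # number of worker pods, one per node (assumed key)
resources:
  limits:
    # -- Specify the number of Gaudi card(s)
    cpu: 16
    habana.ai/gaudi: 2     # Gaudi cards requested per worker pod
```

With a layout like this, total world size would be `replicas × habana.ai/gaudi` (here 2 × 2 = 4 cards), which is the configuration the reviewers asked to have validated.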
Fixes # (issue)
Before submitting