
Multi-node Multi-card MPIJob #1702

Open

sramakintel wants to merge 10 commits into main
Conversation


@sramakintel commented Jan 17, 2025

What does this PR do?

This PR adds support for multi-node and multi-card fine-tuning on Intel Gaudi devices using the MPI Operator and Helm charts.

Fixes # (issue)
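
For context, here is a minimal sketch of the kind of MPIJob manifest such a Helm chart would render for a 2-node, 2-cards-per-node run (the metadata name and image are illustrative placeholders, not the chart's actual output):

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: gaudi-finetune                  # hypothetical name
spec:
  slotsPerWorker: 2                     # MPI slots per worker pod = Gaudi cards per node
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: <training-image>   # placeholder
    Worker:
      replicas: 2                       # number of nodes
      template:
        spec:
          containers:
            - name: worker
              image: <training-image>   # placeholder
              resources:
                limits:
                  habana.ai/gaudi: 2    # Gaudi cards requested per node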

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@sramakintel sramakintel requested a review from regisss as a code owner January 17, 2025 18:35

@gera-aldama left a comment


Please review comments.


@gera-aldama left a comment


LGTM.

@sramakintel (author) left a comment


@regisss could you help review this PR?

limits:
  # -- Specify the number of Gaudi card(s)
  cpu: 16
  habana.ai/gaudi: 2

@ltran5991 commented

Has this been tested and validated to run on < 8 cards on multiple nodes?

@sramakintel (author) commented Jan 24, 2025


@ltran5991 it has been tested on 2 nodes with one card each


How about 2 nodes with 2 cards each?

A contributor commented

> How about 2 nodes with 2 cards each?

@sramakintel, could you test with 2 nodes / 2 cards each and confirm the code works? Thanks.
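
For reference, a 2-node / 2-cards-each run would presumably be expressed through the chart values along these lines (the worker/replicas and resources key paths are assumptions extrapolated from the snippet above, not verified against the chart):

# hypothetical values override: 2 nodes x 2 Gaudi cards each
worker:
  replicas: 2            # number of worker pods, one per node
resources:
  limits:
    cpu: 16
    habana.ai/gaudi: 2   # Gaudi cards requested per node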

@ltran5991 left a comment


Has this been validated to run with fewer than 8 cards on multiple systems?
