[XLA:GPU] Add support for NCCL ncclCommInitRankScalable API #21273

Open
wants to merge 2 commits into
base: main
from

nvcastet

ncclCommInitRankScalable allows a communicator to be initialized through multiple roots instead of a single one, which improves initialization performance at large scale.
The maximum number of ranks assigned to each root during NCCL communicator initialization can be tuned via --xla_gpu_nccl_init_max_rank_per_root_ratio. The default is 128 ranks per root.
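
As a rough illustration of what the ratio means, the sketch below derives a root count from the communicator size, assuming the number of roots is simply the rank count divided by the ratio, rounded up. That rounding rule and the helper name `NumRoots` are assumptions made for this example, not code from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not taken from the PR): number of roots needed so that
// no root is responsible for more than `max_ranks_per_root` ranks.
inline int64_t NumRoots(int64_t nranks, int64_t max_ranks_per_root) {
  assert(nranks > 0 && max_ranks_per_root > 0);
  // Integer ceiling division: ceil(nranks / max_ranks_per_root).
  return (nranks + max_ranks_per_root - 1) / max_ranks_per_root;
}
```

Under this assumption, a 1024-rank communicator with the default of 128 ranks per root would use 8 roots, while any communicator of 128 ranks or fewer would keep a single root.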

nvcastet force-pushed the ncclCommInitRankScalable branch from 40286ab to d2fb81e (January 10, 2025 18:11)
nvcastet (Author)

CC @ezhulenev

nvcastet force-pushed the ncclCommInitRankScalable branch 3 times, most recently from 8a46f76 to 86e50af (January 10, 2025 18:23)
nvcastet force-pushed the ncclCommInitRankScalable branch from 86e50af to 98ef02d (January 10, 2025 18:27)
// Returns true if this clique is a subset of `other`: both cliques have the
// same `stream_id` and all clique devices are part of `other` clique.
bool IsSubsetOf(const CliqueKey& other) const final;

// Returns a copy of the key (subkey) with the root device properly set given
// nroots and root_seq_id. The subkey is used to generate a NcclCliqueId.
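
To make the comment above more concrete, here is one way a root could be chosen from `nroots` and `root_seq_id`. This is an assumption for illustration only, not the PR's confirmed logic: the ranks are split into `nroots` contiguous groups whose sizes differ by at most one, and the first rank of group `root_seq_id` acts as that group's root. `RootRankIndex` is a hypothetical name; the returned value is an index into the clique's device list.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: index of the rank that serves as root number
// `root_seq_id` when `nranks` ranks are split into `nroots` contiguous groups.
inline int64_t RootRankIndex(int64_t nranks, int64_t nroots,
                             int64_t root_seq_id) {
  assert(nroots > 0 && nroots <= nranks);
  assert(root_seq_id >= 0 && root_seq_id < nroots);
  const int64_t base = nranks / nroots;       // minimum group size
  const int64_t remainder = nranks % nroots;  // groups that get one extra rank
  if (root_seq_id < remainder) {
    // All preceding groups have base + 1 ranks.
    return root_seq_id * (base + 1);
  }
  // `remainder` larger groups come first, followed by groups of size `base`.
  return remainder * (base + 1) + (root_seq_id - remainder) * base;
}
```

With 10 ranks and 3 roots, this scheme would place the roots at indices 0, 4, and 7. Each root's subkey would then yield its own clique id, and the resulting ids could be passed together to ncclCommInitRankScalable.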

ezhulenev (Member), Jan 10, 2025

I feel that a bit more detail would be helpful. I don't have the context, but the problem is that after reading this documentation I still don't have any idea what's going on :) (cc @frgossen)

nvcastet force-pushed the ncclCommInitRankScalable branch 3 times, most recently from 06fcca9 to bb14e02 (January 10, 2025 20:49)
nvcastet force-pushed the ncclCommInitRankScalable branch from bb14e02 to f146a48 (January 10, 2025 20:55)
nvcastet requested a review from ezhulenev (January 10, 2025 20:57)