-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rdma: fix cq usage on different domains #700
rdma: fix cq usage on different domains #700
Conversation
When domain-per-thread feature is enabled, a domain is created for the data rail as well as for the control rail. When the control rail is initialized with it's own domain, endpoint of control rail from control domain is bound to cq of data domain. This commit change the code such that only one domain is used for data and control rail in case domain-per-thread feature is enabled. Signed-off-by: Michael Axtmann <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, and I've cherried this into my tree w/o issues.
nccl_net_ofi_rdma_plugin_t *plugin = rdma_endpoint_get_plugin(ep); | ||
|
||
if (plugin->base.domain_per_thread) { | ||
assert(ep_rail->cq == NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is cq expected to be NULL here? If ep_rail
is already initialized, shouldn't ep_rail->cq
be non-NULL, even in domain-per-thread mode?
|
||
/* Initialize libfabric resources of endpoint rails */ | ||
for (int rail_id = 0; rail_id != ep->num_rails; ++rail_id) { | ||
rail_dev = rdma_device_get_rail(device, rail_id); | ||
rail = rdma_endpoint_get_rail(ep, rail_id); | ||
|
||
ret = ep_rail_init(ep, dev_id, rail_id, rail_dev, rail); | ||
ret = get_domain_and_cq_for_endpoint_rail(ep, rail_dev, rail, &domain, &cq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is confusing -- you're calling this function with rail
, which is uninitialized at this point (well, it will be all zeros only because we calloc the array..)
I would prefer a solution where we don't pass an uninitialized rail here. Perhaps splitting get_domain_and_cq_for_endpoint_rail
into two functions, 1) the rail has not been initialized yet, in which case we don't need to pass rail at all, and 2) rail has been initialized, which would really be just the body of the first if
statement of get_domain_and_cq_for_endpoint_rail
.
Closing PR since issue will be resolved as soon as PR #665 gets merged. |
When domain-per-thread feature is enabled, a domain is created for the data rail as well as for the control rail. When the control rail is initialized with it's own domain, endpoint of control rail from control domain is bound to cq of data domain.
This commit change the code such that only one domain is used for data and control rail in case domain-per-thread feature is enabled.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.