Add domain object to transports #665
base: master
d1684a6 to c89b019
bot:aws:retest
c89b019 to 21c7c1d
bot:aws:retest
21c7c1d to 66daac7
p5 ub2004 NCCL-Tests are failing CI with:
bot:aws:retest
bot:aws:retest I can't replicate the crash @a-szegel copied, which could have just been a network fault.
Builds 3 - 8 have all had issues with p5's. The main suspect is the PR at this point.
bot:aws:retest
bot:aws:retest
3f6e828 to 23cae5f
bot:aws:retest
Switching to draft; we don't want to include this in the 1.13.0 release.
1573bf6 to 2662dba
2662dba to 702f3c6
Rebased here: aws-nslick@796e96a. Issues are resolved when running against ofiwg/libfabric#10543.
ceb48dc to c1c6557
c1c6557 to 3e7cefc
approved, tested with this patch last week
It turns out that, with the domain being the primary threading boundary in libfabric, it's not great to assume a 1:1 device-to-domain mapping. We previously hacked something up to make the domain part of the endpoint in some cases, but that makes for some really icky code. This patch adds a domain structure to the object hierarchy and unifies the logic for when to create a new domain between the two transports. Signed-off-by: Brian Barrett <[email protected]>
3e7cefc to 254821c
            fi_close(&domain->domain_rails[i].cq->fid);
            domain->domain_rails[i].cq = NULL;
        }
    }
And free domain_rails 🙂
    }

    for (int i = 0 ; i < domain->num_rails ; ++i) {
        fi_close(&domain->domain_rails[i].domain->fid);
Shouldn't you close the cq before closing the domain?
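To illustrate the two review points above (close the CQs before the domains, and free `domain_rails`), here is a minimal self-contained sketch. It does not call libfabric: every `mock_*` type and function is hypothetical, and `MOCK_EBUSY` is only a stand-in for the `-FI_EBUSY` that libfabric documents for closing a domain while objects opened on it still exist. The field names `num_rails`, `domain_rails`, `cq`, and `domain` mirror the diff; everything else is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for libfabric's FI_EBUSY; the value is arbitrary. */
#define MOCK_EBUSY 16

/* Mock fabric domain: counts child objects that are still open. */
struct mock_fabric_domain { int open_children; };

/* Mock completion queue bound to its parent domain. */
struct mock_cq { struct mock_fabric_domain *parent; };

/* One rail: a fabric domain plus the CQ opened on it. */
struct mock_rail {
    struct mock_fabric_domain *domain;
    struct mock_cq *cq;
};

/* Hypothetical transport-level domain object owning the rail array. */
struct mock_domain {
    int num_rails;
    struct mock_rail *domain_rails;
};

static int mock_cq_close(struct mock_cq *cq)
{
    cq->parent->open_children--;
    free(cq);
    return 0;
}

/* Refuses to close while children are open, modelling -FI_EBUSY. */
static int mock_fabric_domain_close(struct mock_fabric_domain *d)
{
    if (d->open_children > 0)
        return -MOCK_EBUSY;
    free(d);
    return 0;
}

static int mock_domain_free(struct mock_domain *domain)
{
    int ret = 0;

    /* Pass 1: close every CQ so no domain has open children. */
    for (int i = 0; i < domain->num_rails; ++i) {
        if (domain->domain_rails[i].cq != NULL) {
            mock_cq_close(domain->domain_rails[i].cq);
            domain->domain_rails[i].cq = NULL;
        }
    }

    /* Pass 2: close the fabric domains, remembering the last error. */
    for (int i = 0; i < domain->num_rails; ++i) {
        if (domain->domain_rails[i].domain != NULL) {
            int rc = mock_fabric_domain_close(domain->domain_rails[i].domain);
            if (rc != 0)
                ret = rc;
            domain->domain_rails[i].domain = NULL;
        }
    }

    /* Finally release the rail array and the object itself. */
    free(domain->domain_rails);
    free(domain);
    return ret;
}

static struct mock_domain *mock_domain_create(int num_rails)
{
    assert(num_rails > 0);

    struct mock_domain *domain = calloc(1, sizeof(*domain));
    if (domain == NULL)
        return NULL;
    domain->domain_rails = calloc((size_t)num_rails, sizeof(*domain->domain_rails));
    if (domain->domain_rails == NULL) {
        free(domain);
        return NULL;
    }
    domain->num_rails = num_rails;

    for (int i = 0; i < num_rails; ++i) {
        struct mock_rail *rail = &domain->domain_rails[i];
        rail->domain = calloc(1, sizeof(*rail->domain));
        rail->cq = calloc(1, sizeof(*rail->cq));
        if (rail->domain == NULL || rail->cq == NULL) {
            /* This rail's CQ was never counted as a child; drop it directly. */
            free(rail->cq);
            rail->cq = NULL;
            free(rail->domain);
            rail->domain = NULL;
            mock_domain_free(domain);
            return NULL;
        }
        rail->cq->parent = rail->domain;
        rail->domain->open_children = 1;
    }
    return domain;
}
```

In this model, closing a rail's fabric domain while its CQ is still open returns `-MOCK_EBUSY`, whereas `mock_domain_free` succeeds because it closes children first. Whether the real patch should use two passes or close each rail's CQ and domain together is a design choice left to the PR.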
Add a domain object to the transport interface, to reflect the need for fi_domain (and the related MR cache / rkey pool) to either be a single domain per process or a domain per initializing thread, depending on platform. Previously, this was kind of hacked into the code, with extra fi_domain pointers in various structs. Cut a bunch of that out, and move a bunch of the logic into one place, so that the two protocols behave the same.
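As a rough illustration of the "single domain per process or a domain per initializing thread" policy described above, the sketch below keeps a per-device list of domains and picks the lookup key based on a platform flag. All names (`sketch_device`, `sketch_domain`, `sketch_get_domain`, `domain_per_thread`) are hypothetical and not taken from the patch, the thread ID is passed in explicitly to keep the example deterministic, and locking is omitted; a real implementation would need to protect the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical domain object; a real one would own fi_domain, MR cache, etc. */
struct sketch_domain {
    unsigned long key;            /* 0 for the shared domain, else thread ID */
    struct sketch_domain *next;
};

/* Hypothetical device holding every domain created on it. */
struct sketch_device {
    bool domain_per_thread;       /* platform policy (assumed flag) */
    struct sketch_domain *domains;
};

/*
 * Return the domain the calling thread should use, creating it on first
 * use. Not thread-safe: locking is left out of this sketch.
 */
static struct sketch_domain *sketch_get_domain(struct sketch_device *dev,
                                               unsigned long thread_id)
{
    unsigned long key = dev->domain_per_thread ? thread_id : 0;

    for (struct sketch_domain *d = dev->domains; d != NULL; d = d->next) {
        if (d->key == key)
            return d;
    }

    struct sketch_domain *d = calloc(1, sizeof(*d));
    if (d == NULL)
        return NULL;
    d->key = key;
    d->next = dev->domains;
    dev->domains = d;
    return d;
}

/* Free every domain attached to the device. */
static void sketch_device_release(struct sketch_device *dev)
{
    while (dev->domains != NULL) {
        struct sketch_domain *next = dev->domains->next;
        free(dev->domains);
        dev->domains = next;
    }
}
```

With `domain_per_thread` unset, every caller gets the same domain; with it set, each distinct thread ID gets its own domain and repeated calls from the same thread reuse it. Keeping this decision in one function is what lets both transports share the behavior.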
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.