
Add domain object to transports #665

Open · bwbarrett wants to merge 1 commit into master from feature/domain-cleanup
Conversation

@bwbarrett (Contributor)

Add a domain object to the transport interface, to reflect the need for fi_domain (and the related MR cache / rkey pool) to be either a single domain per process or a domain per initializing thread, depending on the platform. Previously, this was hacked into the code with extra fi_domain pointers in various structs. Cut a bunch of that out and move the logic into one place, so that the two protocols behave the same.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
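
To make the per-process vs. per-thread choice concrete, here is a minimal sketch (an editor's illustration, not the actual plugin code; all names are hypothetical) of a domain object that owns the fi_domain and the MR cache, created once per process or once per initializing thread depending on a platform flag:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <pthread.h>

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Hypothetical domain object: owns the libfabric domain and the
 * MR cache / rkey pool that must share its lifetime. */
struct ofi_domain {
	struct fid_domain *fi_domain;
	void *mr_cache;
};

static struct ofi_domain *create_domain(struct fid_fabric *fabric,
					struct fi_info *info)
{
	struct ofi_domain *d = calloc(1, sizeof(*d));
	if (d == NULL)
		return NULL;
	/* fi_domain() opens a domain on the fabric described by info */
	if (fi_domain(fabric, info, &d->fi_domain, NULL) != 0) {
		free(d);
		return NULL;
	}
	return d;
}

/* One shared domain for the whole process... */
static struct ofi_domain *process_domain;
static pthread_mutex_t domain_lock = PTHREAD_MUTEX_INITIALIZER;
/* ...or one per initializing thread (GCC/Clang thread-local storage). */
static __thread struct ofi_domain *thread_domain;

struct ofi_domain *get_domain(struct fid_fabric *fabric,
			      struct fi_info *info,
			      bool domain_per_thread)
{
	if (domain_per_thread) {
		if (thread_domain == NULL)
			thread_domain = create_domain(fabric, info);
		return thread_domain;
	}

	pthread_mutex_lock(&domain_lock);
	if (process_domain == NULL)
		process_domain = create_domain(fabric, info);
	pthread_mutex_unlock(&domain_lock);
	return process_domain;
}
```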

@bwbarrett bwbarrett requested a review from a team as a code owner October 17, 2024 17:23
@bwbarrett bwbarrett force-pushed the feature/domain-cleanup branch 2 times, most recently from d1684a6 to c89b019 Compare October 17, 2024 19:05
@a-szegel (Contributor)

bot:aws:retest

@bwbarrett (Contributor, Author)

bot:aws:retest

@a-szegel (Contributor) commented Oct 19, 2024

The p5 Ubuntu 20.04 (ub2004) NCCL-Tests CI jobs are failing with:

```
[ip-172-31-8-187:39716] *** End of error message ***

ip-172-31-8-187:39710:39776 [2] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer ip-172-31-8-187.af-south-1.compute.internal<51756>
ip-172-31-8-187:39710:39776 [2] NCCL INFO misc/socket.cc:805 -> 6
ip-172-31-8-187:39710:39776 [2] NCCL INFO bootstrap.cc:85 -> 6
ip-172-31-8-187:39710:39776 [2] NCCL INFO bootstrap.cc:543 -> 6
ip-172-31-8-187:39710:39776 [2] NCCL INFO bootstrap.cc:554 -> 6
ip-172-31-8-187:39710:39776 [2] NCCL INFO init.cc:926 -> 6
ip-172-31-8-187:39710:39776 [2] NCCL INFO init.cc:1408 -> 6
ip-172-31-8-187:39710:39776 [2] NCCL INFO group.cc:70 -> 6 [Async thread]
ip-172-31-8-187:39710:39710 [2] NCCL INFO group.cc:420 -> 6
ip-172-31-8-187:39710:39710 [2] NCCL INFO group.cc:546 -> 6
ip-172-31-8-187:39710:39710 [2] NCCL INFO group.cc:101 -> 6
ip-172-31-8-187: Test NCCL failure common.cu:1001 'remote process exited or there was a network error / '
 .. ip-172-31-8-187 pid 39710: Test failure common.cu:880
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 24 with PID 15638 on node 172.31.10.220 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

....

(76 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================== short test summary info ============================
FAILED test_suites/aws_ofi_nccl/test_nccl_tests.py::test_run_nccl_tests[0x0-base-sendrecv_perf]
===== 1 failed, 27 passed, 11 skipped, 8 deselected in 2478.88s (0:41:18) ======
```

@xwangxin

bot:aws:retest

@bwbarrett (Contributor, Author)

bot:aws:retest

I can't reproduce the crash @a-szegel posted; it may just have been a network fault.

@a-szegel (Contributor)

Builds 3 through 8 have all had issues on the p5s. At this point the main suspect is this PR.

@xwangxin

bot:aws:retest

@xwangxin

bot:aws:retest

@bwbarrett bwbarrett force-pushed the feature/domain-cleanup branch 3 times, most recently from 3f6e828 to 23cae5f Compare October 23, 2024 03:30
@aws-nslick (Contributor)

bot:aws:retest

@bwbarrett bwbarrett marked this pull request as draft October 26, 2024 03:34
@bwbarrett (Contributor, Author)

Switching to draft; we don't want to include this in the 1.13.0 release.

@aws-nslick (Contributor)

Rebased here: aws-nslick@796e96a

The issues are resolved when running against ofiwg/libfabric#10543.

@bwbarrett bwbarrett force-pushed the feature/domain-cleanup branch 2 times, most recently from ceb48dc to c1c6557 Compare November 21, 2024 22:22
@bwbarrett bwbarrett marked this pull request as ready for review November 21, 2024 22:40
@aws-nslick (Contributor) previously approved these changes Nov 22, 2024

approved, tested with this patch last week

Resolved review threads on: src/nccl_ofi_net.c, src/nccl_ofi_rdma.c (one thread outdated)
Commit message:

It turns out that with the domain being the primary threading boundary in libfabric, assuming a 1:1 device-to-domain mapping is not great. We previously hacked something up to make the domain part of the endpoint in some cases, but that makes for some really icky code.

This patch adds a domain structure to the object hierarchy and unifies the "when to create a new domain" logic between the two transports.

Signed-off-by: Brian Barrett <[email protected]>
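
To make the new hierarchy concrete, here is a hedged sketch (an editor's illustration; the struct and field names are hypothetical, not the plugin's actual types) of where the domain sits after this patch, between the device and the endpoint:

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Hypothetical object hierarchy: device -> domain -> endpoint.
 * The domain owns fi_domain and the MR cache, so endpoints no
 * longer carry their own stray fi_domain pointers. */
struct device {
	struct fid_fabric *fabric;   /* one fabric per device */
	struct domain *domains;      /* one per process, or one per thread */
};

struct domain {
	struct device *device;       /* back-pointer to the owning device */
	struct fid_domain *fi_domain;
	void *mr_cache;              /* MR cache / rkey pool lives here now */
};

struct endpoint {
	struct domain *domain;       /* domain-scoped state reached via this */
	struct fid_ep *ofi_ep;
};
```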
A review thread on the following hunk in the domain cleanup path:

```c
fi_close(&domain->domain_rails[i].cq->fid);
domain->domain_rails[i].cq = NULL;
}
}
```
Review comment (Contributor):

And free domain_rails 🙂

Another thread on the domain close loop:

```c
}

for (int i = 0 ; i < domain->num_rails ; ++i) {
	fi_close(&domain->domain_rails[i].domain->fid);
```
Review comment (Contributor):

Shouldn't you close the cq before closing the domain?
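
Putting both review comments together, a minimal sketch (an editor's illustration; the containing types are hypothetical, with field names taken from the hunks above) of the requested teardown order: close each rail's CQ before its domain, since the CQ is bound to the domain, then free the domain_rails array itself:

```c
#include <stdlib.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Hypothetical per-rail state and its container. */
struct domain_rail {
	struct fid_domain *domain;
	struct fid_cq *cq;
};

struct ofi_rdma_domain {
	int num_rails;
	struct domain_rail *domain_rails;
};

static void domain_rails_cleanup(struct ofi_rdma_domain *domain)
{
	for (int i = 0; i < domain->num_rails; ++i) {
		/* Close the CQ before the domain it was opened on. */
		if (domain->domain_rails[i].cq != NULL) {
			fi_close(&domain->domain_rails[i].cq->fid);
			domain->domain_rails[i].cq = NULL;
		}
		if (domain->domain_rails[i].domain != NULL) {
			fi_close(&domain->domain_rails[i].domain->fid);
			domain->domain_rails[i].domain = NULL;
		}
	}

	/* And free domain_rails, per the first review comment. */
	free(domain->domain_rails);
	domain->domain_rails = NULL;
}
```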
