Describe the bug
When running GemmRS on two nodes, each with 4 A100 80GB GPUs connected via NVLink (each node has one NIC on IB HDR200), the run fails with a CUDA illegal memory access:
W0907 22:34:09.000000 22438061766464 torch/distributed/run.py:779]
W0907 22:34:09.000000 22438061766464 torch/distributed/run.py:779] *****************************************
W0907 22:34:09.000000 22438061766464 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0907 22:34:09.000000 22438061766464 torch/distributed/run.py:779] *****************************************
before flux shm initialization
before flux shm initialization
before flux shm initialization
before flux shm initialization
after flux shm initialization
after flux shm initialization
after flux shm initialization
after flux shm initialization
[rank1]:[E907 22:34:22.723784108 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1724789115564/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x145b3b587f86 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x145b3b536d10 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x145b3b663ee8 in flux/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x145b3c86fa36 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x145b3c874c50 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x145b3c87b90a in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x145b3c87dd4c in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x145b9ef0bbf4 in flux/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x81ca (0x145bb4c711ca in /lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x145bb4153e73 in /lib64/libc.so.6)
[rank2]:[E907 22:34:22.821294480 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1724789115564/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x14e20c855f86 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x14e20c804d10 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x14e20c931ee8 in flux/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x14e20db3da36 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x14e20db42c50 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x14e20db4990a in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x14e20db4bd4c in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x14e2701d9bf4 in flux/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x81ca (0x14e285f3f1ca in /lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x14e285421e73 in /lib64/libc.so.6)
[rank3]:[E907 22:34:22.821407422 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 3] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1724789115564/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x1529560c1f86 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x152956070d10 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x15295619dee8 in flux/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x1529573a9a36 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x1529573aec50 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x1529573b590a in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x1529573b7d4c in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x1529b9a45bf4 in flux/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x81ca (0x1529cf7ab1ca in /lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x1529cec8de73 in /lib64/libc.so.6)
[rank0]:[E907 22:34:22.823452114 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1724789115564/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x153118a11f86 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x1531189c0d10 in flux/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x153118aedee8 in flux/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x153119cf9a36 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x153119cfec50 in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x153119d0590a in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x153119d07d4c in flux/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x15317c395bf4 in flux/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x81ca (0x1531920fb1ca in /lib64/libpthread.so.0)
frame #9: clone + 0x43 (0x1531915dde73 in /lib64/libc.so.6)
W0907 22:34:23.038000 22438061766464 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1834997 closing signal SIGTERM
W0907 22:34:23.038000 22438061766464 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1834999 closing signal SIGTERM
W0907 22:34:23.038000 22438061766464 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1835000 closing signal SIGTERM
E0907 22:34:23.482000 22438061766464 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 1834998) of binary: flux/bin/python
Traceback (most recent call last):
File "flux/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.4.1', 'console_scripts', 'torchrun')())
File "flux/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "flux/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "flux/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "flux/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "flux/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
test/test_gemm_rs.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
To Reproduce
Launch test/test_gemm_rs.py with torchrun across the two nodes, 4 processes per node; a sketch of the launch command is given below.
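A minimal sketch of the two-node launch, assuming plain torchrun is used directly (the rendezvous endpoint and port are placeholders, not values taken from the original report):
$ # on node 0; repeat on node 1 with --node_rank=1
$ torchrun --nnodes=2 --nproc_per_node=4 --node_rank=0 \
    --rdzv_backend=c10d --rdzv_endpoint=<node0_ip>:29500 \
    test/test_gemm_rs.py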
Expected behavior
The GemmRS test should complete on both nodes without CUDA errors.
Stack trace/logs
See the torchrun output and the ProcessGroupNCCL watchdog traceback above.
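As the watchdog message itself suggests, rerunning with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the failing call is reported at the correct frame; a sketch, with the torchrun arguments abbreviated:
$ # synchronous kernel launches give a more precise stack trace
$ CUDA_LAUNCH_BLOCKING=1 torchrun <same arguments as above> test/test_gemm_rs.py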
Environment
Each node has 4 A100 80GB GPUs, connected via NVLink.
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV4 NV4 NV4 SYS 24-34,36-46 1 N/A
GPU1 NV4 X NV4 NV4 SYS 0-10,12-22 0 N/A
GPU2 NV4 NV4 X NV4 NODE 72-82,84-94 3 N/A
GPU3 NV4 NV4 NV4 X SYS 48-58,60-70 2 N/A
NIC0 SYS SYS NODE SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
The inter-node interconnect is IB HDR200:
$ ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.36.1010
Hardware version: 0
Node GUID: 0xe8ebd30300401a34
System image GUID: 0xe8ebd30300401a34
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 766
LMC: 0
SM lid: 1864
Capability mask: 0xa651e848
Port GUID: 0xe8ebd30300401a34
Link layer: InfiniBand
Proposed fix
If you have a proposal for how to fix the issue, state it here or link to a PR.
Additional context
Add any other context about the problem here.
Thanks for your interest! @YJHMITWEB
How do you run the test for multi-node: via the launch script we provide, or just torchrun?
If you check the launch.sh under the script folder, you can see that we haven't released (or at least not fully released) multi-node support yet, IIRC. cc @zheng-ningxin
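For reference, a single-node run via the provided launcher would look roughly like the following; the script path and the lack of extra arguments are assumptions based on the comment above, not verified against the released launch.sh:
$ # single-node, 4-GPU run through the repository's launch script (path assumed)
$ bash scripts/launch.sh test/test_gemm_rs.py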