Support RoCE protocol in tests #10

Open
ksang opened this issue May 17, 2017 · 2 comments

ksang commented May 17, 2017

I'm running the tests gds_kernel_latency and gds_kernel_loopback_latency.
Both of them fail with the same error.
Please see the logs below and let me know whether I'm running the tests correctly.

Platform:
MLNX_OFED_LINUX-4.0-1.0.1.0
CUDA 8.0
Ubuntu 16.04.1 LTS, 4.4.0-75-generic

Logs:

./gds_kernel_loopback_latency
libgdsync build version 0x00010001, major=1 minor=1
libgdsync queried version 0x00010001
[0] pid=55803 server:localhost
requested IB device: <(null)>
picking 1st available device
use gpumem: 0
initializing CUDA
There are 1 devices supporting CUDA, picking N.0
GPU id:0 dev:0 name:Tesla K40c pci 7:0
creating CUDA Primary Ctx on device:0 id:0
making it the current CUDA Ctx
num SMs per GPU:15
clock rate:745000
created main test CUDA stream 0x15f3d20
created stream server CUDA stream 0x1733710
created stream cliebt CUDA stream 0x17b6ad0
created 0 tracking event 0x185c470
created 1 tracking event 0x185c5d0
created 2 tracking event 0x185c730
created 3 tracking event 0x185c890
allocating CPU memory buf
allocated CPU buffer address at 0x1899000
ctx buf=0x1899000
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 0000001305ba0000
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 0000001305bc0000
  local address:  LID 0x0000, QPN 0x001a5c, PSN 0x694def: GID ::
0000:001a5c:694def:
0000:001a5c:694def
  remote address: LID 0x0000, QPN 0x001a5c, PSN 0x694def, GID ::
libibverbs: GRH is mandatory For RoCE address handle
Failed to create AH

mpirun -n 2 ./gds_kernel_latency
libgdsync build version 0x00010001, major=1 minor=1
libgdsync queried version 0x00010001
libgdsync build version 0x00010001, major=1 minor=1
libgdsync queried version 0x00010001
[0] pid=55809 client:WD195041
[1] pid=55810 server:WD195041
[0] picking 1st available device
initializing CUDA
There are 1 devices supporting CUDA, picking N.0
[1] picking 1st available device
initializing CUDA
There are 1 devices supporting CUDA, picking N.0
GPU id:0 dev:0 name:Tesla K40c pci 7:0
creating CUDA Primary Ctx on device:0 id:0
GPU id:0 dev:0 name:Tesla K40c pci 7:0
creating CUDA Primary Ctx on device:0 id:0
making it the current CUDA Ctx
num SMs per GPU:15
making it the current CUDA Ctx
num SMs per GPU:15
clock rate:745000
clock rate:745000
created main test CUDA stream 0xb81660
created stream server CUDA stream 0xcc4e80
created main test CUDA stream 0x2316690
created stream cliebt CUDA stream 0xd47340
created stream server CUDA stream 0x2459eb0
created 0 tracking event 0xdedc10
created 1 tracking event 0xdedd70
created stream cliebt CUDA stream 0x24dc370
created 0 tracking event 0x2582c40
created 2 tracking event 0xdeded0
created 3 tracking event 0xdee030
created 1 tracking event 0x2582da0
created 2 tracking event 0x2582f00
created 3 tracking event 0x2583060
cuMemAlloc() of a 12288 bytes GPU buffer
allocated GPU buffer address at 0000001305ba0000
ctx buf=0x1305ba0000
poisoning GPU buffer, filled with '0x00' !!!
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 0000001305ca0000
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 0000001305cc0000
[0]  local address:  LID 0x0000, QPN 0x001a5d, PSN 0x6a5fbc: GID ▒B▒▒�
cuMemAlloc() of a 12288 bytes GPU buffer
allocated GPU buffer address at 0000001305ba0000
ctx buf=0x1305ba0000
poisoning GPU buffer, filled with '0x00' !!!
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 0000001305ca0000
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 0000001305cc0000
[1]  local address:  LID 0x0000, QPN 0x001a5e, PSN 0x80a20a: GID
[0] remote address: LID 0x0000, QPN 0x001a5e, PSN 0x80a20a, GID ::
[1] remote address: LID 0x0000, QPN 0x001a5d, PSN 0x6a5fbc, GID ::
libibverbs: GRH is mandatory For RoCE address handle
Failed to create AH
libibverbs: GRH is mandatory For RoCE address handle
Failed to create AH

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 55809 RUNNING AT WD195041
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
@drossetti
Contributor

These tests are derived from an earlier version of the libibverbs rc_pingpong example, and they lack support for RoCE.
They should work in InfiniBand mode.
I happily accept patches.
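
For anyone picking this up, here is a rough sketch of one piece such a patch would probably need. The logs above show the remote GID as `::`, meaning the tests never exchange a GID, while RoCE needs one to build the GRH. The usual verbs pattern is to read the local GID with `ibv_query_gid()`, send it to the peer alongside the LID/QPN/PSN, and then set `ah_attr.is_global = 1` and fill `ah_attr.grh.dgid`, `ah_attr.grh.sgid_index` and `ah_attr.grh.hop_limit` before creating the AH. The helpers below only cover the exchange step: encoding a 16-byte GID as 32 hex digits and decoding it again. `gid16_t`, `gid_to_wire`, `wire_to_gid` and `gid_is_zero` are hypothetical names, and `gid16_t` is a stand-in for the `raw[16]` member of `union ibv_gid` so the sketch compiles without the libibverbs headers. None of this is taken from the libgdsync sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the 16-byte raw[] member of union ibv_gid. */
typedef struct { uint8_t raw[16]; } gid16_t;

/* Encode a GID as 32 lowercase hex digits plus a NUL terminator. */
static void gid_to_wire(const gid16_t *gid, char wire[33])
{
    assert(gid != NULL && wire != NULL);
    for (int i = 0; i < 16; ++i)
        snprintf(wire + 2 * i, 3, "%02x", (unsigned int)gid->raw[i]);
}

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode 32 hex digits into a GID. Returns 0 on success, -1 on bad input. */
static int wire_to_gid(const char *wire, gid16_t *gid)
{
    assert(wire != NULL && gid != NULL);
    if (strlen(wire) != 32)
        return -1;
    for (int i = 0; i < 16; ++i) {
        int hi = hex_val(wire[2 * i]);
        int lo = hex_val(wire[2 * i + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        gid->raw[i] = (uint8_t)(hi * 16 + lo);
    }
    return 0;
}

/* An all-zero GID (printed as "::" in the logs) means none was exchanged. */
static int gid_is_zero(const gid16_t *gid)
{
    static const gid16_t zero = {{0}};
    return memcmp(gid->raw, zero.raw, sizeof zero.raw) == 0;
}
```

With something like this in place, the connection message could carry the GID as a fourth field, and the receiving side could take the GRH path only when `gid_is_zero()` returns false, which would leave the existing InfiniBand path untouched.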

@drossetti changed the title from '"Failed to create AH" when running tests' to 'Support RoCE protocol in tests' on May 18, 2017
@drossetti
Contributor

As a reminder, both gds_kernel_latency and gds_kernel_loopback_latency need to be fixed.
