hpgmg and gds_kernel_latency errors on DGX TeslaV100 #61

e-ago · 2018-05-11T12:45:03Z

DGX-1V settings:

Tesla V100
CUDA 9.2
r396.14
libgdsync sm_70

gds_kernel_loopback_latency executes without errors.

gds_kernel_latency returns some errors:

STDOUT:

pre-posting took 2272.00 usec

batch info: rx+kernel+tx 20 per batch
pre-posted 60 sequences in 2 batches
GPU kernel calc buf size: 131072
iters=1000 tx/rx_depth=1024

testing....
[1] batch 1: posted 20 sequences
[1] batch 2: posted 20 sequences
pre-posting took 2757.00 usec
gpu_wait_tracking_event nothing to do (12)
gpu_wait_tracking_event nothing to do (12)
gpu_wait_tracking_event nothing to do (12)
....

STDERR:

[3] unexpected rx ev 12, batch len 20
[4] unexpected rx ev 11, batch len 20
[5] unexpected rx ev 13, batch len 20
[6] unexpected rx ev 11, batch len 20
[7] unexpected rx ev 13, batch len 20
[8] unexpected tx ev 18, batch len 20
[8] unexpected rx ev 14, batch len 20
[9] unexpected rx ev 16, batch len 20
….

The run sometimes hangs and sometimes completes.
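For triage, the STDERR lines can be tallied with a small script. This is only a sketch: the function name `summarize_unexpected` is hypothetical, and it assumes that `ev N` is the number of events actually seen and `batch len` the number expected, which the log format suggests but the output itself does not confirm.

```python
import re
from collections import Counter

# Matches lines such as "[3] unexpected rx ev 12, batch len 20".
LINE_RE = re.compile(r"\[(\d+)\] unexpected (rx|tx) ev (\d+), batch len (\d+)")

def summarize_unexpected(stderr_text):
    """Count unexpected-event lines per direction and list the shortfalls.

    Assumption: "ev N" is the observed event count and "batch len" the
    expected one, so shortfall = batch len - ev.
    """
    counts = Counter()
    shortfalls = []
    for m in LINE_RE.finditer(stderr_text):
        iteration, direction = int(m.group(1)), m.group(2)
        seen, expected = int(m.group(3)), int(m.group(4))
        counts[direction] += 1
        shortfalls.append((iteration, direction, expected - seen))
    return counts, shortfalls

# The STDERR excerpt quoted above.
sample = """\
[3] unexpected rx ev 12, batch len 20
[4] unexpected rx ev 11, batch len 20
[5] unexpected rx ev 13, batch len 20
[6] unexpected rx ev 11, batch len 20
[7] unexpected rx ev 13, batch len 20
[8] unexpected tx ev 18, batch len 20
[8] unexpected rx ev 14, batch len 20
[9] unexpected rx ev 16, batch len 20
"""

counts, shortfalls = summarize_unexpected(sample)
print(dict(counts))
print(shortfalls[:3])
```

Running it over the full STDERR of a hanging run and of a completing run may show whether the two cases differ in how many events go missing.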

HPGMG reports no errors, but its results are incorrect (with both CUDA 9.0 and 9.2). For instance, a run with 2 processes, the SA model, and input parameters 5 and 8 gives:

===== Warming up by running 10 solves ==========================================
FMGSolve... f-cycle     norm=1.308041533821267e-02  rel=1.352815237149764e-02  done (0.014266 seconds)
FMGSolve... f-cycle     norm=6.932829732453349e-05  rel=7.170137534722465e-05  done (0.008652 seconds)
FMGSolve... f-cycle     norm=2.088214618404847e-04  rel=2.159693313379883e-04  done (0.008700 seconds)
FMGSolve... f-cycle     norm=1.278232634365474e-03  rel=1.321985991790363e-03  done (0.008623 seconds)
FMGSolve... f-cycle     norm=1.514943477709386e-03  rel=1.566799346255277e-03  done (0.008639 seconds)
FMGSolve... f-cycle     norm=6.932835203155019e-05  rel=7.170143192683918e-05  done (0.008656 seconds)
FMGSolve... f-cycle     norm=8.649945592323429e-04  rel=8.946029537476847e-04  done (0.009008 seconds)
FMGSolve... f-cycle     norm=1.532839596755178e-02  rel=1.585308041816546e-02  done (0.008753 seconds)
FMGSolve... f-cycle     norm=6.936241119703812e-05  rel=7.173665692302264e-05  done (0.008598 seconds)
FMGSolve... f-cycle     norm=6.932763278943987e-05  rel=7.170068806537846e-05  done (0.008599 seconds)

WARMUP TIME: 0.093335


===== Running 100 solves ========================================================
FMGSolve... f-cycle     norm=6.932763278943987e-05  rel=7.170068806537846e-05  done (0.009424 seconds)

The correct result would be:
FMGSolve... f-cycle norm=6.934041112871547e-05 rel=7.171390380175266e-05 done (0.010723 seconds)

These errors don't appear on brdw0/1 P100.
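The HPGMG deviation can be checked automatically by comparing each reported norm against the correct value quoted above. The following is a minimal sketch: the function name `deviating_solves` and the tolerances are illustrative assumptions, not part of HPGMG.

```python
import re

# Matches the norm in lines such as
# "FMGSolve... f-cycle     norm=6.932829732453349e-05  rel=...".
NORM_RE = re.compile(r"FMGSolve\.\.\.\s+f-cycle\s+norm=([0-9.eE+-]+)")

# Correct norm as quoted in this report.
REFERENCE_NORM = 6.934041112871547e-05

def deviating_solves(log_text, reference_norm, rel_tol):
    """Return (solve index, norm) for each solve whose norm differs from
    reference_norm by more than rel_tol in relative terms."""
    bad = []
    for i, m in enumerate(NORM_RE.finditer(log_text)):
        norm = float(m.group(1))
        if abs(norm - reference_norm) > rel_tol * reference_norm:
            bad.append((i, norm))
    return bad

# First three warm-up solves from the output above.
sample = """\
FMGSolve... f-cycle     norm=1.308041533821267e-02  rel=1.352815237149764e-02  done (0.014266 seconds)
FMGSolve... f-cycle     norm=6.932829732453349e-05  rel=7.170137534722465e-05  done (0.008652 seconds)
FMGSolve... f-cycle     norm=2.088214618404847e-04  rel=2.159693313379883e-04  done (0.008700 seconds)
"""

# A loose tolerance (assumed) flags only the grossly wrong solves;
# a tight one also flags the solve that is close to the reference.
print(deviating_solves(sample, REFERENCE_NORM, rel_tol=1e-3))
print(deviating_solves(sample, REFERENCE_NORM, rel_tol=1e-5))
```

The right tolerance depends on how reproducible the norm is expected to be across runs, which this report does not state.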


drossetti · 2018-05-11T19:30:18Z

Is this with GPU (RDMA) or CPU memory?

drossetti · 2018-05-11T19:30:36Z

We know there is an RDMA bug in the r396 driver.

drossetti · 2018-05-11T19:31:19Z

Thinking about it more, gds_kernel_latency and HPGMG could be separate problems.

e-ago · 2018-05-11T20:23:50Z

GPU memory in the case of gds_kernel_latency.
Host memory in the case of HPGMG.

I agree we should separate the problems.

e-ago added the bug label May 11, 2018

e-ago changed the title ~~gds_kernel_latency errors on DGX TeslaV100~~ hpgmg and gds_kernel_latency errors on DGX TeslaV100 May 11, 2018

e-ago mentioned this issue May 23, 2018

support for extended memops #59

Merged

drossetti mentioned this issue May 24, 2018

gds_kernel_latency fails on V100 when NOR is enabled #68

Closed
