hpgmg and gds_kernel_latency errors on DGX TeslaV100 #61

Open
e-ago opened this issue May 11, 2018 · 4 comments
@e-ago
Collaborator

e-ago commented May 11, 2018

DGX-1V settings:

  • Tesla V100
  • CUDA 9.2
  • NVIDIA driver r396.14
  • libgdsync sm_70

gds_kernel_loopback_latency executes without errors.

gds_kernel_latency returns some errors:

STDOUT:

pre-posting took 2272.00 usec

batch info: rx+kernel+tx 20 per batch
pre-posted 60 sequences in 2 batches
GPU kernel calc buf size: 131072
iters=1000 tx/rx_depth=1024

testing....
[1] batch 1: posted 20 sequences
[1] batch 2: posted 20 sequences
pre-posting took 2757.00 usec
gpu_wait_tracking_event nothing to do (12)
gpu_wait_tracking_event nothing to do (12)
gpu_wait_tracking_event nothing to do (12)
....

STDERR:

[3] unexpected rx ev 12, batch len 20
[4] unexpected rx ev 11, batch len 20
[5] unexpected rx ev 13, batch len 20
[6] unexpected rx ev 11, batch len 20
[7] unexpected rx ev 13, batch len 20
[8] unexpected tx ev 18, batch len 20
[8] unexpected rx ev 14, batch len 20
[9] unexpected rx ev 16, batch len 20
….

The run sometimes hangs and sometimes completes.
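
To quantify the stderr messages above, a small script can tally them per direction. This is only a sketch and not part of libgdsync: the function name `summarize_unexpected_events` is hypothetical, and it assumes the number after `ev` is the count of events observed and `batch len` is the count expected, which the log itself does not state.

```python
import re
from collections import Counter

# Matches lines such as "[3] unexpected rx ev 12, batch len 20".
EVENT_RE = re.compile(r"\[(\d+)\] unexpected (rx|tx) ev (\d+), batch len (\d+)")

def summarize_unexpected_events(stderr_text):
    """Count unexpected rx/tx messages and list (batch, kind, batch_len - ev).

    Assumption: 'ev' is the observed event count and 'batch len' the expected one.
    """
    counts = Counter()
    differences = []
    for m in EVENT_RE.finditer(stderr_text):
        batch, kind = int(m.group(1)), m.group(2)
        seen, expected = int(m.group(3)), int(m.group(4))
        counts[kind] += 1
        differences.append((batch, kind, expected - seen))
    return counts, differences

# Lines copied from the STDERR excerpt in this report.
SAMPLE_STDERR = """\
[3] unexpected rx ev 12, batch len 20
[4] unexpected rx ev 11, batch len 20
[5] unexpected rx ev 13, batch len 20
[6] unexpected rx ev 11, batch len 20
[7] unexpected rx ev 13, batch len 20
[8] unexpected tx ev 18, batch len 20
[8] unexpected rx ev 14, batch len 20
[9] unexpected rx ev 16, batch len 20
"""

counts, differences = summarize_unexpected_events(SAMPLE_STDERR)
print(dict(counts))
for batch, kind, diff in differences:
    print(f"batch {batch}: {kind} differs from batch len by {diff}")
```

Running it on a full stderr capture would show whether the mismatch is mostly on the rx side, as the excerpt suggests.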

HPGMG reports no errors, but its results are incorrect (with both CUDA 9.0 and 9.2). For instance, for a run with 2 processes, the SA model, and input parameters 5 and 8:

===== Warming up by running 10 solves ==========================================
FMGSolve... f-cycle     norm=1.308041533821267e-02  rel=1.352815237149764e-02  done (0.014266 seconds)
FMGSolve... f-cycle     norm=6.932829732453349e-05  rel=7.170137534722465e-05  done (0.008652 seconds)
FMGSolve... f-cycle     norm=2.088214618404847e-04  rel=2.159693313379883e-04  done (0.008700 seconds)
FMGSolve... f-cycle     norm=1.278232634365474e-03  rel=1.321985991790363e-03  done (0.008623 seconds)
FMGSolve... f-cycle     norm=1.514943477709386e-03  rel=1.566799346255277e-03  done (0.008639 seconds)
FMGSolve... f-cycle     norm=6.932835203155019e-05  rel=7.170143192683918e-05  done (0.008656 seconds)
FMGSolve... f-cycle     norm=8.649945592323429e-04  rel=8.946029537476847e-04  done (0.009008 seconds)
FMGSolve... f-cycle     norm=1.532839596755178e-02  rel=1.585308041816546e-02  done (0.008753 seconds)
FMGSolve... f-cycle     norm=6.936241119703812e-05  rel=7.173665692302264e-05  done (0.008598 seconds)
FMGSolve... f-cycle     norm=6.932763278943987e-05  rel=7.170068806537846e-05  done (0.008599 seconds)

WARMUP TIME: 0.093335


===== Running 100 solves ========================================================
FMGSolve... f-cycle     norm=6.932763278943987e-05  rel=7.170068806537846e-05  done (0.009424 seconds)

The correct result would be:
FMGSolve... f-cycle norm=6.934041112871547e-05 rel=7.171390380175266e-05 done (0.010723 seconds)

These errors don't appear on brdw0/1 P100.
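
The mismatch in the HPGMG output above can be made explicit by comparing each reported norm against the reference value. The following is a minimal sketch, not part of HPGMG: the helper name `relative_deviations` is hypothetical, the reference norm is the one quoted above as the correct result, and no pass/fail tolerance is assumed because the report does not give one.

```python
import re

# Reference norm quoted in this report as the correct result.
REFERENCE_NORM = 6.934041112871547e-05

# Captures the value following "norm=" on an FMGSolve line.
NORM_RE = re.compile(r"norm=([0-9.eE+-]+)")

def relative_deviations(log_text, reference=REFERENCE_NORM):
    """Return (norm, |norm - reference| / reference) for each FMGSolve line."""
    results = []
    for line in log_text.splitlines():
        if not line.startswith("FMGSolve"):
            continue
        match = NORM_RE.search(line)
        if match:
            norm = float(match.group(1))
            results.append((norm, abs(norm - reference) / reference))
    return results

# Three lines copied from the output above: two warm-up solves and the timed solve.
SAMPLE_LOG = """\
FMGSolve... f-cycle     norm=1.308041533821267e-02  rel=1.352815237149764e-02  done (0.014266 seconds)
FMGSolve... f-cycle     norm=6.932829732453349e-05  rel=7.170137534722465e-05  done (0.008652 seconds)
FMGSolve... f-cycle     norm=6.932763278943987e-05  rel=7.170068806537846e-05  done (0.009424 seconds)
"""

for norm, deviation in relative_deviations(SAMPLE_LOG):
    print(f"norm={norm:.15e}  relative deviation={deviation:.3e}")
```

Feeding the complete warm-up log to the same function would show how much the norm varies from solve to solve, which a deterministic solver should not do.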

@e-ago e-ago added the bug label May 11, 2018
@e-ago e-ago changed the title from "gds_kernel_latency errors on DGX TeslaV100" to "hpgmg and gds_kernel_latency errors on DGX TeslaV100" May 11, 2018
@drossetti
Contributor

Is this with GPU (RDMA) or CPU memory?

@drossetti
Contributor

We know there is an RDMA bug in the r396 driver.

@drossetti
Contributor

Thinking about it more, gds_kernel_latency and HPGMG could be separate problems.

@e-ago
Collaborator Author

e-ago commented May 11, 2018

GPU memory in the case of gds_kernel_latency.
Host memory in the case of HPGMG.

I agree we should separate the problems.
