Error with hdf5 output on current dev branch #3037

Closed
finnolec opened this issue Aug 26, 2019 · 7 comments
Labels
bug (a bug in the project's code), component: plugin (in PIConGPU plugin)

Comments

@finnolec
Contributor

finnolec commented Aug 26, 2019

I pulled the current dev branch (0f800d0) and set up a simulation based on the Kelvin-Helmholtz example. With HDF5 output enabled (the default), the simulation crashes immediately with the following error messages:

stdout
Running program...
PIConGPU: 0.5.0-dev
  Build-Type: Release

Third party:
  OS:         Linux-3.10.0-693.11.6.el7.x86_64
  arch:       x86_64
  CXX:        GNU (7.3.0)
  CMake:      3.11.3
  CUDA:       9.2.88
  mallocMC:   2.3.1
  Boost:      1.68.0
  MPI:        
    standard: 3.1
    flavor:   OpenMPI (2.1.6)
  PNGwriter:  0.7.0
  libSplash:  1.7.0 (Format 4.0)
  ADIOS:      1.13.1
PIConGPUVerbose PHYSICS(1) | Sliding Window is OFF
PIConGPUVerbose PHYSICS(1) | used Random Number Generator: RNGProvider3XorMin seed: 42
PIConGPUVerbose PHYSICS(1) | Courant c*dt <= 1.00556 ? 1
PIConGPUVerbose PHYSICS(1) | species e: omega_p * dt <= 0.1 ? 0.0319333
PIConGPUVerbose PHYSICS(1) | species i: omega_p * dt <= 0.1 ? 0.000745229
PIConGPUVerbose PHYSICS(1) | macro particles per device: 3686400
PIConGPUVerbose PHYSICS(1) | typical macro particle weighting: 326.577
PIConGPUVerbose PHYSICS(1) | UNIT_SPEED 2.99792e+08
PIConGPUVerbose PHYSICS(1) | UNIT_TIME 1.79e-16
PIConGPUVerbose PHYSICS(1) | UNIT_LENGTH 5.36628e-08
PIConGPUVerbose PHYSICS(1) | UNIT_MASS 2.97492e-28
PIConGPUVerbose PHYSICS(1) | UNIT_CHARGE 5.23234e-17
PIConGPUVerbose PHYSICS(1) | UNIT_EFIELD 9.5224e+12
PIConGPUVerbose PHYSICS(1) | UNIT_BFIELD 31763.3
PIConGPUVerbose PHYSICS(1) | UNIT_ENERGY 2.67372e-11
initialization time: 10sec 721msec = 10 sec
full simulation time: 10sec 741msec = 10 sec
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node gp005 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpiexec during cleanup)

------------------------------
- - - - - job epilog - - - - -
------------------------------
Job ID: 1879910
was running on nodes: gp[005,008-010]
by user: carste06
in partition: gpu
using account: default
number of CPUs used: 96
number of nodes requested: 4
------------------------------
walltime reqd: 23:53:00
walltime used: 00:01:11
------------------------------
Mon Aug 26 15:24:45 CEST 2019
------------------------------

And here is the stderr, which is too long to paste directly into the issue:
stderr

I ran this simulation on the gpu partition of hemera with etc/include/16.cfg, but I think I've seen it on k20 too. Without HDF5 output the simulation runs; the crash only seems to appear when HDF5 output is enabled.

n01r added the labels "bug" (a bug in the project's code) and "component: plugin" (in PIConGPU plugin) on Aug 26, 2019
@n01r
Member

n01r commented Aug 26, 2019

Hmm, the key line in stderr seems to be:

Unhandled exception of type 'St12out_of_range' with message 'vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)', terminating

The following little program reproduces the same error message, although I don't yet know exactly where yours comes from.

#include <iostream>
#include <vector>

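// v has two elements, so the only valid indices are 0 and 1.
int bad_function() {
    std::vector<int> v(2);
    return v.at(2);  // at() is bounds-checked: index 2 throws std::out_of_range
}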

int main() {
    bad_function();
    return 0;
}

Output:

terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)
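The message says that index 2 was requested from a vector holding 2 elements, so the access is off by one past the end. As a minimal sketch of what happens when the exception is caught rather than left to terminate the program, the snippet below returns the what() text for an invalid index. The name rangeCheckMessage is hypothetical and is not part of PIConGPU; the exact wording of the message depends on the standard library in use.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of PIConGPU): returns an empty string if
// index n is valid for v, otherwise the what() text of the exception that
// std::vector::at throws.
std::string rangeCheckMessage(const std::vector<int>& v, std::size_t n) {
    try {
        (void)v.at(n);  // at() is bounds-checked; operator[] is not
        return std::string();
    } catch (const std::out_of_range& e) {
        return e.what();  // exact wording depends on the standard library
    }
}
```

For a vector of size 2, index 1 yields an empty string, while index 2 yields a message like the one in the stderr above.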

@finnolec
Contributor Author

If I start the Laser Wakefield example I get the same error messages...

@ax3l
Member

ax3l commented Aug 26, 2019

Although this seems to be a logic error on our side (maybe check the latest updates/PRs regarding the plugin), I would recommend not using parallel HDF5 with the ancient OpenMPI version in the current module.

Maybe check whether newer modules exist in parallel, or request them? Here are the recommended releases that mitigate two severe OpenMPI bugs we found with typical sims on Hemera:
openPMD/openPMD-api#446
Instead of OpenMPI (2.1.6), use one of these versions:

  • v3.0.4 or newer for the 3.0.X line
  • v3.1.4 or newer for the 3.1.X line
  • v4.0.1 or newer
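The version rule above can be written down as a small check. This is only a sketch: the function name isRecommendedOpenMpi is hypothetical, and it assumes that "v4.0.1 or newer" also covers later release lines such as 4.1.x and 5.x, which the list does not state explicitly.

```cpp
// Hypothetical helper: true if an Open MPI version is one of the releases
// recommended in this thread (3.0.4+ on 3.0.x, 3.1.4+ on 3.1.x, 4.0.1+).
// Assumption: release lines after 4.0.x (4.1.x, 5.x, ...) count as "newer".
constexpr bool isRecommendedOpenMpi(int major, int minor, int patch) {
    return (major == 3 && minor == 0 && patch >= 4)
        || (major == 3 && minor == 1 && patch >= 4)
        || (major == 4 && minor == 0 && patch >= 1)
        || (major == 4 && minor >= 1)
        || (major >= 5);
}

// The version from the report is outside the recommended set.
static_assert(!isRecommendedOpenMpi(2, 1, 6), "2.1.6 predates the fixes");
static_assert(isRecommendedOpenMpi(3, 1, 4), "first recommended 3.1.x release");
```

The three integers would have to come from the MPI installation in use, for example from the version reported in the PIConGPU startup output.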

Alternatively, as a quick test, check whether export OMPI_MCA_io=^ompio mitigates the issue, if it isn't already in your .tpl file.

That said, this is probably a logic bug in recent changes; the best approach is to compile with debug symbols and find the exact line in the code where this is thrown.
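Besides a debugger, one way to narrow down the throwing line is to temporarily route suspicious accesses through a wrapper that records its call site. The following is a sketch under that assumption only: checkedAt and CHECKED_AT are hypothetical names, not PIConGPU or libSplash APIs, and the actual fix in this thread was found by debugging, not with this helper.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical debugging helper, not a PIConGPU API: bounds-checked access
// that puts the caller's file and line into the exception message.
template <typename T>
const T& checkedAt(const std::vector<T>& v, std::size_t n,
                   const char* file, int line) {
    if (n >= v.size()) {
        std::ostringstream msg;
        msg << file << ":" << line << ": index " << n
            << " out of range for vector of size " << v.size();
        throw std::out_of_range(msg.str());
    }
    return v[n];
}

// Captures the source location of the call site automatically.
#define CHECKED_AT(v, n) checkedAt((v), (n), __FILE__, __LINE__)
```

Replacing a call such as v.at(i) with CHECKED_AT(v, i) keeps the same exception type, so existing handlers still apply, while the message now names the file and line of the offending access.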

@finnolec
Contributor Author

finnolec commented Aug 27, 2019

Alternatively, as a quick test, check whether export OMPI_MCA_io=^ompio mitigates the issue, if it isn't already in your .tpl file.

I'm using the tpl files from the same branch and this line is included in them. So this does not mitigate the issue. But I'll also look into the other issues.

@ax3l
Member

ax3l commented Aug 27, 2019

Yes, I think it's likely another issue. Nevertheless, I just recalled that export OMPI_MCA_io=^ompio will actually only work around one of the two issues that were fixed in the cited releases.

Feel free to post the line that throws this as soon as you find it.

@sbastrakov
Member

sbastrakov commented Aug 28, 2019

@ax3l you are right, it's a recently introduced bug. @finnolec and I debugged it today, #3038 should fix it.

@sbastrakov
Member

Fixed by #3038.
