[BUG] Bad allocation warning #113

Closed
jasmin-guven opened this issue Oct 17, 2023 · 22 comments
Labels
bug Something isn't working

Comments

@jasmin-guven

Describe the bug
I'm running somd-freenrg with a slurm submission script, and my runs seem to complete only 1 cycle (according to somd.out). Looking at somd.err, I only see a UserWarning, and slurm does not produce any errors.

To Reproduce
I've attached one of my lambda run directories to the issue with all my files.

  1. Run the code (from the attached submission script):
    $BSS_HOME/somd-freenrg -C ./somd.cfg -l $lambda -c ./somd.rst7 -t ./somd.prm7 -m ./somd.pert -p CUDA 1> ./somd.out 2> ./somd.err
  2. This is the exception that was raised:
Traceback (most recent call last):
  File "/home/jguven/Software/miniconda3/envs/obss/share/Sire/scripts/somd-freenrg.py", line 161, in <module>
    OpenMMMD.runFreeNrg(params)
  File "/home/jguven/Software/miniconda3/envs/obss/lib/python3.10/site-packages/sire/legacy/Tools/__init__.py", line 175, in inner
    retval = func()
  File "/home/jguven/Software/miniconda3/envs/obss/lib/python3.10/site-packages/sire/legacy/Tools/OpenMMMD.py", line 3041, in runFreeNrg
    Sire.Stream.save([system, moves], restart_file.val)
UserWarning: std::bad_alloc

Expected behavior
All cycles should finish normally, and the bad_alloc warning should not appear in somd.err.

Input files
scripts_for_issue.zip
issue.zip

System information (please complete the following information):

  • OS: Linux
  • Version of Python: 3.10
  • Version of sire: 2023.3.2
  • I confirm that I have checked this bug still exists in the latest released version of sire: [no]
jasmin-guven added the bug label on Oct 17, 2023
jasmin-guven changed the title from "[BUG]" to "[BUG] Bad allocation warning" on Oct 17, 2023
@lohedges
Contributor

It's working fine for me with the latest version of Sire.

...
###====================somd-freenrg run=======================###
Starting somd-freenrg run...
4000 moves 5 cycles, 40 ps simulation time

Cycle =  1

Backing up previous restart
Saving new restart

Cycle =  2

Backing up previous restart
Saving new restart

...

@lohedges
Contributor

(Note that I shortened the number of moves for debugging purposes.)

@jasmin-guven
Author

Hi Lester,

Can I just ask whether you used the SLURM script to run it, or did you run it from the command line?

I will also try it with the latest version.

@lohedges
Contributor

I just ran directly from the command line using the somd-freenrg command in the SLURM script, running within the issue directory.

@chryswoods
Contributor

std::bad_alloc exceptions tend to be raised when the operating system can't allocate the memory that is being requested, so I suspect that the compute node ran out of memory. The Sire.Stream.save call uses quite a bit of memory, as the entire molecular system has to be copied into a large binary buffer, which is then copied again as it is compressed and saved to disk.

It may be worth checking the amount of memory consumed by your job on your cluster, or trying to run the job again while asking for more memory from the scheduler.
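
If it's useful, one way to check the peak memory of a finished job is via SLURM's accounting tools (illustrative only; the available fields and helpers depend on how accounting is set up on your cluster, and <jobid> is a placeholder):

# Peak resident memory (MaxRSS) recorded by SLURM accounting for the job
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed,State

# If the seff helper is installed on your cluster, it prints a readable
# summary of the job's memory and CPU efficiency
seff <jobid>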

@chryswoods
Contributor

Have you been able to see if this was caused by running out of memory?

@jasmin-guven
Author

Hi @chryswoods, sorry for not updating you sooner; I've been waiting for some of the calculations to run to see if the error was still happening.

What I did was check my slurm configuration, and after making sure that wasn't the issue, I increased the number of cycles to 20 (rather than 5, which I had chosen to improve efficiency), and that seems to have fixed the bad allocation error. The only problem with this is that it has increased my computation time by almost a factor of two.

I was wondering if it would make sense to raise the bad allocation as an actual error rather than a UserWarning?

@jasmin-guven
Author

I also realised that my Python is 3.10, which I believe is an older version going by the installation instructions. I will install the correct version and see if this also fixes things.

@chryswoods
Contributor

Thanks - yes, I agree that the naming of UserWarning is not helpful. It is the standard exception raised by the Python wrapping library that we use when the underlying exception is not recognised.

I'll look to translate a bad_alloc into a more meaningful exception, and will add some help text that says that this may be caused by running out of memory.

Your Python 3.10 is fine. We support Python versions 3.9, 3.10 and 3.11, and the version of Python shouldn't have any impact on this bug. We aim to support the last three major releases of Python, i.e. next year we will transition to supporting 3.10, 3.11 and 3.12.

@jasmin-guven
Author

Thanks, that would be really helpful!

@jasmin-guven
Author

Hi, just as an update: I tried rerunning the runs on a different computer (this time using an HPC), but I got the following somd error:

QBuffer::writeData: Memory allocation error

I spoke to Anna Herz and she suggested changing ncycles to 4 so that there is one cycle per nanosecond of simulation. I'm currently testing this out both on my workstation and on an HPC.

@jasmin-guven
Author

I've just tried it again with the below somd.cfg options and again get the bad allocation warning:

save coordinates = True
ncycles = 4
nmoves = 500000
ncycles_per_snap = 1
buffered coordinates frequency = 200
timestep = 2.00 femtosecond
reaction field dielectric = 78.3
cutoff type = cutoffperiodic
cutoff distance = 10 angstrom
barostat = True
pressure = 1.00000 atm
thermostat = True
temperature = 300.00 kelvin
inverse friction = 1.00000 picosecond
minimise = False
constraint = hbonds-notperturbed
energy frequency = 250
lambda array = 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
lambda_val = 0.0
perturbed residue number = 17103

@lohedges
Contributor

If I'm reading this right, you are saving 10000 trajectory frames (500000 x 4 / 200), with 2500 buffered in memory each cycle. (I think it just stores the coordinates.) I'm not sure whether this is what's causing the memory to overflow.
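
For reference, a quick back-of-the-envelope check of those numbers, using the values from the somd.cfg above (plain shell arithmetic, nothing Sire-specific):

nmoves=500000; ncycles=4; buffer_freq=200   # "buffered coordinates frequency"
echo "frames buffered per cycle: $(( nmoves / buffer_freq ))"            # 2500
echo "total frames saved:        $(( nmoves / buffer_freq * ncycles ))"  # 10000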

@jasmin-guven
Author

@lohedges so would you suggest increasing the buffered coordinates frequency? If I understand correctly, that would decrease the amount of memory stored in each cycle?

I will try again with that, and also increase the number of cycles to 8.

@lohedges
Contributor

lohedges commented Dec 1, 2023

Well, I'd just try not saving any frames, or a minimal number. That would be an easy way to test whether it's this part of the code that's causing the problem.

@jasmin-guven
Author

I've tried again with the following config on my workstation:

save coordinates = True
ncycles = 8
nmoves = 250000
ncycles_per_snap = 1
buffered coordinates frequency = 250000
timestep = 2.00 femtosecond
reaction field dielectric = 78.3
cutoff type = cutoffperiodic
cutoff distance = 10 angstrom
barostat = True
pressure = 1.00000 atm
thermostat = True
temperature = 300.00 kelvin
inverse friction = 1.00000 picosecond
minimise = False
constraint = hbonds-notperturbed
energy frequency = 250
lambda array = 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
lambda_val = 0.0
perturbed residue number = 17103

And got the bad_alloc error again:

Traceback (most recent call last):
  File "/home/jguven/Software/miniconda3/envs/obss/share/Sire/scripts/somd-freenrg.py", line 224, in <module>
    OpenMMMD.runFreeNrg(params)
  File "/home/jguven/Software/miniconda3/envs/obss/lib/python3.9/site-packages/sire/legacy/Tools/__init__.py", line 175, in inner
    retval = func()
  File "/home/jguven/Software/miniconda3/envs/obss/lib/python3.9/site-packages/sire/legacy/Tools/OpenMMMD.py", line 3043, in runFreeNrg
    Sire.Stream.save([system, moves], restart_file.val)
UserWarning: std::bad_alloc

@chryswoods
Contributor

Could you set buffered coordinates frequency to 0? This will disable buffering. The bad_alloc is raised because something is consuming a lot of memory and the code is not able to allocate any more.

You could also add

minimal coordinate saving = True

to tell somd to only save coordinates for lambda=0 and lambda=1.

I have (in another branch that will be merged into 2023.5.0) changed the warning into an error that says that the code is using too much memory.

@chryswoods
Contributor

Just to add that PR #134 includes a fix for the wrapping of bad_alloc so that it is translated into a Python MemoryError and gives a more meaningful error message. This will be included in the 2023.5.0 release which we hope to get out before Christmas.

@jasmin-guven
Author

Hi, sorry again for the long wait. I ran the job with the below config options on the HPC and still get the bad_alloc error in somd.err.

I did set minimal coordinate saving to True, but do I also have to turn off save coordinates?

save coordinates = True
ncycles = 4
nmoves = 500000
ncycles_per_snap = 1
buffered coordinates frequency = 0
timestep = 2.00 femtosecond
reaction field dielectric = 78.3
cutoff type = cutoffperiodic
cutoff distance = 10 angstrom
barostat = True
pressure = 1.00000 atm
thermostat = True
temperature = 300.00 kelvin
inverse friction = 1.00000 picosecond
minimise = False
constraint = hbonds-notperturbed
energy frequency = 250
lambda array = 0.0, 0.0667, 0.1333, 0.2, 0.2667, 0.3333, 0.4, 0.4667, 0.5333, 0.6, 0.6667, 0.7333, 0.8, 0.8667, 0.9333, 1.0
lambda_val = 0.0
perturbed residue number = 17103
minimal coordinate saving = True

@chryswoods
Contributor

chryswoods commented Dec 11, 2023

Yes, you could try completely disabling coordinate saving by turning off save coordinates. However, I am surprised that you are running out of memory. You have a small(ish) number of atoms in your system (~54k), so memory shouldn't be an issue.

Are you able to control how much memory you are requesting on your cluster via your slurm script? How much are you requesting? I would expect you to need at least 4-16 GB to be able to run this job. I took a look at your slurm script and I couldn't see where you request the amount of memory you want to use. I know that on some clusters, if you don't specify, you can end up with very small amounts, e.g. 1-2 GB.
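
As a rough sketch (the exact directive and a sensible value depend on your cluster's configuration), adding a line like this near the top of the submission script would make the memory request explicit, with the rest of the script, including the somd-freenrg command shown above, left unchanged:

#SBATCH --mem=16G   # ask SLURM for 16 GB of RAM for the job; adjust as needed
# (some clusters prefer a per-CPU request instead, e.g. #SBATCH --mem-per-cpu=4G)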

@chryswoods
Contributor

Were you able to run the job on your cluster? Let us know if everything is ok. We will automatically close the issue if there's no update by the end of the month.

@chryswoods
Contributor

Closing due to inactivity - please feel free to reopen if you still need help.
