[BUG] Bad allocation warning #113

Closed
jasmin-guven opened this issue Oct 17, 2023 · 22 comments
Labels
bug Something isn't working

Comments

@jasmin-guven

Describe the bug
I'm running somd-freenrg with a slurm submission script, and my runs seem to complete only 1 cycle (according to somd.out). Looking at somd.err, I only see a UserWarning, and slurm does not produce any errors.

To Reproduce
I've attached one of my lambda run directories to the issue with all my files.

  1. Run the code (from the attached submission script):
    $BSS_HOME/somd-freenrg -C ./somd.cfg -l $lambda -c ./somd.rst7 -t ./somd.prm7 -m ./somd.pert -p CUDA 1> ./somd.out 2> ./somd.err
  2. This is the exception that was raised:
Traceback (most recent call last):
  File "/home/jguven/Software/miniconda3/envs/obss/share/Sire/scripts/somd-freenrg.py", line 161, in <module>
    OpenMMMD.runFreeNrg(params)
  File "/home/jguven/Software/miniconda3/envs/obss/lib/python3.10/site-packages/sire/legacy/Tools/__init__.py", line 175, in inner
    retval = func()
  File "/home/jguven/Software/miniconda3/envs/obss/lib/python3.10/site-packages/sire/legacy/Tools/OpenMMMD.py", line 3041, in runFreeNrg
    Sire.Stream.save([system, moves], restart_file.val)
UserWarning: std::bad_alloc

Expected behavior
All cycles should finish normally, and the bad_alloc warning should not appear in somd.err.

Input files
scripts_for_issue.zip
issue.zip

System information (please complete the following information):

  • OS: Linux
  • Version of Python: 3.10
  • Version of sire: 2023.3.2
  • I confirm that I have checked this bug still exists in the latest released version of sire: [no]
jasmin-guven added the bug label on Oct 17, 2023
jasmin-guven changed the title from "[BUG]" to "[BUG] Bad allocation warning" on Oct 17, 2023
@lohedges
Contributor

It's working fine for me with the latest version of Sire.

...
###====================somd-freenrg run=======================###
Starting somd-freenrg run...
4000 moves 5 cycles, 40 ps simulation time

Cycle =  1

Backing up previous restart
Saving new restart

Cycle =  2

Backing up previous restart
Saving new restart

...

@lohedges
Contributor

(Note that I shortened the number of moves for debugging purposes.)

@jasmin-guven
Author

Hi Lester,

Can I just ask whether you used the SLURM script to run it, or did you run it from the command line?

I will also try it with the latest version.

@lohedges
Contributor

I just ran directly from the command line using the somd-freenrg command in the SLURM script, running within the issue directory.

@chryswoods
Contributor

std::bad_alloc exceptions tend to be raised when the operating system can't allocate the memory that is being requested, so I suspect that the compute node ran out of memory. The Sire.Stream.save call uses quite a bit of memory, as the entire molecular system has to be copied into a large binary buffer, which is then copied again as it is compressed and saved to disk.

It may be worth checking the amount of memory consumed by your job on your cluster, or trying to run the job again while asking for more memory from the scheduler.
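
If it's useful, one way to check the peak memory of a finished job is via SLURM's accounting tools (illustrative only; the available fields and helpers depend on how accounting is set up on your cluster, and <jobid> is a placeholder):

# Peak resident memory (MaxRSS) recorded by SLURM accounting for the job
sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed,State

# If the seff helper is installed on your cluster, it prints a readable
# summary of the job's memory and CPU efficiency
seff <jobid>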

@chryswoods
Contributor

Have you been able to see if this was caused by running out of memory?

@jasmin-guven
Author

Hi @chryswoods, sorry for not updating you sooner; I've been waiting for some of the calculations to run to see if the error was still happening.

What I did was check my slurm configuration, and after making sure that wasn't the issue, I increased the number of cycles to 20 (rather than 5, which I had chosen to improve efficiency), and that seems to have fixed the bad allocation error. The only problem with this is that it has increased my computation time by almost a factor of two.

I was wondering if it would make sense to raise the bad allocation as an actual error rather than a UserWarning?

@jasmin-guven
Author

I also realised that my Python is 3.10, which I believe is an older version going by the installation instructions. I will install the correct version and see if this also fixes things.

@chryswoods
Contributor

Thanks - yes, I agree that the naming of UserWarning is not helpful. It is the standard exception raised by the Python wrapping library that we use when the underlying exception is not recognised.

I'll look to translate a bad_alloc into a more meaningful exception, and will add some help text that says that this may be caused by running out of memory.

Your Python 3.10 is fine. We support Python versions 3.9, 3.10 and 3.11, and the version of Python shouldn't have any impact on this bug. We aim to support the last three major releases of Python, i.e. next year we will transition to supporting 3.10, 3.11 and 3.12.

@jasmin-guven
Author

Thanks, that would be really helpful!

@jasmin-guven
Author

Hi, just as an update: I tried rerunning the runs on a different computer (this time using an HPC), but I got the following somd error:

QBuffer::writeData: Memory allocation error

I spoke to Anna Herz and she suggested changing ncycles to 4 so that there is one cycle per nanosecond of simulation. I'm currently testing this out both on my workstation and on an HPC.

@jasmin-guven
Author

I've just tried it again with the below somd.cfg options and again get the bad allocation warning:

save coordinates = True
ncycles = 4
nmoves = 500000
ncycles_per_snap = 1
buffered coordinates frequency = 200
timestep = 2.00 femtosecond
reaction field dielectric = 78.3
cutoff type = cutoffperiodic
cutoff distance = 10 angstrom
barostat = True
pressure = 1.00000 atm
thermostat = True
temperature = 300.00 kelvin
inverse friction = 1.00000 picosecond
minimise = False
constraint = hbonds-notperturbed
energy frequency = 250
lambda array = 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
lambda_val = 0.0
perturbed residue number = 17103

@lohedges
Contributor

If I'm reading this right, you are saving 10000 trajectory frames (500000 x 4 / 200), with 2500 buffered in memory each cycle. (I think it just stores the coordinates.) I'm not sure whether this is what's causing the memory to overflow.
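
For reference, a quick back-of-the-envelope check of those numbers, using the values from the somd.cfg above (plain shell arithmetic, nothing Sire-specific):

nmoves=500000; ncycles=4; buffer_freq=200   # "buffered coordinates frequency"
echo "frames buffered per cycle: $(( nmoves / buffer_freq ))"            # 2500
echo "total frames saved:        $(( nmoves / buffer_freq * ncycles ))"  # 10000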

@jasmin-guven
Author

@lohedges so would you suggest increasing the buffered coordinates frequency? If I understand correctly, that would decrease the amount of memory stored in each cycle?

I will try again with that, and also increase the number of cycles to 8.

@lohedges
Contributor

lohedges commented Dec 1, 2023

Well, I'd just try not saving any frames, or a minimal number. That would be an easy way to test whether it's this part of the code that's causing the problem.

@jasmin-guven
Author

I've tried again with the following config on my workstation:

save coordinates = True
ncycles = 8
nmoves = 250000
ncycles_per_snap = 1
buffered coordinates frequency = 250000
timestep = 2.00 femtosecond
reaction field dielectric = 78.3
cutoff type = cutoffperiodic
cutoff distance = 10 angstrom
barostat = True
pressure = 1.00000 atm
thermostat = True
temperature = 300.00 kelvin
inverse friction = 1.00000 picosecond
minimise = False
constraint = hbonds-notperturbed
energy frequency = 250
lambda array = 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0
lambda_val = 0.0
perturbed residue number = 17103

And got the bad_alloc error again:

Traceback (most recent call last):
  File "/home/jguven/Software/miniconda3/envs/obss/share/Sire/scripts/somd-freenrg.py", line 224, in <module>
    OpenMMMD.runFreeNrg(params)
  File "/home/jguven/Software/miniconda3/envs/obss/lib/python3.9/site-packages/sire/legacy/Tools/__init__.py", line 175, in inner
    retval = func()
  File "/home/jguven/Software/miniconda3/envs/obss/lib/python3.9/site-packages/sire/legacy/Tools/OpenMMMD.py", line 3043, in runFreeNrg
    Sire.Stream.save([system, moves], restart_file.val)
UserWarning: std::bad_alloc

@chryswoods
Contributor

Could you set buffered coordinates frequency to 0? This will disable buffering. The bad_alloc is raised because something is consuming a lot of memory and the code is not able to allocate any more.

You could also add

minimal coordinate saving = True

to tell somd to only save coordinates for lambda=0 and lambda=1.

I have (in another branch that will be merged into 2023.5.0) changed the warning into an error that says that the code is using too much memory.

@chryswoods
Contributor

Just to add that PR #134 includes a fix for the wrapping of bad_alloc so that it is translated into a Python MemoryError and gives a more meaningful error message. This will be included in the 2023.5.0 release which we hope to get out before Christmas.

@jasmin-guven
Author

Hi, sorry again for the long wait. I ran the job with the below config options on the HPC and still get the bad_alloc error in somd.err.

I did set minimal coordinate saving to True, but do I also have to turn off save coordinates?

save coordinates = True
ncycles = 4
nmoves = 500000
ncycles_per_snap = 1
buffered coordinates frequency = 0
timestep = 2.00 femtosecond
reaction field dielectric = 78.3
cutoff type = cutoffperiodic
cutoff distance = 10 angstrom
barostat = True
pressure = 1.00000 atm
thermostat = True
temperature = 300.00 kelvin
inverse friction = 1.00000 picosecond
minimise = False
constraint = hbonds-notperturbed
energy frequency = 250
lambda array = 0.0, 0.0667, 0.1333, 0.2, 0.2667, 0.3333, 0.4, 0.4667, 0.5333, 0.6, 0.6667, 0.7333, 0.8, 0.8667, 0.9333, 1.0
lambda_val = 0.0
perturbed residue number = 17103
minimal coordinate saving = True

@chryswoods
Contributor

chryswoods commented Dec 11, 2023

Yes, you could try completely disabling coordinate saving by turning off save coordinates. However, I am surprised that you are running out of memory. You have a small(ish) number of atoms in your system (~54k), so memory shouldn't be an issue.

Are you able to control how much memory you are requesting on your cluster via your slurm script? How much are you requesting? I would expect you to need at least 4-16 GB to be able to run this job. I took a look at your slurm script and I couldn't see where you request the amount of memory you want to use. I know that on some clusters, if you don't specify, you can end up with very small amounts, e.g. 1-2 GB.
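
As a rough sketch (the exact directive and a sensible value depend on your cluster's configuration), adding a line like this near the top of the submission script would make the memory request explicit, with the rest of the script, including the somd-freenrg command shown above, left unchanged:

#SBATCH --mem=16G   # ask SLURM for 16 GB of RAM for the job; adjust as needed
# (some clusters prefer a per-CPU request instead, e.g. #SBATCH --mem-per-cpu=4G)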

@chryswoods
Contributor

Were you able to run the job on your cluster? Let us know if everything is ok. We will automatically close the issue if there's no update by the end of the month.

@chryswoods
Contributor

Closing due to inactivity - please feel free to reopen if you still need help.
