[BUG] SOMD2 Crashes When Saving A Large System #60
Comments
Thanks, @akalpokas. This seems to be chugging away for me on my laptop. (I've finished 5 blocks without issue.) A few observations:
Some quick questions:
My initial thought is that there is some issue with the Sire trajectory cache. We noticed some issues, which are logged here, and have made some adjustments. Another thought is parallelisation in Sire: by default it will use TBB to do certain operations in parallel. You can restrict the number of threads with the relevant option. Cheers,
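For what it's worth, a minimal sketch of the thread-restriction idea above (the exact option is missing from this comment, so the environment variable name below is a placeholder used purely for illustration, not a confirmed Sire setting):

```python
import os

# Placeholder: whatever option Sire actually exposes for limiting its TBB
# thread count should go here; "SIRE_MAX_THREADS" is an assumed name. Set it
# before importing sire so any thread pools are created with the reduced count.
os.environ.setdefault("SIRE_MAX_THREADS", "1")

import sire as sr  # noqa: E402  (import after the environment is configured)

# Re-run the failing calculation with parallelism effectively disabled to
# check whether the crash is related to threaded operations.
mols = sr.load("4S-4R8protb.bss")
```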
I've just checked and the …
Hi @lohedges, thanks for your help with this. The system tested above crashes at the first checkpoint if the time frequency is big enough (it doesn't crash at smaller values). Since I can run this with very frequent checkpointing, I will check with our collaborator (they brought this issue to me initially) and hopefully this will fix it for them. Thanks again for your help!
Just to follow up on this, now that the other calculations have finished on my machine, I tried re-running this as a stand-alone calculation (using 4 GPUs simultaneously, and also just using 1). It still crashes at the same checkpointing frequency.
Very strange. The trajectory PageCache code should have meant that all trajectory data was going to disk and you shouldn't see the memory on the machine increasing. You can control how much data the cache will hold in memory. By default it will hold a maximum of 32 pages in memory, each of which is a maximum of 8 MB (i.e. there should only be a maximum of 256 MB of data in memory, with the rest on disk).

You can debug this by using the getStatistics function. This is a static function that shows all of the cache statistics. Just print this string (the PageCache class is exposed to Python and can also be used to cache Python objects).
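For reference, a minimal sketch of printing those statistics from Python; the existence of PageCache and its static getStatistics() function is stated above, but the exact attribute path used here is an assumption:

```python
import sire as sr

# Assumption: PageCache is reachable from the Python layer, e.g. under
# sire.base. The lookup below is deliberately defensive because the exact
# module path may differ between Sire versions.
PageCache = getattr(getattr(sr, "base", sr), "PageCache", None)

if PageCache is not None:
    # getStatistics() is a static function that returns a string summarising
    # all of the cache statistics, so we can simply print it.
    print(PageCache.getStatistics())
else:
    print("PageCache not found at this path; check the Sire API docs.")
```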
Having added a bunch of debug logging calls to SOMD2 and having made a minimal reproducible example with sire, I think I have somewhat narrowed down the problem. Both in SOMD2 and in sire, I can see that the crash occurs not necessarily when the system tries to checkpoint and starts to write files to disk, but when the production d.run(...) call below wraps up.

The minimal sire code:

```python
import sire as sr

timestep = "2fs"
energy_frequency = "1ps"
constraint = "h-bonds-not-perturbed"
perturbable_constraint = "h-bonds-not-perturbed"
cutoff_type = "PME"
equil_time = "1ps"
run_time = "10ps"

lambda_values = [x / 100.0 for x in range(0, 101, 10)]

mols = sr.load("4S-4R8protb.bss")

for mol in mols.molecules("molecule property is_perturbable"):
    mol = sr.morph.link_to_reference(mol)
    mol = sr.morph.repartition_hydrogen_masses(mol, mass_factor=1.5)
    mols.update(mol)

mols = mols.minimisation(cutoff_type=cutoff_type).run().commit()

for i, lambda_value in enumerate(lambda_values):
    print(f"Simulating lambda={lambda_value:.2f}")

    # minimise the system at this lambda value
    min_mols = (
        mols.minimisation(
            cutoff_type=cutoff_type,
            lambda_value=lambda_value,
            constraint=constraint,
            perturbable_constraint="none",
        )
        .run()
        .commit()
    )

    # create a dynamics object for the system
    d = min_mols.dynamics(
        timestep=timestep,
        temperature="25oC",
        cutoff_type=cutoff_type,
        lambda_value=lambda_value,
        constraint=constraint,
        perturbable_constraint=perturbable_constraint,
    )

    # generate random velocities
    d.randomise_velocities()

    # equilibrate, not saving anything
    d.run(equil_time, save_frequency=0)

    print("Equilibration complete")
    print(d)

    # get the values of lambda for neighbouring windows
    lambda_windows = lambda_values[max(i - 1, 0) : min(len(lambda_values), i + 2)]

    # run the dynamics, saving the energy every 1 ps but no trajectory frames
    # Crash HERE after this block finishes
    # --------------------------------------------------------------------------------------------------
    d.run(
        run_time,
        energy_frequency=energy_frequency,
        frame_frequency=0,
        lambda_windows=lambda_windows,
    )
    # --------------------------------------------------------------------------------------------------

    print("Dynamics complete")
    print(d)

    # stream the EnergyTrajectory to a sire save stream object
    sr.stream.save(
        d.commit().energy_trajectory(), f"energy_{lambda_value:.2f}.s3"
    )
```

Whatever wrap-up happens when that d.run(...) block finishes seems to be what triggers the crash.
Hmm, interesting. As mentioned here, we have also been seeing frequent crashes on interpreter exit, even for simple dynamics simulations. I wonder if they are related.
I can confirm that your reproducer also crashes for me. I'll try to figure out where the crash is occurring within the final exit block.
I've dug into the code and can confirm that it is crashing when saving the final trajectory frame here. The final block is processed in the main thread, as can be seen here. If I pass … I'll try to figure out why the crash is occurring, since frames shouldn't be saved when you are specifying a frame frequency of zero.
If I set a non-zero …
I can confirm that the crash happens when converting the frame to a byte array here. From reading around, it looks like …
Ah, hadn't realised that …
Okay, I thought I'd fixed the bug. The …
This is definitely related to the size of the system: I made another test system which is just ethane --> methanol in a large water box of around 100K molecules. It crashes at the end of the dynamics, during the frame saving, just like the system above. If I make ethane --> methanol in a smaller water box of 4K molecules, it runs fine.
Ah yes, I've found more uses of …
Yes, I'm sure that whatever the bug is, it's being triggered by the system being large. The exact line that it crashes at is here. Just trying to figure out why.
Okay, I've fixed it. The issue was the way in which the underlying QByteArray was being constructed:

```cpp
QByteArray data("\0", nbytes);
```

According to the documentation here, this constructs a byte array containing the first nbytes bytes of the data array.

This is not what we want. The source array is a single character, so we are overflowing it with the size argument. We actually want to use the other constructor, i.e.:

```cpp
QByteArray data(nbytes, '\0');
```

From the docs, this constructs a byte array of size nbytes with every byte set to '\0'.
With this I get no crashes for your system, although I still see the other issue on interpreter exit. I'll open a Sire issue and a branch to fix it. Not sure why this hasn't caused issues before, since any sensible number of bytes would have overflowed the input data array.
Should now be fixed here. Thanks for reporting!
I have recompiled sire with your fix @lohedges and I can confirm that it works now. Thanks a lot for your quick help, as always! It is interesting that on rare occasions it would still run properly even with the broken QByteArray constructor.
Yes, I'm surprised that this issue wasn't triggered more easily. It probably did cause some kind of memory corruption, but somehow avoided a full-blown segmentation fault.
Thanks for debugging and catching that - quite a silly bug. I don't know what I was thinking when writing that ;-)
Describe the bug
SOMD2 seems to crash when trying to save a relatively large system (120K atoms, although a partially desolvated system of 70K atoms seems to crash as well). The reason I suspect this might be due to system size is that the crash seems to occur whenever the system hits either the --frame-frequency or --checkpoint-frequency evaluation time, which means it crashes during the saving process. I was also told that the free leg of this perturbation runs fine, again hinting that this might be a system size issue. The third hint is the way that SOMD2 crashes: it doesn't crash for a specific lambda window (e.g. 0.125) and then restart; it crashes every single process simultaneously, which from my quick googling points towards some sort of out-of-memory issue (although I don't see the memory disappear during the saving process on my machine). I tested and observed this across multiple versions of sire and somd2, with the most recent one installed today (see the details below).
I appreciate that this might be a bit more of a sire issue than a SOMD2 one (I'm happy to migrate it there). The reason I am raising it here is that it's hard to figure out what happened from the error message, so I am wondering if there is scope to catch this kind of issue in the future and alert the user in a more specific way.
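As a rough illustration of what such a guard might look like (this is not SOMD2's actual code, and the checkpoint-writing helper below is hypothetical; it would only help for failures that surface as Python exceptions, not for a hard segfault like the one eventually found here):

```python
import sys

def write_checkpoint(system, path):
    """Hypothetical stand-in for the SOMD2 step that serialises a checkpoint."""
    raise MemoryError("simulated failure while serialising a large system")

def checkpoint_with_diagnostics(system, path, n_atoms):
    """Wrap the save step so large-system failures give an actionable message."""
    try:
        write_checkpoint(system, path)
    except (MemoryError, OSError) as exc:
        msg = (
            f"Checkpointing failed while saving '{path}' "
            f"(system size: {n_atoms} atoms). This may indicate an "
            f"out-of-memory or serialisation problem for large systems."
        )
        print(msg, file=sys.stderr)
        raise RuntimeError(msg) from exc

# Example usage with placeholder values:
try:
    checkpoint_with_diagnostics(system=None, path="checkpoint.s3", n_atoms=120_000)
except RuntimeError:
    pass  # the re-raised error now carries the extra context
```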
Exact error reported is:
To reproduce
Extract the provided zip file and run the SOMD2 input file with:
somd2 4S-4R8protb.bss --log-level debug --timestep 2fs --num-lambda 9 --runtime 4ns --equilibration-timestep 1fs --equilibration-time 1ps --frame-frequency 10ps --checkpoint-frequency 50ps
4S-4R8protb.zip
Environment information:
SOMD2 version: 0.1.dev451+gf0347a7.d20241025
Sire version: 2024.3.0.dev+2dc4c6e