
[BUG] SOMD2 Crashes When Saving A Large System #60

Closed
akalpokas opened this issue Oct 25, 2024 · 21 comments

@akalpokas
Contributor

Describe the bug

SOMD2 seems to crash when trying to save a relatively large system (120K atoms, although a partially desolvated system of 70K atoms crashes as well). I suspect system size because the crash occurs whenever the run hits either the --frame-frequency or --checkpoint-frequency evaluation time, i.e. it crashes during the saving process. I was also told that the free leg of this perturbation runs fine, which again hints at a system-size issue. The third hint is the way SOMD2 crashes: it doesn't fail at a specific lambda window and then restart at 0.125, it crashes every single process simultaneously, which from my quick googling points towards some sort of out-of-memory issue (although I don't see memory disappear during the saving process on my machine).

I tested and observed this across multiple versions of sire and somd2, with the most recent one installed today (see the details below).

I appreciate that this might be more of a sire issue than a SOMD2 one (I'm happy to migrate it there). The reason I am raising it here is that it's hard to figure out what happened from the error message, so I am wondering if there is scope to catch this kind of issue in the future and alert the user in a more specific way.

The exact error reported is:

2024-10-25 17:07:58.971 | INFO     | somd2.runner._runner:__init__:85 - somd2 version: 0.1.dev451+gf0347a7.d20241025
2024-10-25 17:07:59.048 | INFO     | somd2.runner._runner:__init__:86 - sire version: 2024.3.0.dev+2dc4c6e
2024-10-25 17:07:59.324 | INFO     | somd2._utils._ghosts:_boresch:71 - Applying Boresch modifications to ghost atom bonded terms.
2024-10-25 17:08:14.041 | WARNING  | somd2.runner._runner:__init__:186 - Charge difference between end states is not an integer.
2024-10-25 17:08:14.478 | INFO     | somd2.runner._runner:_repartition_h_mass:903 - Repartitioning hydrogen masses with factor 1.500
2024-10-25 17:08:16.886 | INFO     | somd2.runner._runner:_get_gpu_devices:809 - CUDA_VISIBLE_DEVICES set to ['2']
2024-10-25 17:08:16.886 | INFO     | somd2.runner._runner:_get_gpu_devices:824 - Number of GPUs available: 1
2024-10-25 17:08:24.580 | INFO     | somd2.runner._runner:run_window:1160 - Running λ = 0.0 on GPU 0
2024-10-25 17:08:59.583 | INFO     | somd2.runner._dynamics:_minimisation:242 - Minimising at λ = 0.0
2024-10-25 17:09:47.426 | INFO     | somd2.runner._dynamics:_equilibration:298 - Equilibrating at λ = 0.0
2024-10-25 17:10:26.603 | INFO     | somd2.runner._dynamics:_minimisation:267 - Minimising at λ = 0.0
2024-10-25 17:11:27.897 | INFO     | somd2.runner._dynamics:_run:387 - Running dynamics at λ = 0.0
2024-10-25 17:11:48.981 | ERROR    | somd2.runner._runner:run:1003 - Exception raised for λ = 0.0: A process in the process pool was terminated abruptly while the future was running or pending.
2024-10-25 17:11:48.982 | ERROR    | somd2.runner._runner:run:1003 - Exception raised for λ = 0.125: A process in the process pool was terminated abruptly while the future was running or pending.
2024-10-25 17:11:48.983 | ERROR    | somd2.runner._runner:run:1003 - Exception raised for λ = 0.25: A process in the process pool was terminated abruptly while the future was running or pending.
2024-10-25 17:11:48.983 | ERROR    | somd2.runner._runner:run:1003 - Exception raised for λ = 0.375: A process in the process pool was terminated abruptly while the future was running or pending.
2024-10-25 17:11:48.983 | ERROR    | somd2.runner._runner:run:1003 - Exception raised for λ = 0.5: A process in the process pool was terminated abruptly while the future was running or pending.
2024-10-25 17:11:48.984 | ERROR    | somd2.runner._runner:run:1003 - Exception raised for λ = 0.625: A process in the process pool was terminated abruptly while the future was running or pending.
2024-10-25 17:11:48.984 | ERROR    | somd2.runner._runner:run:1003 - Exception raised for λ = 0.75: A process in the process pool was terminated abruptly while the future was running or pending.
2024-10-25 17:11:48.984 | ERROR    | somd2.runner._runner:run:1003 - Exception raised for λ = 0.875: A process in the process pool was terminated abruptly while the future was running or pending.
2024-10-25 17:11:48.985 | ERROR    | somd2.runner._runner:run:1003 - Exception raised for λ = 1.0: A process in the process pool was terminated abruptly while the future was running or pending.

To reproduce

Extract the provided zip file and run SOMD2 on the input file with:

somd2 4S-4R8protb.bss --log-level debug --timestep 2fs --num-lambda 9 --runtime 4ns --equilibration-timestep 1fs --equilibration-time 1ps --frame-frequency 10ps --checkpoint-frequency 50ps

4S-4R8protb.zip


Environment information:
SOMD2 version: 0.1.dev451+gf0347a7.d20241025
Sire version: 2024.3.0.dev+2dc4c6e

@lohedges
Contributor

Thanks, @akalpokas.

This seems to be chugging away for me on my laptop. (I've finished 5 blocks without issue.) A few observations:

  • My laptop RAM usage seems fairly constant at about 10GB.
  • My GPU RAM usage is also constant at around 2GB.
  • Checkpoint files are about 27MB in size.
  • Trajectory chunks are about 8MB in size.

Some quick questions:

  • In your example above it looks like it's crashing at the first checkpoint. Is this correct? If so, is this always the case, or does it just crash at some checkpoint?
  • I see you are using a single GPU in the example, but selecting it with CUDA_VISIBLE_DEVICES=2, so presumably there is more than one on the system. Is anything else running at the same time?

My initial thought is that there is some issue with the Sire trajectory cache. We noticed some issues that are logged here and have adjusted the somd2 chunking to account for this. The system object never contains the trajectory when streamed, but it is still stored as an internal property. The cache should deal with this and offload things to disk as needed. However, I'm not sure where it offloads to, so I wonder if there are issues if the path it writes to is full, or not writeable. (Not sure if it uses the current directory, or TMPDIR. I'll look into this.)

Another thought is parallelisation in Sire. By default it will use TBB to do certain operations in parallel. You can restrict the number of threads with the SIRE_NUM_THREADS environment variable. It might be worth trying this in case the overhead of maintaining threads for a large system is causing issues, or if it's simply trying to use too many threads.
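
As an aside (an editor sketch, not part of the original comment), restricting the thread count from a Python driver script could look like the following; it assumes SIRE_NUM_THREADS is read when sire is first imported, so the variable is set beforehand:

# Minimal sketch: cap the number of TBB threads Sire uses.
# Assumption: the variable is read when sire is first imported,
# so it must be set before the import.
import os

os.environ["SIRE_NUM_THREADS"] = "4"  # illustrative value

import sire as sr  # imported after the variable is set on purpose

print(sr.__version__)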

Cheers,
Lester

@lohedges
Contributor

I've just checked and the PageCache object uses QTemporaryFile, which in turn uses QDir::tempPath(), which is defined here. This does use the TMPDIR environment variable, or /tmp if this isn't defined. If there's not much space on TMPDIR or /tmp, then you could try setting the environment variable to another path to see if it helps.
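
For illustration (an editor sketch, not from the original comment), redirecting the cache to a roomier scratch location from Python could look like this; the path is purely hypothetical and the variable must be set before sire (and hence Qt) is loaded:

# Minimal sketch: point temporary files at a larger scratch disk.
# The path is hypothetical; QDir::tempPath() falls back to /tmp when
# TMPDIR is unset, so set it before importing sire.
import os

scratch = "/scratch/somd2_cache"  # hypothetical directory with free space
os.makedirs(scratch, exist_ok=True)
os.environ["TMPDIR"] = scratch

import sire as sr  # the cache location is picked up from here on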

@akalpokas
Contributor Author

Hi @lohedges,

Thanks for your help with this. The system tested above crashes at the first checkpoint if the checkpoint frequency is large enough (it doesn't crash at 1 ps but does crash at 100 ps). I had other calculations running at the same time, and while I have ~56.2 GiB of available memory, there is only ~3.7 GiB of free memory when the SOMD2 calculation starts. Less frequent checkpointing presumably needs more memory in one go to save all of the dynamics information, and likely causes the machine to run out of memory while doing so. The issue is probably compounded by the fact that the system is quite a bit larger than the usual size we run FEP on.
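
To make the free-versus-available distinction concrete (an editor sketch, not something used in the thread), the two numbers can be watched from Python with psutil while a run is in progress:

# Editor sketch: sample "free" vs "available" system memory once a second.
# Requires the psutil package.
import time

import psutil

for _ in range(10):  # sample for ~10 seconds; adjust as needed
    mem = psutil.virtual_memory()
    print(
        f"free = {mem.free / 2**30:.1f} GiB, "
        f"available = {mem.available / 2**30:.1f} GiB"
    )
    time.sleep(1)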

Since I can run this with very frequent checkpointing, I will check with our collaborator (they brought up this issue to me initially) and hopefully this will fix it for them. Thanks again for your help!

@akalpokas
Contributor Author

Just to follow up on this, now that the other calculations have finished on my machine, I tried re-running this as a stand-alone calculation (using 4 GPUs simultaneously, and also just using 1). It still crashes with checkpointing frequencies of 100 ps, 10 ps, and 1 ps. I also tried changing the TMPDIR environment variable, but this unfortunately had no effect.

@chryswoods

Very strange. The trajectory PageCache code should have meant that all trajectory data was going to disk, so you shouldn't see the memory on the machine increasing. You can control how much data the cache will hold in memory. By default it will hold a maximum of 32 pages in memory, each of which is a maximum of 8MB (i.e. there should only ever be a maximum of 256MB of data in memory, with the rest on disk).

You can debug this by using the getStatistics function.

https://github.com/OpenBioSim/sire/blob/4fcc147db1afa4ff5c8e868484eaa18eda3d4ff2/corelib/src/libs/SireBase/pagecache.h#L106

This is a static function that shows all of the cache statistics. Just print this string. (The PageCache class is exposed to Python and can also be used to cache Python objects.)
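
For illustration, a hedged sketch of printing those statistics from Python; the exact module path of the binding is an assumption (the thread only confirms that the class is exposed to Python):

# Hedged sketch: print the global PageCache statistics during a run.
# Assumption: the class is wrapped in the legacy bindings as
# sire.legacy.Base.PageCache; adjust the import if the binding path differs.
from sire.legacy.Base import PageCache

# getStatistics() is static, so it reports on every cache in the process
# (pages held in memory versus pages spooled to the temporary directory).
print(PageCache.getStatistics())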

@akalpokas
Contributor Author

Having added a bunch of debug logging calls to SOMD2 and made a minimal reproducible example with sire, I think I have somewhat narrowed down the problem. Both in SOMD2 and in sire, the crash occurs not necessarily when the system tries to checkpoint and starts writing files to disk, but when the dynamics.run() call finishes executing (the amount of runtime it covers is controlled by the checkpoint frequency in SOMD2). With the small script below, which runs the input file for a short amount of time, the code consistently crashes with a segfault after the final dynamics block finishes running (before print("Dynamics complete") is called). When I disabled parallel running in SOMD2 I could see that it was segfaulting there too.

The minimal sire code:

import sire as sr

timestep = "2fs"
energy_frequency = "1ps"

constraint = "h-bonds-not-perturbed"
perturbable_constraint = "h-bonds-not-perturbed"

cutoff_type = "PME"

equil_time = "1ps"
run_time = "10ps"

lambda_values = [x / 100.0 for x in range(0, 101, 10)]

mols = sr.load("4S-4R8protb.bss")

for mol in mols.molecules("molecule property is_perturbable"):
    mol = sr.morph.link_to_reference(mol)
    mol = sr.morph.repartition_hydrogen_masses(mol, mass_factor=1.5)
    mols.update(mol)

mols = mols.minimisation(cutoff_type=cutoff_type).run().commit()


for i, lambda_value in enumerate(lambda_values):
    print(f"Simulating lambda={lambda_value:.2f}")
    # minimise the system at this lambda value
    min_mols = (
        mols.minimisation(
            cutoff_type=cutoff_type,
            lambda_value=lambda_value,
            constraint=constraint,
            perturbable_constraint="none",
        )
        .run()
        .commit()
    )

    # create a dynamics object for the system
    d = min_mols.dynamics(
        timestep=timestep,
        temperature="25oC",
        cutoff_type=cutoff_type,
        lambda_value=lambda_value,
        constraint=constraint,
        perturbable_constraint=perturbable_constraint,
    )

    # generate random velocities
    d.randomise_velocities()

    # equilibrate, not saving anything
    d.run(equil_time, save_frequency=0)
    print("Equilibration complete")
    print(d)

    # get the values of lambda for neighbouring windows
    lambda_windows = lambda_values[max(i - 1, 0) : min(len(lambda_values), i + 2)]

    # run the dynamics, saving the energy at the requested energy_frequency (1 ps)

    # Crash HERE after this block finishes
    # --------------------------------------------------------------------------------------------------
    d.run(
        run_time,
        energy_frequency=energy_frequency,
        frame_frequency=0,
        lambda_windows=lambda_windows,
    )
    # --------------------------------------------------------------------------------------------------
    print("Dynamics complete")
    print(d)

    # stream the EnergyTrajectory to a sire save stream object
    sr.stream.save(
        d.commit().energy_trajectory(), f"energy_{lambda_value:.2f}.s3"
    )

Whatever wrap-up happens when dynamics.run() finishes running is what causes the code to crash. I do remember seeing this happen for other systems a while ago, and the interesting thing is that it occasionally runs fine. From my quick testing it crashes approximately 23 out of 25 times.
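
One way to squeeze a little more information out of a crash like this (an editor sketch, not something used in the thread) is to enable Python's faulthandler at the top of the reproducer above, so the interpreter dumps the Python-level stack when the native code segfaults:

# Editor sketch: prepend to the minimal script above so that a segfault
# inside the compiled extension at least reports where Python was.
import sys
import faulthandler

faulthandler.enable(file=sys.stderr, all_threads=True)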

@lohedges
Contributor

Hmm, interesting. As mentioned here we have also been seeing frequent crashes on interpreter exit, even for simple dynamics simulations. I wonder if they are related.

@lohedges
Contributor

I can confirm that your reproducer also crashes for me. I'll try to figure out where the crash is occurring within the final exit block.

@lohedges
Contributor

I've dug into the code and can confirm that it is crashing when saving the final trajectory frame here. The final block is processed in the main thread, as can be seen here. If I pass state_has_cv=(False, False), then the final frame saving is skipped and everything works fine.

I'll try to figure out why the crash is occurring, since frames shouldn't be saved at all given that you are specifying a frame frequency of zero.

@lohedges
Contributor

If I set a non-zero frame_frequency, then it just crashes the first time it tries to save a frame, rather than at the end, so it's just the frame saving that's causing the issue, rather than a frame being saved when it's not meant to be.

@lohedges
Contributor

I can confirm that the crash happens when converting the frame to a byte array here. From reading around it looks like QByteArray is limited to 2GB in size, so this is probably the issue. (You also need this amount of contiguous memory available on the system.) If so, then we probably need some way of splitting large frames and writing to the cache in chunks.

@lohedges
Contributor

Ah, hadn't realised that frame.toByteArray was our own function. I'll investigate where it crashes in there.

@lohedges
Contributor

Okay, I thought I'd fixed the bug. SireMol::Frame::toByteArray incorrectly uses .count() to determine the number of bytes in a QByteArray object here; that method is used to count the occurrences of a byte. Instead, .size() should be used. (We actually had this same bug in one of the parsers.) However, this doesn't fix the crashes. I'll take a look to see if .count() has been used incorrectly elsewhere.

@akalpokas
Contributor Author

This is definitely related to the size of the system. I made another test system which is just ethane-->methanol in a large water box of around 100K molecules, and it crashes at the end of the dynamics, during the frame saving, just like the system above. If I make ethane-->methanol in a smaller water box of 4K molecules, it runs fine.

@lohedges
Contributor

Ah yes, I've found more uses of .count() later in the function. Switching these to .size() still doesn't fix things, though. There are lots of other places in the file (not the function) where .count() is also used, so I'll fix those too and see what happens. I wouldn't be surprised if the same thing happens elsewhere in other trajectory related parts of the codebase.

@lohedges
Contributor

Yes, I'm sure whatever bug it is is being triggered by the system being large.

The exact line that it crashes at is here. Just trying to figure out why.

@lohedges
Contributor

Okay, I've fixed it. The issue was the way in which the QByteArray was being initialised, i.e.:

QByteArray data("\0", nbytes);

According to the documentation here this:

Constructs a byte array containing the first size bytes of array data.

This is not what we want: the input array here is a single null character, so passing the size argument reads past its end. We actually want to use the other constructor, i.e.:

QByteArray data(nbytes, '\0');

From the docs, this does:

Constructs a byte array of size size with every byte set to character ch.

With this I get no crashes for your system, although I still see the other issue on interpreter exit. I'll open a Sire issue and a branch to fix it. Not sure why this hasn't caused issues before, since any sensible number of bytes would have overflowed the input data array.

@lohedges
Contributor

Should now be fixed here. Thanks for reporting!

@lohedges lohedges added the bug Something isn't working label Oct 28, 2024
@lohedges lohedges self-assigned this Oct 28, 2024
@akalpokas
Contributor Author

I have recompiled sire with your fix @lohedges and I can confirm that it works now. Thanks a lot for your quick help as always! It is interesting that on a rare occasion it would still run properly even with the QByteArray initialized incorrectly.

@lohedges
Contributor

lohedges commented Oct 28, 2024

Yes, I'm surprised that this issue wasn't triggered more easily. It probably did cause some kind of memory corruption, but somehow avoided a full-blown segmentation fault.

@chryswoods

Thanks for debugging and catching that - quite a silly bug. I don't know what I was thinking when writing that ;-)
