[BUG] Trajectories are not properly stored after `mols.delete_all_frames()` is called. #218

mb2055 · 2024-07-25T14:24:41Z

Describe the bug
When periodically deleting the frames contained in a system (as is done in SOMD2 when checkpointing), any trajectories contained in a system after calling mols.delete_all_frames() will only contain a single frame. This issue also seems to cause a segmentation fault, but only around 50% of the time.

To Reproduce
Run the following script:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    mols.delete_all_frames()

the output will be the following:

mols currently has 5 frames
mols currently has 1 frames
mols currently has 1 frames
FATAL: exception not rethrown
Aborted (core dumped)

Expected behavior
The output should be

mols currently has 5 frames
mols currently has 5 frames
mols currently has 5 frames

(please complete the following information):

OS: Ubuntu 22.04.4 LTS
Version of Python: 3.12.4
Version of sire: 2024.3.0.dev
I confirm that I have checked this bug still exists in the latest released version of sire: yes

Additional context
Note that everything works as expected if mols.delete_all_frames() is not called:

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames

The text was updated successfully, but these errors were encountered:

lohedges · 2024-07-25T14:29:37Z

Just to add that I have also been seeing...

FATAL: exception not rethrown
Aborted (core dumped)

about 50% of the time on interpreter shutdown on my feature_emle branch having merged in devel following the 2024.2.0 release. In this case I am not calling .delete_all_frames() anywhere. I just run a short block of dynamics and get the energy trajectory.

chryswoods · 2024-07-25T16:33:00Z

That is interesting - the bug is likely happening because the cache is being improperly shared between the copy of mols in d, and the copy of mols that was output by d.commit(). In theory, it should not be shared, and so running mols.delete_all_frames() should not be deleting all frames for the mols in d, and so the right output from your script should be

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames

Let me think about this and investigate further.

lohedges · 2024-10-16T13:41:13Z

If you use the return_as_system=True kwarg to d.commit() then you can see that the underlying system (for the dynamics object) does still contain all of the frames, i.e.:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit(return_as_system=True)
    print(f"mols currently has {mols.num_frames()} frames")
    mols.delete_all_frames()

Gives:

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames

This still has issues with random crashes on interpreter exit, though.

Looking at the logic in the d.commit() it seems that the issue must be during the update of the original system:

    def commit(self, return_as_system: bool = False):
        if self.is_null():
            return

        self._update_from(
            state=self._get_current_state(include_coords=True, include_velocities=True),
            state_has_cv=(True, True),
            nsteps_completed=self._current_step,
        )

        self._sire_mols.set_energy_trajectory(self._energy_trajectory, map=self._map)

        self._sire_mols.set_ensemble(self.ensemble())

        if return_as_system:
            return self._sire_mols.clone()

        elif self._orig_mols is not None:
            from ..system import System

            if System.is_system(self._orig_mols):
                return self._sire_mols
            else:
                r = self._orig_mols.clone()
                r.update(self._sire_mols.molecules())
                return r
        else:
            return self._sire_mols.clone()

I'll take a look to see what's going wrong.

lohedges · 2024-10-16T13:46:20Z

Okay, this seems to fix it:

diff --git a/src/sire/mol/_dynamics.py b/src/sire/mol/_dynamics.py
index cdb8a56b..965a67f9 100644
--- a/src/sire/mol/_dynamics.py
+++ b/src/sire/mol/_dynamics.py
@@ -1144,7 +1144,7 @@ class DynamicsData:
             from ..system import System

             if System.is_system(self._orig_mols):
-                return self._sire_mols
+                return self._sire_mols.clone()
             else:
                 r = self._orig_mols.clone()
                 r.update(self._sire_mols.molecules())

@chryswoods: Let me know if this is the correct solution.

I'll look at how to actually delete the frames for the purposes of chunking trajectories in somd2.

lohedges · 2024-10-16T14:45:55Z

Hmm, but if you delete the frames from the underlying system object, then it doesn't work as expected again:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_all_frames()

Gives:

mols currently has 5 frames
mols currently has 0 frames
mols currently has 0 frames

lohedges · 2024-10-16T14:59:20Z

Yeah, there's weirdness going on. For example, if you delete a single frame (the first frame in this case, but it doesn't matter):

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_frame(0)

Gives:

mols currently has 5 frames
mols currently has 4 frames
mols currently has 3 frames

Deleting two frames:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_frame(0)
    d._d._sire_mols.delete_frame(0)

Gives:

mols currently has 5 frames
mols currently has 3 frames
mols currently has 1 frames

I think deleting all frames must close the trajectory file handle somehow, so you are only ever left with one frame. That said, the coordinates of the molecule do appear to be updated, e.g.:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_all_frames()
    print(mols.coordinates()

Gives:

mols currently has 5 frames
( 13.3499 Å, 14.5876 Å, 12.5228 Å )
mols currently has 1 frames
( 13.4061 Å, 14.5797 Å, 12.6688 Å )
mols currently has 1 frames
( 13.2908 Å, 14.5102 Å, 12.5739 Å )

lohedges · 2024-10-16T15:33:36Z

This is what happens when I print out the call to System::saveFrame:

System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 5 frames
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 0 frames
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 0 frames

So the function to save the frames is being called the correct number of times. I think the number of frames must be inconsistent somewhere.

lohedges · 2024-10-16T15:56:41Z

Ah, I see, there's a SystemFrames object that stores snapshots from the live trajectory. I think this isn't in sync with the trajectory property in the system somehow. In any case, it's easy to adapt the chunking since, I believe, the trajectory is no longer streamed to file. I can just track the current number of frames in the system and only write the new frames to the chunk. I'll just do this and we can figure out the underlying issue here at a later point if it's too fiddly.

lohedges · 2024-10-16T16:37:28Z

Yes, I've updated somd2 and it works perfectly. Can revisit this at a later date if needed.

chryswoods · 2024-10-17T01:16:09Z

Yes, this looks like the issue. There's a lot of challenge and complexity in this code because it mixes objects which are copy on write (System, Molecule) with objects that appear to be copy on write, but are actually explicitly shared (Trajectory contents) and hidden objects that are explicitly shared and (almost) globally cached (the SystemFrames that are stored in the cache. To make it more complex, the cache auto-offloads to disk as each cache frame fills up.

The code was tested well for the normal case (generating a trajectory and saving it) but there's a lot of edge case bugs I guess when doing things like deleting individual trajectory frames from on implicitly shared copy, while the other copy needs a trajectory with all the frames in their original order. In these cases the code has to do a lot of juggling and memory movement.

It would be worth at a future point looking at what somd2 really needs in terms of trajectory management and adding some custom code that implements that scheme (e.g. in C++, so that the trajectory doesn't need to be copied or excessively moved around).

chryswoods · 2024-10-17T01:20:17Z

Also, correct, the trajectory isn't now streamed to a file. The frames are held in a cache that auto-offloads to disk as needed. You only write to a file when you explicitly save the trajectory. The cache keeps about 512MB of trajectory in memory, if I remember correctly, with the rest going to a page cache on disk. The cache is very efficient, memory mapping pages and giving you good random access performance to individual frames (each frame is paged individually). There's also an intermediary cache and a little look ahead so that frames are pulled from disk to memory efficiently, and stay there as long as needed. In general, it should mean that you don't need to do manual trajectory management now during a simulation or have to worry about filling memory. Instead, you can choose what frames to save after the simulation is complete.

lohedges · 2024-10-17T05:24:11Z

Great. This is what I'm now doing, i.e. writing the last N frames as a chunk at a checkpoint, then assembling the whole trajectory at the end. I guess I could probably keep the file open throughout and append, but this works well for now and avoids a partially written or corrupted file.

lohedges · 2024-10-29T08:13:27Z

The random crash on interpreter exit is caused by the CacheData::cleanUpOnExit method. If I comment this out, then I see no crashes whatsoever. I'll take a look to see if I can fix the problem. It would be nice to fix this as part of #250.

lohedges · 2024-10-29T08:22:49Z

Okay, it's entirely caused by the final bit of the block here where it waits for the threads to stop, then kills them if they don't. It looks like this is due to the 50ms timeout. On my Linux box, increasing this to 100ms causes no crashes. @chryswoods: Is this an acceptable solution? (I could make it even longer if needed.)

chryswoods · 2024-10-29T13:08:40Z

Yes, this is fine. I suspect the threads are being slowed by file removal requests. This is at the interface of threads, system calls and undefined behaviour if killing a thread, so unsurprising it crashes. Waiting longer is the right solution. Maybe we could use some try joins or similar to wait for intervals of 100ms, and if the thread hasn't exited print a message to the user to say what is happening. For example it could be a 10 to 1 countdown with blast off being force termination the thread (and then hoping it doesn't crash)

lohedges · 2024-10-29T13:17:44Z

That's a good idea, I'll do that.

lohedges · 2025-01-17T09:31:21Z

Another update on this. I've checked what's going on in the SystemFrames code and even when calling deleteAllFrames on the system, the underlying frames object is not cleared. For example, when running this script:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_all_frames()

Gives:

SystemFrames::saveFrame: saved frame. Frame count =  1
SystemFrames::saveFrame: saved frame. Frame count =  2
SystemFrames::saveFrame: saved frame. Frame count =  3
SystemFrames::saveFrame: saved frame. Frame count =  4
SystemFrames::saveFrame: saved frame. Frame count =  5
mols currently has 5 frames
SystemFrames::saveFrame: saved frame. Frame count =  6
SystemFrames::saveFrame: saved frame. Frame count =  7
SystemFrames::saveFrame: saved frame. Frame count =  8
SystemFrames::saveFrame: saved frame. Frame count =  9
SystemFrames::saveFrame: saved frame. Frame count =  10
mols currently has 1 frames
SystemFrames::saveFrame: saved frame. Frame count =  11
SystemFrames::saveFrame: saved frame. Frame count =  12
SystemFrames::saveFrame: saved frame. Frame count =  13
SystemFrames::saveFrame: saved frame. Frame count =  14
SystemFrames::saveFrame: saved frame. Frame count =  15
mols currently has 1 frames

So, calling system.delete_all_frames() breaks the frame counting, but frames are still accumulated in the cache. (I am reporting the size of the QVector containing the PageCache handles, not the number of times the function is called.

lohedges · 2025-01-17T09:52:05Z

Okay, I can get the correct behaviour if I null the SystemTrajectory object when calling System::deleteAllFrames, e.g.:

diff --git a/corelib/src/libs/SireSystem/system.cpp b/corelib/src/libs/SireSystem/system.cpp
index 874d0abb1..868a2b883 100644
--- a/corelib/src/libs/SireSystem/system.cpp
+++ b/corelib/src/libs/SireSystem/system.cpp
@@ -4090,6 +4090,7 @@ void System::deleteAllFrames(const SireBase::PropertyMap &map)
 {
     this->accept();
     this->mustNowRecalculateFromScratch();
+    this->system_trajectory = 0;
     MolGroupsBase::deleteAllFrames(map);
 }

Gives:

SystemFrames::saveFrame: saved frame 0
SystemFrames::saveFrame: saved frame 1
SystemFrames::saveFrame: saved frame 2
SystemFrames::saveFrame: saved frame 3
SystemFrames::saveFrame: saved frame 4
mols currently has 5 frames
SystemFrames::saveFrame: saved frame 0
SystemFrames::saveFrame: saved frame 1
SystemFrames::saveFrame: saved frame 2
SystemFrames::saveFrame: saved frame 3
SystemFrames::saveFrame: saved frame 4
mols currently has 5 frames
SystemFrames::saveFrame: saved frame 0
SystemFrames::saveFrame: saved frame 1
SystemFrames::saveFrame: saved frame 2
SystemFrames::saveFrame: saved frame 3
SystemFrames::saveFrame: saved frame 4
mols currently has 5 frames

This looks like the correct behaviour. I'm just not sure if memory is being freed when nulling in this way.

@mb2055: I'll make this change on my feature_repex branch, push and skip CI, then update the somd2 branch so that you can see if it solves the memory issue.

lohedges · 2025-01-17T09:55:15Z

Note that I'm not sure how to fix the code when calling deleteFrame rather than deleteAllFrames, since there seemingly isn't a way to remove a single frame from the SystemTrajectory object. I guess I'll need to add methods to do this.

lohedges · 2025-01-17T10:00:40Z

I can confirm that the above fix gives the correct number of trajectory frames for somd2 when we delete all frames from within the dynamics object each checkpoint. Just need to see if it solves the memory issue too.

lohedges · 2025-01-17T12:07:37Z

No, it appears the the memory still isn't being freed correctly with this approach. For example, this is the profile for a somd2 repex run. The jumps in memory are during checkpointing where we get the system from the dynamics object, write frames to file, then delete all the frames in the system (within the dynamics object).

lohedges · 2025-01-17T12:41:46Z

Zoomed in it looks like this. There is a big jump in memory each checkpoint, then a small drop again, but not back to the baseline. If all of the jumps in memory were removed, then it would just be a nice smooth trace that plateaued after some period of time.

At the checkpoint we call d.commit() to get a system, then save the frames from that system, then call d._d._sire_mols.delete_all_frames() to clear the trajectory in the dynamics object. This is the bit that I assume isn't working correctly. Previously I had not deleted the frames since this broke the trajectory indexing, so it made sense that the memory would grow each checkpoint since the internal trajectories were getting bigger. Now they are "deleted", but it looks like memory isn't being freed correctly.

lohedges · 2025-01-17T14:24:40Z

These are the profiles for a standard (non-repex) and repex runs for 100ps, checkpointing every 10ps. The standard run is the profile for a single window, whereas repex is all windows (replicas running at once).

From the standard profile we can see that memory grows slowly over time and plateaus. There are small jumps each checkpoint when the trajectory chunks are obtained and saved. I think the larger jump near the end is because we re-assemble the entire trajectory from the chunks, then re-save, so the whole trajectory will be in memory for a period. (Maybe there's a way to combine the files somehow?)

For the repex run, we need to imagine 12 of the standard profiles happening at once. As such, there might be a slight mismatch in the time at which replicas hit the checkpoints so the spikes might not be consistent in time or height. They are also bigger, because it's 12 times the amount of memory. As for the standard case, the jump near the end (final checkpoiint) is a bit bigger since it is assembling the trajectories for each replica, rather than just storing a chunk.

As such, it's not clear to me if things are actually okay. By definition, the repex memory footprint will be higher than the standard case since there are N dynamics objects running at the same time, where Nis the number of replicas. For the standard case, we have M running, where M is the number of GPUs, i.e. the number of concurrent processes.

I'll try to profile for a longer run to get a better idea.

mb2055 added the bug Something isn't working label Jul 25, 2024

lohedges added a commit that referenced this issue Oct 17, 2024

Return clone of molecules. [ref #218]

4cd098e

lohedges mentioned this issue Oct 25, 2024

[BUG] SOMD2 Crashes When Saving A Large System OpenBioSim/somd2#60

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Trajectories are not properly stored after `mols.delete_all_frames()` is called. #218

[BUG] Trajectories are not properly stored after `mols.delete_all_frames()` is called. #218

mb2055 commented Jul 25, 2024

lohedges commented Jul 25, 2024 •

edited

Loading

chryswoods commented Jul 25, 2024

lohedges commented Oct 16, 2024

lohedges commented Oct 16, 2024

lohedges commented Oct 16, 2024 •

edited

Loading

lohedges commented Oct 16, 2024

lohedges commented Oct 16, 2024

lohedges commented Oct 16, 2024

lohedges commented Oct 16, 2024

chryswoods commented Oct 17, 2024

chryswoods commented Oct 17, 2024

lohedges commented Oct 17, 2024

lohedges commented Oct 29, 2024

lohedges commented Oct 29, 2024

chryswoods commented Oct 29, 2024

lohedges commented Oct 29, 2024

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

[BUG] Trajectories are not properly stored after mols.delete_all_frames() is called. #218

[BUG] Trajectories are not properly stored after mols.delete_all_frames() is called. #218

Comments

mb2055 commented Jul 25, 2024

lohedges commented Jul 25, 2024 • edited Loading

chryswoods commented Jul 25, 2024

lohedges commented Oct 16, 2024

lohedges commented Oct 16, 2024

lohedges commented Oct 16, 2024 • edited Loading

lohedges commented Oct 16, 2024

lohedges commented Oct 16, 2024

lohedges commented Oct 16, 2024

lohedges commented Oct 16, 2024

chryswoods commented Oct 17, 2024

chryswoods commented Oct 17, 2024

lohedges commented Oct 17, 2024

lohedges commented Oct 29, 2024

lohedges commented Oct 29, 2024

chryswoods commented Oct 29, 2024

lohedges commented Oct 29, 2024

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

lohedges commented Jan 17, 2025

[BUG] Trajectories are not properly stored after `mols.delete_all_frames()` is called. #218

[BUG] Trajectories are not properly stored after `mols.delete_all_frames()` is called. #218

lohedges commented Jul 25, 2024 •

edited

Loading

lohedges commented Oct 16, 2024 •

edited

Loading