Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Trajectories are not properly stored after mols.delete_all_frames() is called. #218

Open
mb2055 opened this issue Jul 25, 2024 · 23 comments
Labels
bug Something isn't working

Comments

@mb2055
Copy link
Contributor

mb2055 commented Jul 25, 2024

Describe the bug
When periodically deleting the frames contained in a system (as is done in SOMD2 when checkpointing), any trajectories contained in a system after calling mols.delete_all_frames() will only contain a single frame. This issue also seems to cause a segmentation fault, but only around 50% of the time.

To Reproduce
Run the following script:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    mols.delete_all_frames()

the output will be the following:

mols currently has 5 frames
mols currently has 1 frames
mols currently has 1 frames
FATAL: exception not rethrown
Aborted (core dumped)

Expected behavior
The output should be

mols currently has 5 frames
mols currently has 5 frames
mols currently has 5 frames

(please complete the following information):

  • OS: Ubuntu 22.04.4 LTS
  • Version of Python: 3.12.4
  • Version of sire: 2024.3.0.dev
  • I confirm that I have checked this bug still exists in the latest released version of sire: yes

Additional context
Note that everything works as expected if mols.delete_all_frames() is not called:

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames
@mb2055 mb2055 added the bug Something isn't working label Jul 25, 2024
@lohedges
Copy link
Contributor

lohedges commented Jul 25, 2024

Just to add that I have also been seeing...

FATAL: exception not rethrown
Aborted (core dumped)

about 50% of the time on interpreter shutdown on my feature_emle branch having merged in devel following the 2024.2.0 release. In this case I am not calling .delete_all_frames() anywhere. I just run a short block of dynamics and get the energy trajectory.

@chryswoods
Copy link
Contributor

That is interesting - the bug is likely happening because the cache is being improperly shared between the copy of mols in d, and the copy of mols that was output by d.commit(). In theory, it should not be shared, and so running mols.delete_all_frames() should not be deleting all frames for the mols in d, and so the right output from your script should be

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames

Let me think about this and investigate further.

@lohedges
Copy link
Contributor

If you use the return_as_system=True kwarg to d.commit() then you can see that the underlying system (for the dynamics object) does still contain all of the frames, i.e.:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit(return_as_system=True)
    print(f"mols currently has {mols.num_frames()} frames")
    mols.delete_all_frames()

Gives:

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames

This still has issues with random crashes on interpreter exit, though.

Looking at the logic in the d.commit() it seems that the issue must be during the update of the original system:

    def commit(self, return_as_system: bool = False):
        if self.is_null():
            return

        self._update_from(
            state=self._get_current_state(include_coords=True, include_velocities=True),
            state_has_cv=(True, True),
            nsteps_completed=self._current_step,
        )

        self._sire_mols.set_energy_trajectory(self._energy_trajectory, map=self._map)

        self._sire_mols.set_ensemble(self.ensemble())

        if return_as_system:
            return self._sire_mols.clone()

        elif self._orig_mols is not None:
            from ..system import System

            if System.is_system(self._orig_mols):
                return self._sire_mols
            else:
                r = self._orig_mols.clone()
                r.update(self._sire_mols.molecules())
                return r
        else:
            return self._sire_mols.clone()

I'll take a look to see what's going wrong.

@lohedges
Copy link
Contributor

Okay, this seems to fix it:

diff --git a/src/sire/mol/_dynamics.py b/src/sire/mol/_dynamics.py
index cdb8a56b..965a67f9 100644
--- a/src/sire/mol/_dynamics.py
+++ b/src/sire/mol/_dynamics.py
@@ -1144,7 +1144,7 @@ class DynamicsData:
             from ..system import System

             if System.is_system(self._orig_mols):
-                return self._sire_mols
+                return self._sire_mols.clone()
             else:
                 r = self._orig_mols.clone()
                 r.update(self._sire_mols.molecules())

@chryswoods: Let me know if this is the correct solution.

I'll look at how to actually delete the frames for the purposes of chunking trajectories in somd2.

@lohedges
Copy link
Contributor

lohedges commented Oct 16, 2024

Hmm, but if you delete the frames from the underlying system object, then it doesn't work as expected again:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_all_frames()

Gives:

mols currently has 5 frames
mols currently has 0 frames
mols currently has 0 frames

@lohedges
Copy link
Contributor

Yeah, there's weirdness going on. For example, if you delete a single frame (the first frame in this case, but it doesn't matter):

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_frame(0)

Gives:

mols currently has 5 frames
mols currently has 4 frames
mols currently has 3 frames

Deleting two frames:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_frame(0)
    d._d._sire_mols.delete_frame(0)

Gives:

mols currently has 5 frames
mols currently has 3 frames
mols currently has 1 frames

I think deleting all frames must close the trajectory file handle somehow, so you are only ever left with one frame. That said, the coordinates of the molecule do appear to be updated, e.g.:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_all_frames()
    print(mols.coordinates()

Gives:

mols currently has 5 frames
( 13.3499 Å, 14.5876 Å, 12.5228 Å )
mols currently has 1 frames
( 13.4061 Å, 14.5797 Å, 12.6688 Å )
mols currently has 1 frames
( 13.2908 Å, 14.5102 Å, 12.5739 Å )

@lohedges
Copy link
Contributor

This is what happens when I print out the call to System::saveFrame:

System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 5 frames
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 0 frames
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 0 frames

So the function to save the frames is being called the correct number of times. I think the number of frames must be inconsistent somewhere.

@lohedges
Copy link
Contributor

Ah, I see, there's a SystemFrames object that stores snapshots from the live trajectory. I think this isn't in sync with the trajectory property in the system somehow. In any case, it's easy to adapt the chunking since, I believe, the trajectory is no longer streamed to file. I can just track the current number of frames in the system and only write the new frames to the chunk. I'll just do this and we can figure out the underlying issue here at a later point if it's too fiddly.

@lohedges
Copy link
Contributor

Yes, I've updated somd2 and it works perfectly. Can revisit this at a later date if needed.

@chryswoods
Copy link
Contributor

Yes, this looks like the issue. There's a lot of challenge and complexity in this code because it mixes objects which are copy on write (System, Molecule) with objects that appear to be copy on write, but are actually explicitly shared (Trajectory contents) and hidden objects that are explicitly shared and (almost) globally cached (the SystemFrames that are stored in the cache. To make it more complex, the cache auto-offloads to disk as each cache frame fills up.

The code was tested well for the normal case (generating a trajectory and saving it) but there's a lot of edge case bugs I guess when doing things like deleting individual trajectory frames from on implicitly shared copy, while the other copy needs a trajectory with all the frames in their original order. In these cases the code has to do a lot of juggling and memory movement.

It would be worth at a future point looking at what somd2 really needs in terms of trajectory management and adding some custom code that implements that scheme (e.g. in C++, so that the trajectory doesn't need to be copied or excessively moved around).

@chryswoods
Copy link
Contributor

Also, correct, the trajectory isn't now streamed to a file. The frames are held in a cache that auto-offloads to disk as needed. You only write to a file when you explicitly save the trajectory. The cache keeps about 512MB of trajectory in memory, if I remember correctly, with the rest going to a page cache on disk. The cache is very efficient, memory mapping pages and giving you good random access performance to individual frames (each frame is paged individually). There's also an intermediary cache and a little look ahead so that frames are pulled from disk to memory efficiently, and stay there as long as needed. In general, it should mean that you don't need to do manual trajectory management now during a simulation or have to worry about filling memory. Instead, you can choose what frames to save after the simulation is complete.

@lohedges
Copy link
Contributor

Great. This is what I'm now doing, i.e. writing the last N frames as a chunk at a checkpoint, then assembling the whole trajectory at the end. I guess I could probably keep the file open throughout and append, but this works well for now and avoids a partially written or corrupted file.

@lohedges
Copy link
Contributor

The random crash on interpreter exit is caused by the CacheData::cleanUpOnExit method. If I comment this out, then I see no crashes whatsoever. I'll take a look to see if I can fix the problem. It would be nice to fix this as part of #250.

@lohedges
Copy link
Contributor

Okay, it's entirely caused by the final bit of the block here where it waits for the threads to stop, then kills them if they don't. It looks like this is due to the 50ms timeout. On my Linux box, increasing this to 100ms causes no crashes. @chryswoods: Is this an acceptable solution? (I could make it even longer if needed.)

@chryswoods
Copy link
Contributor

Yes, this is fine. I suspect the threads are being slowed by file removal requests. This is at the interface of threads, system calls and undefined behaviour if killing a thread, so unsurprising it crashes. Waiting longer is the right solution. Maybe we could use some try joins or similar to wait for intervals of 100ms, and if the thread hasn't exited print a message to the user to say what is happening. For example it could be a 10 to 1 countdown with blast off being force termination the thread (and then hoping it doesn't crash)

@lohedges
Copy link
Contributor

That's a good idea, I'll do that.

@lohedges
Copy link
Contributor

Another update on this. I've checked what's going on in the SystemFrames code and even when calling deleteAllFrames on the system, the underlying frames object is not cleared. For example, when running this script:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_all_frames()

Gives:

SystemFrames::saveFrame: saved frame. Frame count =  1
SystemFrames::saveFrame: saved frame. Frame count =  2
SystemFrames::saveFrame: saved frame. Frame count =  3
SystemFrames::saveFrame: saved frame. Frame count =  4
SystemFrames::saveFrame: saved frame. Frame count =  5
mols currently has 5 frames
SystemFrames::saveFrame: saved frame. Frame count =  6
SystemFrames::saveFrame: saved frame. Frame count =  7
SystemFrames::saveFrame: saved frame. Frame count =  8
SystemFrames::saveFrame: saved frame. Frame count =  9
SystemFrames::saveFrame: saved frame. Frame count =  10
mols currently has 1 frames
SystemFrames::saveFrame: saved frame. Frame count =  11
SystemFrames::saveFrame: saved frame. Frame count =  12
SystemFrames::saveFrame: saved frame. Frame count =  13
SystemFrames::saveFrame: saved frame. Frame count =  14
SystemFrames::saveFrame: saved frame. Frame count =  15
mols currently has 1 frames

So, calling system.delete_all_frames() breaks the frame counting, but frames are still accumulated in the cache. (I am reporting the size of the QVector containing the PageCache handles, not the number of times the function is called.

@lohedges
Copy link
Contributor

Okay, I can get the correct behaviour if I null the SystemTrajectory object when calling System::deleteAllFrames, e.g.:

diff --git a/corelib/src/libs/SireSystem/system.cpp b/corelib/src/libs/SireSystem/system.cpp
index 874d0abb1..868a2b883 100644
--- a/corelib/src/libs/SireSystem/system.cpp
+++ b/corelib/src/libs/SireSystem/system.cpp
@@ -4090,6 +4090,7 @@ void System::deleteAllFrames(const SireBase::PropertyMap &map)
 {
     this->accept();
     this->mustNowRecalculateFromScratch();
+    this->system_trajectory = 0;
     MolGroupsBase::deleteAllFrames(map);
 }

Gives:

SystemFrames::saveFrame: saved frame 0
SystemFrames::saveFrame: saved frame 1
SystemFrames::saveFrame: saved frame 2
SystemFrames::saveFrame: saved frame 3
SystemFrames::saveFrame: saved frame 4
mols currently has 5 frames
SystemFrames::saveFrame: saved frame 0
SystemFrames::saveFrame: saved frame 1
SystemFrames::saveFrame: saved frame 2
SystemFrames::saveFrame: saved frame 3
SystemFrames::saveFrame: saved frame 4
mols currently has 5 frames
SystemFrames::saveFrame: saved frame 0
SystemFrames::saveFrame: saved frame 1
SystemFrames::saveFrame: saved frame 2
SystemFrames::saveFrame: saved frame 3
SystemFrames::saveFrame: saved frame 4
mols currently has 5 frames

This looks like the correct behaviour. I'm just not sure if memory is being freed when nulling in this way.

@mb2055: I'll make this change on my feature_repex branch, push and skip CI, then update the somd2 branch so that you can see if it solves the memory issue.

@lohedges
Copy link
Contributor

Note that I'm not sure how to fix the code when calling deleteFrame rather than deleteAllFrames, since there seemingly isn't a way to remove a single frame from the SystemTrajectory object. I guess I'll need to add methods to do this.

@lohedges
Copy link
Contributor

I can confirm that the above fix gives the correct number of trajectory frames for somd2 when we delete all frames from within the dynamics object each checkpoint. Just need to see if it solves the memory issue too.

@lohedges
Copy link
Contributor

No, it appears the the memory still isn't being freed correctly with this approach. For example, this is the profile for a somd2 repex run. The jumps in memory are during checkpointing where we get the system from the dynamics object, write frames to file, then delete all the frames in the system (within the dynamics object).

Image

@lohedges
Copy link
Contributor

Zoomed in it looks like this. There is a big jump in memory each checkpoint, then a small drop again, but not back to the baseline. If all of the jumps in memory were removed, then it would just be a nice smooth trace that plateaued after some period of time.

At the checkpoint we call d.commit() to get a system, then save the frames from that system, then call d._d._sire_mols.delete_all_frames() to clear the trajectory in the dynamics object. This is the bit that I assume isn't working correctly. Previously I had not deleted the frames since this broke the trajectory indexing, so it made sense that the memory would grow each checkpoint since the internal trajectories were getting bigger. Now they are "deleted", but it looks like memory isn't being freed correctly.

Image

@lohedges
Copy link
Contributor

These are the profiles for a standard (non-repex) and repex runs for 100ps, checkpointing every 10ps. The standard run is the profile for a single window, whereas repex is all windows (replicas running at once).

From the standard profile we can see that memory grows slowly over time and plateaus. There are small jumps each checkpoint when the trajectory chunks are obtained and saved. I think the larger jump near the end is because we re-assemble the entire trajectory from the chunks, then re-save, so the whole trajectory will be in memory for a period. (Maybe there's a way to combine the files somehow?)

For the repex run, we need to imagine 12 of the standard profiles happening at once. As such, there might be a slight mismatch in the time at which replicas hit the checkpoints so the spikes might not be consistent in time or height. They are also bigger, because it's 12 times the amount of memory. As for the standard case, the jump near the end (final checkpoiint) is a bit bigger since it is assembling the trajectories for each replica, rather than just storing a chunk.

As such, it's not clear to me if things are actually okay. By definition, the repex memory footprint will be higher than the standard case since there are N dynamics objects running at the same time, where Nis the number of replicas. For the standard case, we have M running, where M is the number of GPUs, i.e. the number of concurrent processes.

I'll try to profile for a longer run to get a better idea.

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants