Preserve Full Trajectory Information in Parallel Analysis #4891

yuxuanzhuang · 2025-01-13T23:57:17Z

Is your feature request related to a problem?

In analyses e.g. DistanceMatrix in diffusionmap, we need to run analysis of a single frame over the full trajectory---that is sliced and defined by start, stop, and step in run(). However, in the current parallel analysis implementation, information about the full trajectory is lost (only per-process slices are visible).

Describe the solution you’d like

Firstly, in _setup_frames (which only runs in the main process), store the sliced trajectory information in self._global_slicer. This makes it possible to retrieve global details such as the total number of frames.

Secondly, in self._compute, also store self._global_frame_index so the analysis can track the global frame index for each frame, rather than just per-process slices.

The following code will illustrate its usage

import MDAnalysis as mda 
from MDAnalysisTests.datafiles import PDB, XTC
import numpy as np
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis.results import ResultsGroup

u = mda.Universe(PDB, XTC)

class MyAnalysis(AnalysisBase):
    _analysis_algorithm_is_parallelizable = True

    @classmethod
    def get_supported_backends(cls):
        return ('serial', 'multiprocessing', 'dask',)

    def _prepare(self):
       self.results.frame_index = []
       self.results.global_frame_index = []
       self.results.n_frames = []
       self.results.global_n_frames = []

    def _single_frame(self):
       frame_index = self._frame_index
       global_frame_index = self._global_frame_index

       self.results.frame_index.append(frame_index)
       self.results.global_frame_index.append(global_frame_index)

       self.results.n_frames.append(self.n_frames)

       global_n_frames = len(self._trajectory[self._global_slicer])
       self.results.global_n_frames.append(global_n_frames)

    def _get_aggregator(self):
      return ResultsGroup(lookup={'frame_index': ResultsGroup.flatten_sequence,
                                  'global_frame_index': ResultsGroup.flatten_sequence,
                                  'n_frames': ResultsGroup.flatten_sequence,
                                  'global_n_frames': ResultsGroup.flatten_sequence})

# serial analysis
ana = MyAnalysis(u.trajectory)
ana.run(step=2)
ana.results
> {'frame_index': [0, 1, 2, 3, 4], 'global_frame_index': [0, 1, 2, 3, 4], 'n_frames': [5, 5, 5, 5, 5], 'global_n_frames': [5, 5, 5, 5, 5]}

# parallel analysis
ana = MyAnalysis(u.trajectory)
ana.run(step=2, backend='dask', n_workers=2)
ana.results
> {'frame_index': [0, 1, 2, 0, 1], 'global_frame_index': [0, 1, 2, 3, 4], 'n_frames': [3, 3, 3, 2, 2], 'global_n_frames': [5, 5, 5, 5, 5]}

n_frames and frame_index are not the same for serial and parallel analysis

Related PR

#4745

The text was updated successfully, but these errors were encountered:

marinegor · 2025-01-15T18:42:22Z

@yuxuanzhuang just wanted to say that I saw the PR and plan to review it tomorrow or on Saturday.

as for the _global part, I can think of two solutions:

have them implemented as properties with some documentation (instead of pure attributes), since as I understand, the current solution adds attributes and doesn't really allow to e.g. read the documentaion on them directly (instead you have to go read the actual docs, which is one step further away from jupyter's run.attribute?)
add an attribute constants or something, that would keep track of all readonly properties that the reader assigns during frame reading, and list all of them there

The last one is a bigger change though, but it also seems that there are a lot of different properties already, that might be worth unifying under a common class.

orbeckst · 2025-02-05T01:12:12Z

I like the idea of the global constants (or other name... "COMMON", anyone ;-) ), similar to results, as a name space for global variables. At least then it looks less messy.

marinegor · 2025-02-19T23:54:06Z

Hi @yuxuanzhuang,

my apologies for delaying the review. I have looked at the PR and decided to take the discussion here.

First, I completely understand the rationale behind this, and I agree that it must be implemented. However, there are indeed some issues with the naming, but perhaps generally with how AnalysisBase separates run constants from all other attributes that might come along (it really doesn't, only with _).

Ideally (and also recalling my GSOC days), I'd love to have all the current-timestamp-related attributes in one namespace (either as a nested attribute, or simply with the same prefix), but that is too much to ask, and will break stuff. But we still can do it here -- the _global prefix is actually a good step in that direction.

My problem with _global was that at first, I didn't really understand, if this index/slicer is global (i.e. "for everything"), what is this "everything". Since it represents the parameters of the whole run, I'd simply call it _run_slicer and _run_frame_index, because that's what it's defined by -- the run parameters.

A sub-suggestion is to have something like _computational_group_frame_index and _computational_group_slicer, though that's way too wordy and also duplicates existing attributes. But if you have some suggestions here, would be double cool.

But, I'd approach the documentation from that run/computational-group standpoint: there are two groups of parameters, whole-run-related and computational-group-related, and I'd explicitly list them as such in the docs.

So action points:

rename _global_... to _run_...
add documentation section explicitly listing computational group attributes (all old ones) vs whole run attributes (_run_... ones)
long-term: think about how implement this run/computation-group namespaces

For the last one, perhaps making indeed self.constants.{run,computational_group} namespaces would be ok, and then the computational_group ones would link to existing attributes, while run would be new. But generally I'm ok with having things as they are.

marinegor · 2025-02-20T09:30:14Z

Oh, and the other thing -- if the sole purpose of that is to allow people who write custom parallelization code to iterate through the trajectory independently of the current slicer, why don't you write it as a custom method, instead of making them do self._trajectory[self._global_slicer]? That could be slightly cleaner, I guess, and also would have documentation, whereas the indexing of trajectory paradigm would be less discoverable in the docs.

yuxuanzhuang linked a pull request Jan 14, 2025 that will close this issue

Preserve Full Trajectory Information in AnalysisBase #4892

Open

5 tasks

orbeckst added Component-Analysis parallelization labels Jan 14, 2025

orbeckst assigned yuxuanzhuang Feb 5, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preserve Full Trajectory Information in Parallel Analysis #4891

Preserve Full Trajectory Information in Parallel Analysis #4891

yuxuanzhuang commented Jan 13, 2025 •

edited

Loading

marinegor commented Jan 15, 2025

orbeckst commented Feb 5, 2025

marinegor commented Feb 19, 2025

marinegor commented Feb 20, 2025

Preserve Full Trajectory Information in Parallel Analysis #4891

Preserve Full Trajectory Information in Parallel Analysis #4891

Comments

yuxuanzhuang commented Jan 13, 2025 • edited Loading

Is your feature request related to a problem?

Describe the solution you’d like

Related PR

marinegor commented Jan 15, 2025

orbeckst commented Feb 5, 2025

marinegor commented Feb 19, 2025

marinegor commented Feb 20, 2025

yuxuanzhuang commented Jan 13, 2025 •

edited

Loading