Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve Full Trajectory Information in Parallel Analysis #4891

Open
yuxuanzhuang opened this issue Jan 13, 2025 · 4 comments · May be fixed by #4892
Open

Preserve Full Trajectory Information in Parallel Analysis #4891

yuxuanzhuang opened this issue Jan 13, 2025 · 4 comments · May be fixed by #4892

Comments

@yuxuanzhuang
Copy link
Contributor

yuxuanzhuang commented Jan 13, 2025

Is your feature request related to a problem?

In analyses e.g. DistanceMatrix in diffusionmap, we need to run analysis of a single frame over the full trajectory---that is sliced and defined by start, stop, and step in run(). However, in the current parallel analysis implementation, information about the full trajectory is lost (only per-process slices are visible).

Describe the solution you’d like

Firstly, in _setup_frames (which only runs in the main process), store the sliced trajectory information in self._global_slicer. This makes it possible to retrieve global details such as the total number of frames.

Secondly, in self._compute, also store self._global_frame_index so the analysis can track the global frame index for each frame, rather than just per-process slices.

The following code will illustrate its usage

import MDAnalysis as mda 
from MDAnalysisTests.datafiles import PDB, XTC
import numpy as np
from MDAnalysis.analysis.base import AnalysisBase
from MDAnalysis.analysis.results import ResultsGroup

u = mda.Universe(PDB, XTC)

class MyAnalysis(AnalysisBase):
    _analysis_algorithm_is_parallelizable = True

    @classmethod
    def get_supported_backends(cls):
        return ('serial', 'multiprocessing', 'dask',)

    def _prepare(self):
       self.results.frame_index = []
       self.results.global_frame_index = []
       self.results.n_frames = []
       self.results.global_n_frames = []

    def _single_frame(self):
       frame_index = self._frame_index
       global_frame_index = self._global_frame_index

       self.results.frame_index.append(frame_index)
       self.results.global_frame_index.append(global_frame_index)

       self.results.n_frames.append(self.n_frames)

       global_n_frames = len(self._trajectory[self._global_slicer])
       self.results.global_n_frames.append(global_n_frames)

    def _get_aggregator(self):
      return ResultsGroup(lookup={'frame_index': ResultsGroup.flatten_sequence,
                                  'global_frame_index': ResultsGroup.flatten_sequence,
                                  'n_frames': ResultsGroup.flatten_sequence,
                                  'global_n_frames': ResultsGroup.flatten_sequence})

# serial analysis
ana = MyAnalysis(u.trajectory)
ana.run(step=2)
ana.results
> {'frame_index': [0, 1, 2, 3, 4], 'global_frame_index': [0, 1, 2, 3, 4], 'n_frames': [5, 5, 5, 5, 5], 'global_n_frames': [5, 5, 5, 5, 5]}

# parallel analysis
ana = MyAnalysis(u.trajectory)
ana.run(step=2, backend='dask', n_workers=2)
ana.results
> {'frame_index': [0, 1, 2, 0, 1], 'global_frame_index': [0, 1, 2, 3, 4], 'n_frames': [3, 3, 3, 2, 2], 'global_n_frames': [5, 5, 5, 5, 5]}

n_frames and frame_index are not the same for serial and parallel analysis

Related PR

#4745

@marinegor
Copy link
Contributor

@yuxuanzhuang just wanted to say that I saw the PR and plan to review it tomorrow or on Saturday.

as for the _global part, I can think of two solutions:

  • have them implemented as properties with some documentation (instead of pure attributes), since as I understand, the current solution adds attributes and doesn't really allow to e.g. read the documentaion on them directly (instead you have to go read the actual docs, which is one step further away from jupyter's run.attribute?)
  • add an attribute constants or something, that would keep track of all readonly properties that the reader assigns during frame reading, and list all of them there

The last one is a bigger change though, but it also seems that there are a lot of different properties already, that might be worth unifying under a common class.

@orbeckst
Copy link
Member

orbeckst commented Feb 5, 2025

I like the idea of the global constants (or other name... "COMMON", anyone ;-) ), similar to results, as a name space for global variables. At least then it looks less messy.

@marinegor
Copy link
Contributor

Hi @yuxuanzhuang,

my apologies for delaying the review. I have looked at the PR and decided to take the discussion here.

First, I completely understand the rationale behind this, and I agree that it must be implemented. However, there are indeed some issues with the naming, but perhaps generally with how AnalysisBase separates run constants from all other attributes that might come along (it really doesn't, only with _).

Ideally (and also recalling my GSOC days), I'd love to have all the current-timestamp-related attributes in one namespace (either as a nested attribute, or simply with the same prefix), but that is too much to ask, and will break stuff. But we still can do it here -- the _global prefix is actually a good step in that direction.

My problem with _global was that at first, I didn't really understand, if this index/slicer is global (i.e. "for everything"), what is this "everything". Since it represents the parameters of the whole run, I'd simply call it _run_slicer and _run_frame_index, because that's what it's defined by -- the run parameters.

A sub-suggestion is to have something like _computational_group_frame_index and _computational_group_slicer, though that's way too wordy and also duplicates existing attributes. But if you have some suggestions here, would be double cool.

But, I'd approach the documentation from that run/computational-group standpoint: there are two groups of parameters, whole-run-related and computational-group-related, and I'd explicitly list them as such in the docs.


So action points:

  • rename _global_... to _run_...
  • add documentation section explicitly listing computational group attributes (all old ones) vs whole run attributes (_run_... ones)
  • long-term: think about how implement this run/computation-group namespaces

For the last one, perhaps making indeed self.constants.{run,computational_group} namespaces would be ok, and then the computational_group ones would link to existing attributes, while run would be new. But generally I'm ok with having things as they are.

@marinegor
Copy link
Contributor

Oh, and the other thing -- if the sole purpose of that is to allow people who write custom parallelization code to iterate through the trajectory independently of the current slicer, why don't you write it as a custom method, instead of making them do self._trajectory[self._global_slicer]? That could be slightly cleaner, I guess, and also would have documentation, whereas the indexing of trajectory paradigm would be less discoverable in the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants