-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[wip] Storage manager update #679
base: main
Are you sure you want to change the base?
Conversation
Hello @richardjgowers! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2024-01-22 14:33:52 UTC |
currently fails with: UnsupportedOperation: not readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick overview of current status:
Main thing is that I think you're using file.as_path()
more frequently than needed. This should be used primarily for avoiding the automatic registration that comes from pathlib.Path(file)
, especially for files that already exist in the cloud (user story: don't download a big file from the cloud when all I want is its parent directory, so I use .as_path()
).
When the file doesn't exist, using just pathlib.Path(file)
will mark it as something to transfer to the appropriate external storage. Mainly, you don't want to do this on files that you delete during the unit (although those really should have been in scratch
, not in either shared
or permanent
). I think it will also do something on transfer if file
in that was a directory (either warning or info-level log message; can't remember which).
You're also using str(file.as_path())
frequently. (1) I think in most cases you actually want str(pathlib.Path(file))
, which invokes registration, so you don't need to do it manually; and (2) we should probably implement StagingPath.__str__(self): return str(pathlib.Path(self))
, so that your usage would actually just be str(file)
.
shared_basepath: stagingregistry.StagingRegistry, | ||
permanent_basepath: stagingregistry.StagingRegistry, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these are StagingRegistry
... aren't they StagingPath
?
The input to this method was, e.g., root/$DAG_LABEL/shared_$UNIT_LABEL
. StagingRegistry
would correspond to root/
. The full path comes in as a StagingPath
, I believe (this is something that changed at one point, because there was no need to expose both classes to protocol developers; it also simplified serialization, IIRC.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working off these hints here, these three vars are just the context unpacked: https://github.com/OpenFreeEnergy/gufe/blob/staging-execute-dag/gufe/protocols/protocolunit.py#L30
scratch_basepath: StagingDirectory | ||
Where to store temporary files | ||
shared_basepath : StagingDirectory | ||
Where to run the calculation | ||
permanent_basepath : StagingDirectory | ||
Where to store files that must persist beyond the DAG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update typing in docstring
@@ -664,11 +664,13 @@ def run(self, *, dry=False, verbose=True, | |||
else: | |||
ffcache = None | |||
|
|||
ffcache.register() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem 1 (easy): this line and the ffcache.as_path()
below will fail if ffcache is None
.
Problem 2 (more complicated): How is ffcache
supposed to work? Isn't that supposed to be something that we can pull down from some previous unit? shared_basepath
is supposed to be for a specific execution of a given unit. That directory might not exist until we're about to start executing this unit. How does a file get in there? How are we currently using that? [EDIT: I'm having a call with @IAlibay tomorrow to clarify what is going on with this]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reading through, I think the only thing that would be cached would be the LJ parameters for the small molecules. We can probably remove this?
chk = sim_settings.checkpoint_storage | ||
reporter = multistate.MultiStateReporter( | ||
storage=nc, | ||
storage=str(nc.as_path()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're using this str(file.as_path())
pattern a lot. When you use .as_path()
, you're explicitly bypassing the automatic registration mechanism, meaning that you must explicitly register the object.
I'm not sure if this is implemented, but you should be able to instead just use str(file)
, which should also register the file. Equivalently, str(pathlib.Path(file))
will definitely work, which means you don't need to manually use .register()
.
You should only need to use .register()
on paths that are created outside your control (like the checkpoint and RTA files, which are paths generated inside OpenMMTools and not exposed to us).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah once it's MultiStateReporter wants a string (not path) and the other is going into a subprocess call, so again it's bypassing file-like objects and is a string.
'nc': nc.as_path(), | ||
'last_checkpoint': checkpoint.as_path(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better to return the StagingPath
itself, which can then be reloaded even if you move the path. Imagine something running on the EC2, where files will get stored on S3. Knowing the path on your EC2 instance isn't particularly useful.
**analyzer.unit_results_dict | ||
} | ||
else: | ||
return {'debug': {'sampler': sampler}} | ||
|
||
@staticmethod | ||
def analyse(where) -> dict: | ||
def analyse(where: stagingregistry.StagingRegistry) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is probably a StagingPath
# don't put energy analysis in here, it uses the open file reporter | ||
# whereas structural stuff requires that the file handle is closed | ||
ret = subprocess.run(['openfe_analysis', str(where)], | ||
trjdir = (where / '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the purpose of doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the question I was asking on slack. I needed a .as_path()
of a directory and StagingRegistry
doesn't have this
trjdir = (where / '') | ||
output = (where / 'results.json') | ||
ret = subprocess.run(['openfe_analysis', 'RFE_analysis', | ||
str(trjdir.as_path()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, has the requirement that the files are already in the directory trjdir
. Note that this will only work for situations where this analyse
method is called on the same computer where the data was generated.
To run it on a different computer would require downloading all files within the directory. The best way to do that now would be to enumerate them here -- in principle there is a way to obtain all files with a given string prefix, but all of that gets deep into the weeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method is called from _execute()
right after run()
so it should always be the same machine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is a little ugly that it needs to know if the trj was stored in scratch/shared/permanent, so there's probably something there I need to fix
plt.close(f2) | ||
|
||
f3 = plotting.plot_ligand_RMSD(data['time(ps)'], data['ligand_RMSD']) | ||
f3.savefig(savedir / "ligand_RMSD.png") | ||
f3.savefig(where / "ligand_RMSD.png") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not really this line, but this is the closest changed line)
You may want to include the figure filenames in the return value, too. Otherwise it will be harder to download them from the cloud.
[no ci]
fixes #292
Developers certificate of origin