Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PathLike wrapper/cache for ExternalStorage #186

Open
wants to merge 64 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
64 commits
Select commit Hold shift + click to select a range
ffdb901
PathLike wrapper/cache for ExternalStorage
dwhswenson Apr 26, 2023
4ad7fb1
Merge branch 'main' into shared-object-v2
dwhswenson Apr 26, 2023
eb19e0f
mypy
dwhswenson Apr 26, 2023
8e6a78c
Merge branch 'shared-object-v2' of github.com:OpenFreeEnergy/gufe int…
dwhswenson Apr 26, 2023
3be13c0
docstrings
dwhswenson Apr 27, 2023
dda02e2
Merge branch 'main' of github.com:OpenFreeEnergy/gufe into shared-obj…
dwhswenson May 17, 2023
472151e
Add StorageManager code
dwhswenson May 30, 2023
b692c2f
rename to Stagine
dwhswenson May 30, 2023
7432ff4
Add tests for _delete_empty_dirs
dwhswenson May 30, 2023
319d0d0
Storage docs
dwhswenson May 31, 2023
5650609
outline of storage manager tests
dwhswenson May 31, 2023
ddbbd19
minor improvements on staging directory
dwhswenson May 31, 2023
c5ce48a
first storage lifecycle test works
dwhswenson May 31, 2023
236b263
Merge branch 'main' of github.com:OpenFreeEnergy/gufe into shared-obj…
dwhswenson May 31, 2023
b805aac
cleanup mypy
dwhswenson May 31, 2023
ed5e83c
change to unit taking in the label
dwhswenson May 31, 2023
6181039
lots of updates; switched to harness for tests
dwhswenson Jun 5, 2023
1880f73
Big reorg for shared overlapping staging
dwhswenson Jun 6, 2023
1e4ca2c
remove _storage_path_conflict
dwhswenson Jun 6, 2023
a6d26f3
docs & types
dwhswenson Jun 6, 2023
aabbc33
docs, types, logging
dwhswenson Jun 6, 2023
b4d73b3
finish TestHoldingOverlapsPermanentStorageManager
dwhswenson Jun 6, 2023
7af006e
mypy
dwhswenson Jun 6, 2023
58a58bc
test_repr
dwhswenson Jun 6, 2023
8e429f5
renaming around DAGContextManager
dwhswenson Jun 6, 2023
b70df48
holding => staging
dwhswenson Jun 7, 2023
08e3ac2
finish docs (I think?)
dwhswenson Jun 7, 2023
ca7871b
remove completed TODO
dwhswenson Jun 7, 2023
2aa0616
start to testing edge case logging
dwhswenson Jun 9, 2023
7c03dcd
Update stagingdirectory.py
richardjgowers Jun 12, 2023
383075e
tests for single_file_transfer logging
dwhswenson Jun 12, 2023
6365398
tests for read-only transfers
dwhswenson Jun 16, 2023
d35bd60
fix repr and cleanup tests
dwhswenson Jun 16, 2023
7cc10f9
test for permanent transfer to external
dwhswenson Jun 16, 2023
ab025f1
test for Permanent delete staging
dwhswenson Jun 17, 2023
cd70ab2
Add test for missing file on cleanup
dwhswenson Jun 17, 2023
ac1b1d0
Merge branch 'shared-object-v2' of github.com:OpenFreeEnergy/gufe int…
dwhswenson Jun 17, 2023
ea054be
Merge branch 'main' of github.com:OpenFreeEnergy/gufe into shared-obj…
dwhswenson Jun 22, 2023
90f2597
get_other_shared to private
dwhswenson Jun 22, 2023
a2e05b2
Merge branch 'main' into shared-object-v2
dwhswenson Jul 6, 2023
80eccc4
Merge branch 'main' into shared-object-v2
dwhswenson Aug 28, 2023
e9ed7a8
Merge branch 'main' of github.com:OpenFreeEnergy/gufe into shared-obj…
dwhswenson Sep 8, 2023
b4e1d42
Merge branch 'main' into shared-object-v2
dotsdl Sep 8, 2023
2b070ca
Merge branch 'shared-object-v2' of github.com:OpenFreeEnergy/gufe int…
dwhswenson Sep 9, 2023
4fd4a66
Merge branch 'main' into shared-object-v2
dotsdl Sep 12, 2023
5e56461
Merge branch 'main' into shared-object-v2
dotsdl Sep 19, 2023
73d3a1e
Merge branch 'main' into shared-object-v2
dwhswenson Nov 3, 2023
524cc6e
Merge branch 'main' of github.com:OpenFreeEnergy/gufe into shared-obj…
dwhswenson Dec 1, 2023
dd1b6dc
updates from other branch
dwhswenson Dec 1, 2023
a575dd3
make mypy happy
dwhswenson Dec 1, 2023
ba6fcff
Merge branch 'main' into shared-object-v2
dwhswenson Dec 1, 2023
5d0df5f
pep8
dwhswenson Dec 1, 2023
1418aee
Merge branch 'shared-object-v2' of github.com:OpenFreeEnergy/gufe int…
dwhswenson Dec 1, 2023
e057332
pep8
dwhswenson Dec 1, 2023
1cfe910
StagingDirectory -> StagingRegistry
dwhswenson Dec 4, 2023
78e003b
remove prefix; remove get_other_shared
dwhswenson Dec 7, 2023
265e786
delete_empty_dirs => keep_empty_dirs
dwhswenson Dec 7, 2023
006d787
Merge branch 'main' into shared-object-v2
dwhswenson Dec 11, 2023
ce12326
Add logging to not clean up registered directory
dwhswenson Dec 11, 2023
9afeb2f
Merge branch 'shared-object-v2' of github.com:OpenFreeEnergy/gufe int…
dwhswenson Dec 11, 2023
aaa2aab
StagingPath.fspath => StagingPath.as_path()
dwhswenson Dec 13, 2023
cf60d1b
pep8
dwhswenson Dec 13, 2023
da9955e
remove fspath from StagingRegistry
dwhswenson Dec 22, 2023
8489c24
Merge branch 'main' into shared-object-v2
dwhswenson Jan 22, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
156 changes: 156 additions & 0 deletions gufe/storage/pseudodirectory.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,156 @@
from typing import Union, Optional
from pathlib import Path
from os import PathLike
from .externalresource import ExternalStorage

import logging
_logger = logging.getLogger(__name__)

class SharedRoot:
richardjgowers marked this conversation as resolved.
Show resolved Hide resolved
"""
Parameters
----------
scratch : os.PathLike
the scratch directory shared by all objects on this host
external : :class:`.ExternalStorage`
external storage resource where objects should eventualy go
prefix : str
label for this specific unit
holding : os.PathLike
name of the subdirectory of scratch where shared results are
temporarily stored; default is '.holding'. This must be the same for
all units within a DAG.
delete_holding : bool
whether to delete the contents of the $SCRATCH/$HOLDING/$PREFIX
directory when this object is deleted
read_only : bool
write to prevent NEW files from being written within this shared
directory. NOTE: This will not prevent overwrite of existing files,
in scratch space, but it will prevent changed files from uploading
to the external storage.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This read_only is mainly to prevent the following mistake:

shared = root.get_other_shared(other_unit_prefix)
with open(shared / "oldfile.txt") as f:
    ... # good so far!

with open(shared  / "newfile.txt", mode='wb') as f:
    ... # uh-oh -- now we're writing into the wrong unit's space!

However, it won't stop open(shared / "oldfile.txt", mode='wb'). You can still overwrite that data. If that data is in a separate permanent storage, it will be fine -- only overwrite locally. (If you set shared up such that the directory it uses is identical to the internal holding cache here, of course, it would overwrite that.)

"""
def __init__(
self,
scratch: PathLike,
external: ExternalStorage,
prefix: str,
*,
holding: PathLike =".holding",
delete_holding: bool = True,
read_only: bool = False,
):
self.external = external
self.scratch = Path(scratch)
self.prefix = Path(prefix)
self.read_only = read_only
self.delete_holding = delete_holding
self.holding = holding

self.registry = set()
# NOTE: the fact that we use $SCRATCH/$HOLDING/$PREFIX instead of
# $SCRATCH/$PREFIX/$HOLDING is important for 2 reasons:
# 1. This doesn't take any of the user's namespace from their
# $SCRATCH/$PREFIX directory.
# 2. This allows us to easily use an external FileStorage where the
# external storage is exactly the same as this local storage,
# meaning that copies to/from the external storage are no-ops.
# Use FileStorage(scratch / holding) for that.
self.shared_dir = self.scratch / holding / prefix
self.shared_dir.mkdir(exist_ok=True, parents=True)

def get_other_shared_dir(self, prefix, delete_holding=None):
"""Get a related unit's shared directory.
"""
if delete_holding is None:
delete_holding = self.delete_holding

return SharedRoot(
scratch=self.scratch,
external=self.external,
prefix=prefix,
holding=self.holding,
delete_holding=delete_holding,
read_only=True,
)

def transfer_holding_to_external(self):
"""Transfer all objects in the registry to external storage"""
if self.read_only:
logging.debug("Read-only: Not transfering to external storage")
return # early exit

for obj in self.registry:
path = Path(obj)
if not path.exists():
logging.info(f"Found nonexistent path {path}, not "
"transfering to external storage")
elif path.is_dir():
logging.debug(f"Found directory {path}, not "
"transfering to external storage")
else:
logging.info(f"Transfering {path} to external storage")
self.external.store_path(obj.label, path)

def __del__(self):
# take everything in self.shared_dir and write to it shared; keeping
# our prefix
self.transfer_holding_to_external()
if self.delete_holding:
shutil.rmtree(self.shared_dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc __del__ isn't called on del obj but instead when garbage collection happens. Maybe instead make transfer_holding_to_external always explicit and don't rely on garbage collection related things

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works for me ... it's an executor-level thing anyway, and I don't mind asking the executor writer to think about how files get moved around (don't want to ask the protocol writer to think about that)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe if you want this behaviour, use a context manager, so instead of __del__ use __exit__

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This started out as a context manager. I switched it away because it makes actual usage kind of weird -- the entirety of the execution block where this is used gets indented by one level, which I don't particularly like. It gets worse if you have one of these for shared and one for permanent -- there's a syntax for setting multiple context managers at once, but I've also always found that kind of weird.

But I can switch it back to a context manager if you want 🤷‍♂️


def register_path(self, shared_path):
richardjgowers marked this conversation as resolved.
Show resolved Hide resolved
label_exists = self.external.exists(shared_path.label)

if self.read_only and not label_exists:
raise IOError(f"Unable to create '{shared_path.label}'. This "
"shared path is read-only.")

self.registry.add(shared_path)

# if this is a file that exists, bring it into our subdir
# NB: this happens even if you're intending to overwrite the path,
# which is kind of wasteful
if label_exists:
scratch_path = self.shared_dir / shared_path.path
# TODO: switch this to using `get_filename` and `store_path`
with self.external.load_stream(shared_path.label) as f:
external_bytes = f.read()
if scratch_path.exists():
... # TODO: something to check that the bytes are the same?
scratch_path.parent.mkdir(exist_ok=True, parents=True)
with open(scratch_path, mode='wb') as f:
f.write(external_bytes)

def __truediv__(self, path: PathLike):
return SharedPath(root=self, path=path)

def __fspath__(self):
return str(self.shared_dir)

def __repr__(self):
return f"SharedRoot({self.scratch}, {self.external}, {self.prefix})"


class SharedPath:
richardjgowers marked this conversation as resolved.
Show resolved Hide resolved
def __init__(self, root: SharedRoot, path: PathLike):
self.root = root
self.path = Path(path)
self.root.register_path(self)

def __truediv__(self, path):
return SharedPath(self.root, self.path / path)

def __fspath__(self):
return str(self.root.shared_dir / self.path)

@property
def label(self):
return str(self.root.prefix / self.path)

def __repr__(self):
return f"SharedPath({self.__fspath__()})"

# TODO: how much of the pathlib.Path interface do we want to wrap?
# although edge cases may be a pain, we can get most of it with, e.g.:
# def exists(self): return Path(self).exists()
# but also, can do pathlib.Path(shared_path) and get hte whole thing
107 changes: 107 additions & 0 deletions gufe/tests/storage/test_pseudodirectory.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
import pytest

import os
import pathlib

from gufe.storage.externalresource import MemoryStorage
from gufe.storage.pseudodirectory import SharedRoot


@pytest.fixture
def root(tmp_path):
external = MemoryStorage()
external.store_bytes("old_unit/data.txt", b"foo")
root = SharedRoot(
scratch=tmp_path,
external=external,
prefix="new_unit",
delete_holding=False
)
return root

@pytest.fixture
def root_with_contents(root):
with open(root / "data.txt", mode='wb') as f:
f.write(b"bar")

return root

class TestSharedRoot:
@pytest.mark.parametrize('pathlist', [
['file.txt'], ['dir', 'file.txt']
])
def test_path(self, root, pathlist):
path = root
for p in pathlist:
path = path / p

inner_path = os.sep.join(pathlist)
actual_path = root.shared_dir / inner_path

assert pathlib.Path(path) == actual_path

def test_read_old(self, root):
# When the file doesn't exist locally, it should be pulled down the
# first time that we register the path.

# initial conditions, without touching SharedRoot/SharedPath
label = "old_unit/data.txt"
on_filesystem = root.scratch / root.holding / label
assert not on_filesystem.exists()
assert root.external.exists(label)

# when we create the specific SharedPath, it registers and
# "downloads" the file
old_shared = root.get_other_shared_dir("old_unit")
filepath = old_shared / "data.txt"
assert pathlib.Path(filepath) == on_filesystem
assert on_filesystem.exists()

# let's just be sure we can read in the data as desired
with open(filepath, mode='rb') as f:
assert f.read() == b"foo"

def test_write_new(self, root):
label = "new_unit/somefile.txt"
on_filesystem = root.scratch / root.holding / label
assert not on_filesystem.exists()
with open(root / "somefile.txt", mode='wb') as f:
f.write(b"testing")

# this has been written to disk in scratch, but not yet saved to
# external storage
assert on_filesystem.exists()
assert not root.external.exists(label)

def test_write_old_fail(self, root):
old_shared = root.get_other_shared_dir("old_unit")
with pytest.raises(IOError, match="read-only"):
old_shared / "foo.txt"

def test_transfer_to_external(self, root_with_contents):
path = list(root_with_contents.registry)[0] # only 1
assert not root_with_contents.external.exists(path.label)

root_with_contents.transfer_holding_to_external()
assert root_with_contents.external.exists(path.label)

with root_with_contents.external.load_stream(path.label) as f:
assert f.read() == b"bar"

def test_transfer_to_external_no_file(self, root):
...

def test_tranfer_to_external_directory(self, root):
...

def test_del(self):
...

def test_existing_local_and_external(self, root):
...

def test_existing_local_and_external_conflict(self, root):
...

def test_no_transfer_for_read_only(self, root):
...