Concept of managed datasets for create/update/delete #56

Open
mortenpi opened this issue Nov 10, 2022 · 0 comments
One issue that I see with implementing create/update/delete operations (#31, #38) here in DataSets is that different data repositories may have very different ideas of how to execute them, and may require repository-specific information.

A case in point: the TOML-based data repos generally just link to data. Should deletion delete the linked file, or just the metadata? If you create a new dataset, where will the file be? Do you need to pass some options?

One design goal of DataSets is that it provides a universal, relocatable interface. So if you create datasets in a script, that should work consistently, even if you move to a different repository. But if you have to pass repository-specific options, that breaks that principle.

To provide create/update/delete functionality in a generic way, we could introduce the notion of managed datasets: the data repository fully owns and controls the storage. When you create a dataset, you essentially hand it over to the repository, and as the user you cannot exercise any further control over it in your script.

For remote, managed storage of datasets, this is how it must work by definition. But we should also have this for the local Data.toml-based repositories. I imagine the repository would manage a directory somewhere where the data actually gets stored, e.g.:

my-data-project/Project.toml
               /Data.toml
               /.datasets/<uuid1-for-File>
               /.datasets/<uuid2-for-FileTree>/foo.csv
               /.datasets/<uuid2-for-FileTree>/bar.csv

If you now create a dataset in a local project from a file with something like

DataSets.create("new-ds-name", "local/file.csv")

it will generate a UUID for the dataset and simply copy the file to .datasets/<uuid>. This way we also avoid problems such as trying to infer destination file names and running into conflicts.
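To make the mechanism concrete, here is a minimal sketch of what such a managed create could do for a local Data.toml-based repository. Note that `managed_create` and its signature are assumptions for illustration, not part of the current DataSets API:

```julia
using UUIDs

# Hypothetical sketch: create a managed dataset in a local repository by
# copying the source file into the repo-controlled `.datasets/` directory,
# keyed by a freshly generated UUID. `repo_root`, `name`, and the function
# itself are assumed names, not the actual DataSets.jl implementation.
function managed_create(repo_root::AbstractString, name::AbstractString,
                        source::AbstractString)
    uuid = uuid4()
    dest_dir = joinpath(repo_root, ".datasets")
    mkpath(dest_dir)
    # The stored file is keyed by UUID, so the source file name never needs
    # to be inferred and cannot conflict with existing datasets.
    dest = joinpath(dest_dir, string(uuid))
    cp(source, dest)
    # A real implementation would also append a [[datasets]] entry with
    # `name` and `uuid` to Data.toml here.
    return uuid, dest
end
```

A call like `managed_create("my-data-project", "new-ds-name", "local/file.csv")` would then leave the original file untouched and place the managed copy entirely under the repository's control.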

A few closing thoughts:

  • A data repo might not support managed datasets at all. That's fine; you then just can't create/update/delete datasets, only read existing ones. A repo may also have some unmanaged datasets even if it otherwise does support managed ones.
  • All "linked" datasets in a TOML file would be unmanaged, and hence read-only. It would even be worth implementing them via a separate storage driver, in order not to conflate it with the implementation for standard datasets. Not sure about an API for creating such a dataset -- it probably would have to be specific to a data repo, because such a dataset only make sense for some repositories.
  • You might be able to convert linked datasets into managed ones, though, which would copy the data into the repository's storage (whatever that may be).
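The linked/managed distinction could surface in Data.toml roughly as follows. This is a hypothetical fragment: the `LinkedFile`/`ManagedFile` driver names and the `linked_path` key are assumptions for illustration, not the current DataSets.jl schema:

```toml
# Unmanaged: a read-only link to data that lives outside the repository.
[[datasets]]
name = "linked-ds"
uuid = "..."  # placeholder
[datasets.storage]
driver = "LinkedFile"
linked_path = "local/file.csv"

# Managed: the repository owns the data, stored under .datasets/<uuid>.
[[datasets]]
name = "managed-ds"
uuid = "..."  # placeholder
[datasets.storage]
driver = "ManagedFile"
```

Converting a linked dataset into a managed one would then amount to copying the data into `.datasets/<uuid>` and rewriting the entry from the first form to the second.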