One issue that I see with implementing create/update/delete operations (#31, #38) here in DataSets is that different data repositories may have very different ideas of how to execute them, and may require repository-specific information.
A case in point: the TOML-based data repos generally just link to data. Should deletion delete the linked file, or just the metadata? If you create a new dataset, where will the file be? Do you need to pass some options?
One design goal of DataSets is that it provides a universal, relocatable interface. So if you create datasets in a script, that should work consistently, even if you move to a different repository. But if you have to pass repository-specific options, that breaks that principle.
To provide create/update/delete functionality in a generic way, we could have the notion of managed datasets: the data repository fully owns and controls the storage. When you create a dataset, you essentially hand it over to the repository, and as the user you cannot exercise any more control over it in your script.
For remote, managed storage of datasets, this is how it must work by definition. But we should also have this for the local `Data.toml`-based repositories. I imagine that your repository would manage a directory somewhere where the data actually gets stored, e.g.:
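```
MyProject/
├── Data.toml          # dataset metadata, as today
└── .datasets/         # storage owned entirely by the repository
    └── <uuid>         # one file/directory per managed dataset
```

(This layout is just a sketch; the actual naming and structure would be up to the repository implementation.)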
Now, if you create a dataset in a local project from a file with something like
```julia
DataSets.create("new-ds-name", "local/file.csv")
```
it will generate a UUID for it and just copy the file to `.datasets/<uuid>`. This way we also avoid problems such as trying to infer destination file names and running into conflicts.
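As a minimal sketch of that flow, assuming a hypothetical `create` function and the `.datasets` layout from above (none of this is existing DataSets.jl API):

```julia
using UUIDs

# Hypothetical implementation sketch -- `create` is not an existing
# DataSets.jl function, and the `.datasets` layout is just the proposal
# from above.
function create(repo_root::AbstractString, name::AbstractString, file::AbstractString)
    uuid = uuid4()                                  # identity assigned by the repository
    dest = joinpath(repo_root, ".datasets", string(uuid))
    mkpath(dirname(dest))                           # ensure .datasets/ exists
    cp(file, dest)                                  # the repository now owns this copy
    # ...record `name` => `uuid` in Data.toml here...
    return uuid
end
```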
A few closing thoughts:
- A data repo might not support managed datasets at all. That's fine: you just can't create/update/delete datasets then, only read existing ones. It may also have some datasets that are unmanaged, even if it otherwise does support them.
- All "linked" datasets in a TOML file would be unmanaged, and hence read-only. It might even be worth implementing them via a separate storage driver, in order not to conflate them with the implementation for standard datasets. I'm not sure about an API for creating such a dataset -- it would probably have to be specific to a data repo, because such a dataset only makes sense for some repositories.
- You might be able to convert linked datasets into managed ones though, which would copy them to the repository's storage (whatever that may be); see the sketch below.
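To make the capability and conversion ideas concrete, here is a rough sketch under assumed names -- `supports_managed`, `ismanaged`, and `manage!` are hypothetical, as are the toy types; none of this is existing DataSets.jl API:

```julia
using UUIDs

# All names here are hypothetical -- a toy model of the proposal, not
# anything DataSets.jl currently defines.
struct TomlRepo
    root::String            # project directory containing Data.toml
end

mutable struct DataSetEntry
    name::String
    uuid::UUID
    path::String            # link target if unmanaged, repo-relative if managed
    managed::Bool
end

supports_managed(repo) = false              # repositories opt in explicitly
supports_managed(::TomlRepo) = true
ismanaged(ds::DataSetEntry) = ds.managed

# Convert a linked (unmanaged) dataset into a managed one by copying its
# data into the repository's own storage.
function manage!(repo::TomlRepo, ds::DataSetEntry)
    supports_managed(repo) || error("repository does not support managed datasets")
    ismanaged(ds) && return ds
    dest = joinpath(repo.root, ".datasets", string(ds.uuid))
    mkpath(dirname(dest))
    cp(ds.path, dest)
    ds.path = joinpath(".datasets", string(ds.uuid))
    ds.managed = true
    return ds
end
```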