CRUD updates, dataset mutation and BlobTree API updates #38

Closed
wants to merge 3 commits into from


@c42f c42f commented Apr 26, 2022

This is a big batch of changes, implementing:

  • A big rewrite of the BlobTree API to make it simpler and more coherent
  • Opening a dataset for mutation with open(write=true)
  • A CRUD interface for modifying DataProject
  • Changes to the data driver interface to support all this

A lot of these changes are intertwined so I've put all this here as a draft, but I'll probably need to break this apart into separate PRs.

BlobTree

BlobTree now has a largely dictionary-like interface:

  • List keys (i.e., file and directory names): keys(tree)
  • List keys and values: pairs(tree)
  • Query keys: haskey(tree, path)
  • Traverse the tree: tree[path]
  • Add new content: newdir(tree, path), newfile(tree, path)
  • Delete content: delete!(tree, path)

Here path is either a RelPath (the relative path type) or an AbstractString, in which case it is split on / to become a relative path.

Unlike Dict, iterating a BlobTree currently yields values (not key-value pairs). This has some benefits, for example when broadcasting a processing step across the files in a directory.

  • Property access
    • isdir(), isfile() - determine whether a child of tree is a directory or file.
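
To make the dictionary-like shape concrete, here is a minimal, self-contained sketch of a toy in-memory tree that follows the same conventions (string paths split on /, iteration over values). ToyTree, components and addfile! are hypothetical names invented for this illustration; this is not the BlobTree implementation in this PR, and it only mirrors the Base methods listed above.

```julia
# Toy in-memory tree. NOT the DataSets.jl BlobTree; all names here are hypothetical.
struct ToyTree
    children::Dict{String,Any}   # name => ToyTree (directory) or String (file content)
end
ToyTree() = ToyTree(Dict{String,Any}())

# Split a string path on '/' into its components.
components(path::AbstractString) = String.(split(path, '/'; keepempty=false))

# tree[path]: walk down one component at a time.
function Base.getindex(tree::ToyTree, path::AbstractString)
    node = tree
    for name in components(path)
        node = node.children[name]
    end
    return node
end

# haskey(tree, path): true only if every component exists.
function Base.haskey(tree::ToyTree, path::AbstractString)
    node = tree
    for name in components(path)
        (node isa ToyTree && haskey(node.children, name)) || return false
        node = node.children[name]
    end
    return true
end

# keys/pairs list the immediate children, sorted by name.
Base.keys(tree::ToyTree) = sort!(collect(keys(tree.children)))
Base.pairs(tree::ToyTree) = [k => tree.children[k] for k in keys(tree)]

# Iteration yields values (children), not key => value pairs.
Base.length(tree::ToyTree) = length(tree.children)
function Base.iterate(tree::ToyTree, state=1)
    ks = keys(tree)
    return state > length(ks) ? nothing : (tree.children[ks[state]], state + 1)
end

# Hypothetical helper: add a file, creating intermediate directories as needed.
function addfile!(tree::ToyTree, path::AbstractString, content::AbstractString)
    names = components(path)
    node = tree
    for name in names[1:end-1]
        node = get!(() -> ToyTree(), node.children, name)
    end
    node.children[names[end]] = String(content)
    return tree
end

# delete!(tree, path): remove the entry from its parent directory.
function Base.delete!(tree::ToyTree, path::AbstractString)
    names = components(path)
    parent = tree[join(names[1:end-1], '/')]
    delete!(parent.children, names[end])
    return tree
end

tree = ToyTree()
addfile!(tree, "1/a.txt", "Content of a")
addfile!(tree, "b-1.txt", "Content of b")

keys(tree)                             # ["1", "b-1.txt"]
tree["1/a.txt"]                        # "Content of a"
isdirflags = [v isa ToyTree for v in tree]   # iteration visits values
```

Iterating values rather than pairs is what makes the comprehension in the last line read naturally; with Dict-style iteration each element would be a name => child pair instead.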

Example

You can create a new temporary BlobTree via the newdir() function and fill it with combinations of newfile() or newdir():

julia> dir = newdir()
       for i = 1:3
           newfile(dir, "$i/a.txt") do io
               println(io, "Content of a")
           end
           newfile(dir, "b-$i.txt") do io
               println(io, "Content of b")
           end
       end
       dir
📂 Tree  @ /tmp/jl_Sp6wMF
 📁 1
 📁 2
 📁 3
 📄 b-1.txt
 📄 b-2.txt
 📄 b-3.txt

You can also get access to a BlobTree by using DataSets.from_path() with a
local directory name. For example:

julia> using DataSets
       open(DataSets.from_path(joinpath(pkgdir(DataSets), "src")))
📂 Tree  @ ~/.julia/dev/DataSets/src
 📄 DataSet.jl
 📄 DataSets.jl
 📄 DataTomlStorage.jl
 ...

AbstractDataProject interface additions

To support CRUD of datasets (#31) within data projects, the data driver interface needs much more flexibility. I've added:

  • DataSets.create() to create datasets. This still needs some refinement, in particular the keyword parameters.
  • Base.setindex!() to add a dataset to a project
  • DataSets.delete() to delete datasets
  • Implementations for StackedDataProject, AbstractTOMLDataProject and TOMLDataProject

Relatedly, I've added DataSets.from_path() to create a standalone DataSet from data on the local filesystem, inferring the type as Blob or BlobTree. This can be passed as a source to create() to make a copy.
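
As a rough illustration of the create / setindex! / delete shape described above, here is a self-contained toy model. ToyProject, ToyDataSet and the signatures of create and delete below are assumptions made up for this sketch; they are not the DataSets.jl types or the exact signatures in this PR, and the keyword handling in particular is a guess.

```julia
# Hypothetical stand-ins. NOT the DataSets.jl types or signatures.
struct ToyDataSet
    name::String
    conf::Dict{String,Any}   # arbitrary metadata
end

struct ToyProject
    datasets::Dict{String,ToyDataSet}
end
ToyProject() = ToyProject(Dict{String,ToyDataSet}())

# Create: build a new dataset from keyword metadata and register it by name.
function create(proj::ToyProject, name::AbstractString; metadata...)
    haskey(proj.datasets, name) && error("dataset $name already exists")
    conf = Dict{String,Any}(String(k) => v for (k, v) in metadata)
    ds = ToyDataSet(String(name), conf)
    proj.datasets[ds.name] = ds
    return ds
end

# Read: look up a dataset by name.
Base.getindex(proj::ToyProject, name::AbstractString) = proj.datasets[name]
Base.haskey(proj::ToyProject, name::AbstractString) = haskey(proj.datasets, name)

# Add an existing dataset to the project: proj[name] = dataset
function Base.setindex!(proj::ToyProject, ds::ToyDataSet, name::AbstractString)
    proj.datasets[name] = ds
    return proj
end

# Delete: remove a dataset from the project.
function delete(proj::ToyProject, name::AbstractString)
    delete!(proj.datasets, name)
    return proj
end

proj = ToyProject()
create(proj, "a"; description="first dataset")
proj["b"] = ToyDataSet("b", Dict{String,Any}())
delete(proj, "a")
remaining = sort(collect(keys(proj.datasets)))   # ["b"]
```

A real data project would also have to create or remove the underlying storage when datasets are created or deleted, which this toy ignores.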

Still TODO here is DataSets.config (or some such) to update the metadata of a DataSet. Alternatively, the dataset could know its owning data project and call back into that when it's updated.

Low level AbstractDataDriver interface

The low level driver interface is currently (in 0.2.6) just a function taking a user-defined callback.

However, to support CRUD operations for DataProject it needs to be expanded quite a bit, in particular so that storage can be created and deleted in the storage backend. This PR adds AbstractDataDriver and, so far, a single implementation, FileSystemDriver, with implementations of

  • open_dataset to do what the current function-based API does
  • close_dataset to clean up any dataset resources, also indicating whether the close happened due to an exception.
  • create_storage to initialize storage
  • delete_storage to remove storage

This interface is probably still a bit half-baked and needs some refinement.
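
For readers who want a feel for the four operations, the sketch below shows one possible shape, using a temporary directory as the backend. ToyDriver, TempDirDriver, storage_path, with_dataset and every signature here are hypothetical choices for illustration, not the AbstractDataDriver API in this PR. The with_dataset wrapper shows how a callback-style API could be layered on top of open_dataset and close_dataset.

```julia
# Hypothetical sketch. Signatures are assumptions, NOT this PR's actual driver API.
abstract type ToyDriver end

struct TempDirDriver <: ToyDriver
    root::String    # each dataset gets a subdirectory under this root
end

storage_path(d::TempDirDriver, name::AbstractString) = joinpath(d.root, name)

# Initialize and remove backing storage.
create_storage(d::TempDirDriver, name) = mkpath(storage_path(d, name))
delete_storage(d::TempDirDriver, name) =
    rm(storage_path(d, name); recursive=true, force=true)

# Open returns a lightweight handle; a real driver would hold more state.
open_dataset(d::TempDirDriver, name; write::Bool=false) =
    (path=storage_path(d, name), writable=write)

# Close is told whether it runs because of an exception.
function close_dataset(d::TempDirDriver, handle, exc=nothing)
    # A real driver might commit or roll back here.
    return exc === nothing ? :committed : :aborted
end

# Callback-style wrapper built from open_dataset and close_dataset.
function with_dataset(f, d::ToyDriver, name; write::Bool=false)
    handle = open_dataset(d, name; write=write)
    result = try
        f(handle)
    catch exc
        close_dataset(d, handle, exc)
        rethrow()
    end
    close_dataset(d, handle)
    return result
end

driver = TempDirDriver(mktempdir())
create_storage(driver, "demo")
with_dataset(driver, "demo"; write=true) do h
    write(joinpath(h.path, "a.txt"), "Content of a")
end
content = with_dataset(h -> read(joinpath(h.path, "a.txt"), String), driver, "demo")
delete_storage(driver, "demo")
```

Passing the exception to close_dataset lets a backend decide whether to keep or discard partial writes, which is the reason the close step needs to know how it was reached.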

@c42f c42f force-pushed the cjf/dataset-mutation branch from 144353b to 604dc8b Compare April 28, 2022 06:39
@c42f c42f added this to the 1.0 milestone Apr 28, 2022
@c42f c42f force-pushed the cjf/dataset-mutation branch from 604dc8b to 85ee4c0 Compare April 28, 2022 07:38
I had the TomlDataStorage struct inside DataTomlStorage.jl ??
@c42f c42f mentioned this pull request May 6, 2022
15 tasks
@c42f c42f force-pushed the cjf/dataset-mutation branch from 85ee4c0 to 6bf32a2 Compare May 23, 2022 05:26
c42f added 2 commits May 27, 2022 15:44
* For data projects:
  - create() to create datasets
  - setindex!() to add existing datasets
  - delete() to delete datasets
  - Implementations for StackedDataProject, AbstractTOMLDataProject and TOMLDataProject

* Concrete save_project() API to persist a DataProject to a file as TOML

* For storage drivers:
  - AbstractDataDriver and implementation for FileSystemDriver
  - open_dataset to do what the current function-based API does
  - create_storage to initialize storage
  - delete_storage to remove storage
  - These ideas seem a bit half-baked

* Refactoring open() to add write=true keyword
@mortenpi

I'll go ahead and close this PR, since I don't think we'll merge it. But the branch and discussion will stay around for future reference.

@mortenpi mortenpi closed this Nov 30, 2023