CRUD updates, dataset mutation and BlobTree API updates #38

c42f · 2022-04-26T11:05:18Z

This is a big batch of changes, implementing:

- A big rewrite of the BlobTree API to make it simpler and more coherent
- Opening a dataset for mutation with open(write=true)
- A CRUD interface for modifying DataProject
- Changes to the data driver interface to support all of this

A lot of these changes are intertwined so I've put all this here as a draft, but I'll probably need to break this apart into separate PRs.

BlobTree

BlobTree now has a largely dictionary-like interface:

- List keys (i.e., file and directory names): keys(tree)
- List keys and values: pairs(tree)
- Query keys: haskey(tree, path)
- Traverse the tree: tree[path]
- Add new content: newdir(tree, path), newfile(tree, path)
- Delete content: delete!(tree, path)

Here path is either a RelPath (the relative path type) or an AbstractString, in which case it is split on / to become a relative path.

Unlike Dict, iterating a BlobTree currently yields values, not key-value pairs. This has some benefits: for example, it allows broadcasting a processing function across the files in a directory.
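To make the dictionary-like behaviour concrete, here is a minimal sketch using a toy type. ToyTree and splitpath_rel are hypothetical names invented for this illustration; this is not the BlobTree implementation, it only mimics the described semantics (string paths split on /, iteration over values).

```julia
# Toy stand-in for a dictionary-like tree; NOT the DataSets.jl implementation.
struct ToyTree
    children::Dict{String,Any}   # name => ToyTree (directory) or String (file content)
end

# An AbstractString path is split on '/' into its components.
splitpath_rel(path::AbstractString) = String.(split(path, '/'; keepempty=false))

# Traverse the tree one path component at a time.
function Base.getindex(tree::ToyTree, path::AbstractString)
    node = tree
    for name in splitpath_rel(path)
        node = node.children[name]
    end
    return node
end

# Query whether every component of the path exists.
function Base.haskey(tree::ToyTree, path::AbstractString)
    node = tree
    for name in splitpath_rel(path)
        (node isa ToyTree && haskey(node.children, name)) || return false
        node = node.children[name]
    end
    return true
end

# Keys are the child names, sorted for a stable order.
Base.keys(tree::ToyTree) = sort!(collect(keys(tree.children)))
Base.length(tree::ToyTree) = length(tree.children)

# Iteration yields the values (children) in key order, not key => value pairs.
function Base.iterate(tree::ToyTree, state=(keys(tree), 1))
    names, i = state
    i > length(names) && return nothing
    return tree.children[names[i]], (names, i + 1)
end

tree = ToyTree(Dict{String,Any}(
    "a.txt" => "hello",
    "sub"   => ToyTree(Dict{String,Any}("b.txt" => "nested")),
))

top_level = keys(tree)            # names at the top level
nested    = tree["sub/b.txt"]     # traverse with a '/'-separated string

# Because iteration yields values, a function can be broadcast over the children.
kinds = (x -> x isa ToyTree ? :dir : :file).(tree)
```

With a Dict, the same broadcast would receive key => value pairs and would have to unpack each one first.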

Property access
- isdir(), isfile() - determine whether a child of tree is a directory or file.

Example

You can create a new temporary BlobTree via the newdir() function and fill it using combinations of newfile() and newdir():

julia> dir = newdir()
       for i = 1:3
           newfile(dir, "$i/a.txt") do io
               println(io, "Content of a")
           end
           newfile(dir, "b-$i.txt") do io
               println(io, "Content of b")
           end
       end
       dir
📂 Tree  @ /tmp/jl_Sp6wMF
 📁 1
 📁 2
 📁 3
 📄 b-1.txt
 📄 b-2.txt
 📄 b-3.txt

You can also get access to a BlobTree by using DataSets.from_path() with a
local directory name. For example:

julia> using DataSets
       open(DataSets.from_path(joinpath(pkgdir(DataSets), "src")))
📂 Tree  @ ~/.julia/dev/DataSets/src
 📄 DataSet.jl
 📄 DataSets.jl
 📄 DataTomlStorage.jl
 ...

AbstractDataProject interface additions

To support CRUD of datasets (#31) within data projects, the data driver interface needs much more flexibility. I've added:

- DataSets.create() to create datasets (still needs some refinement, in particular the keyword parameters)
- Base.setindex!() to add a dataset to a project
- DataSets.delete() to delete datasets
- Implementations for StackedDataProject, AbstractTOMLDataProject and TOMLDataProject
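As a rough illustration of how these three operations fit together, here is a self-contained toy. ToyDataSet, ToyProject, create! and delete_dataset! are hypothetical names made up for this sketch; the actual signatures of DataSets.create() and DataSets.delete() are not shown in this description and may differ.

```julia
# Toy types for illustration only; not the DataSets.jl definitions.
struct ToyDataSet
    name::String
    description::String
end

struct ToyProject
    datasets::Dict{String,ToyDataSet}
end
ToyProject() = ToyProject(Dict{String,ToyDataSet}())

Base.haskey(p::ToyProject, name::AbstractString) = haskey(p.datasets, name)
Base.getindex(p::ToyProject, name::AbstractString) = p.datasets[name]

# Add an existing dataset object to the project under a name.
function Base.setindex!(p::ToyProject, ds::ToyDataSet, name::AbstractString)
    p.datasets[name] = ds
    return p
end

# Create a new dataset and register it, refusing to overwrite an existing one.
function create!(p::ToyProject, name::AbstractString; description::AbstractString="")
    haskey(p, name) && throw(ArgumentError("dataset \"$name\" already exists"))
    ds = ToyDataSet(name, description)
    p[name] = ds
    return ds
end

# Remove the dataset from the project.
function delete_dataset!(p::ToyProject, name::AbstractString)
    delete!(p.datasets, name)
    return p
end

project = ToyProject()
create!(project, "raw"; description="unprocessed input")
project["copy"] = ToyDataSet("copy", "added via setindex!")
delete_dataset!(project, "raw")
remaining = sort!(collect(keys(project.datasets)))
```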

Relatedly, I've added DataSets.from_path() to create a standalone DataSet from data on the local filesystem, inferring the type as Blob or BlobTree. This can be passed as a source to create() to make a copy.
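The type inference can be pictured roughly as follows. infer_dataset_kind is a hypothetical helper written for this sketch, and the symbols it returns are placeholders; how DataSets.from_path() actually decides is an assumption here.

```julia
# Hypothetical helper; DataSets.from_path() itself may work differently.
function infer_dataset_kind(path::AbstractString)
    if isdir(path)
        return :BlobTree      # a directory maps to a tree of blobs
    elseif isfile(path)
        return :Blob          # a single file maps to a blob
    else
        throw(ArgumentError("no file or directory at $path"))
    end
end

# Try it on a temporary directory containing one file.
dir  = mktempdir()
file = joinpath(dir, "data.csv")
write(file, "x,y\n1,2\n")

kind_of_dir  = infer_dataset_kind(dir)
kind_of_file = infer_dataset_kind(file)
```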

Still TODO here is DataSets.config (or some such) to update the metadata of a DataSet (alternatively — have the dataset know its owning data project and call back into that when it's updated?)

Low level `AbstractDataDriver` interface

The low level driver interface is currently (in 0.2.6) just a function taking a user-defined callback.

However, to support CRUD operations for DataProject it needs to be expanded quite a bit, in particular to be able to create and delete storage in the storage backend. This PR adds AbstractDataDriver and, so far, a single implementation, FileSystemDriver, with implementations of:

- open_dataset to do what the current function-based API does
- close_dataset to clean up any dataset resources, also indicating whether the close happened due to an exception
- create_storage to initialize storage
- delete_storage to remove storage

This interface is probably still a bit half-baked and needs some refinement.
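For readers trying to picture the shape of such an interface, here is a minimal in-memory sketch. Only the four function names come from the list above; ToyDataDriver, InMemoryDriver, with_dataset and every argument list and keyword are assumptions made for illustration, not the definitions in this PR.

```julia
# All types and signatures here are illustrative assumptions.
abstract type ToyDataDriver end

# A driver that keeps its "storage" in a dictionary of byte vectors.
struct InMemoryDriver <: ToyDataDriver
    store::Dict{String,Vector{UInt8}}
end
InMemoryDriver() = InMemoryDriver(Dict{String,Vector{UInt8}}())

create_storage(d::InMemoryDriver, name::String) = (d.store[name] = UInt8[]; nothing)
delete_storage(d::InMemoryDriver, name::String) = (delete!(d.store, name); nothing)

# Reading gets a buffer over a copy of the stored bytes; writing gets an empty buffer.
function open_dataset(d::InMemoryDriver, name::String; write::Bool=false)
    haskey(d.store, name) || throw(KeyError(name))
    return write ? IOBuffer() : IOBuffer(copy(d.store[name]))
end

# Commit written bytes only when the caller finished without an exception.
function close_dataset(d::InMemoryDriver, name::String, io::IOBuffer;
                       write::Bool=false, exception=nothing)
    if write && exception === nothing
        d.store[name] = take!(io)
    end
    close(io)
    return nothing
end

# Helper tying open and close together, in the style of open(f, ...).
function with_dataset(f, d::ToyDataDriver, name::String; write::Bool=false)
    io = open_dataset(d, name; write=write)
    result = try
        f(io)
    catch err
        close_dataset(d, name, io; write=write, exception=err)
        rethrow()
    end
    close_dataset(d, name, io; write=write)
    return result
end

driver = InMemoryDriver()
create_storage(driver, "greeting")
with_dataset(driver, "greeting"; write=true) do io
    print(io, "hello")
end
content = with_dataset(io -> read(io, String), driver, "greeting")

# A failed write leaves the stored bytes untouched.
try
    with_dataset(driver, "greeting"; write=true) do io
        print(io, "partial")
        error("simulated failure")
    end
catch
end
after_failure = with_dataset(io -> read(io, String), driver, "greeting")
delete_storage(driver, "greeting")
```

Passing the exception to close_dataset is what lets a backend decide between committing and discarding a partial write.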


* For data projects:
  - create() to create datasets
  - setindex!() to add existing datasets
  - delete() to delete datasets
  - Implementations for StackedDataProject, AbstractTOMLDataProject and TOMLDataProject
* Concrete save_project() API to persist a DataProject to a file as TOML
* For storage drivers:
  - AbstractDataDriver and implementation for FileSystemDriver
  - open_dataset to do what the current function-based API does
  - create_storage to initialize storage
  - delete_storage to remove storage
  - These ideas seem a bit half-baked
* Refactoring open() to add write=true keyword

mortenpi · 2023-11-30T09:29:20Z

I'll go ahead and close this PR, since I don't think we'll merge it. But the branch and discussion will stay around for future reference.

c42f force-pushed the cjf/dataset-mutation branch from 144353b to 604dc8b Compare April 28, 2022 06:39

c42f added this to the 1.0 milestone Apr 28, 2022

c42f force-pushed the cjf/dataset-mutation branch from 604dc8b to 85ee4c0 Compare April 28, 2022 07:38

Fix file naming inconsistency

2218d0d

I had the TomlDataStorage struct inside DataTomlStorage.jl ??

c42f mentioned this pull request May 6, 2022

The road to DataSets 1.0 #43

Open

15 tasks

c42f force-pushed the cjf/dataset-mutation branch from 85ee4c0 to 6bf32a2 Compare May 23, 2022 05:26

c42f added 2 commits May 27, 2022 15:44

Add copy / create / delete to data REPL

a73f278

c42f force-pushed the cjf/dataset-mutation branch from 6bf32a2 to a73f278 Compare May 27, 2022 05:44

mortenpi mentioned this pull request Nov 10, 2022

Concept of managed datasets for create/update/delete #56

Open

mortenpi closed this Nov 30, 2023
