diff --git a/previews/PR71/design/index.html b/previews/PR71/design/index.html
index 0bee0f8..5d4e4b3 100644
--- a/previews/PR71/design/index.html
+++ b/previews/PR71/design/index.html
@@ -4,4 +4,4 @@
 open(joinpath(git_tree, "some_blob.txt"), write=true) do io
     write(io, "hi")
 end
-end
There are at least two quite different use patterns for versioning:
open(filename, write=true, read=false). Your classic batch-mode application would function in this mode. You'd also want this when applying updates to the algorithm.
open(filename, read=true, write=true). You'd want to use this pattern to support differential dataflow: the upstream input dataset(s) have a diff applied; the dataflow system infers how this propagates, with the resulting patch applied to the output datasets.
Working with historical data can be confusing and error prone because the origin of that data may look like this:
The solution is to systematically record how data came to be, including input parameters and code version. This data provenance information comes from your activity as encoded in a possibly-interactive program, but must be stored alongside the data.
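As a rough illustration of storing provenance alongside the data, the sketch below writes a small TOML sidecar file next to each output. The helper name `write_with_provenance`, the `.provenance.toml` suffix, and the recorded fields are assumptions made for this example; none of them are part of DataSets.jl.

```julia
using Dates, TOML

# Hypothetical helper: write `data` to `path`, plus a provenance record beside it.
function write_with_provenance(path::AbstractString, data;
                               params::AbstractDict, code_version::AbstractString)
    write(path, data)
    provenance = Dict(
        "created"      => string(now()),   # when the data was produced
        "code_version" => code_version,    # e.g. a git commit hash (assumed convention)
        "params"       => params,          # input parameters of the run
    )
    # The sidecar file name is an assumed convention, not a DataSets.jl feature.
    open(path * ".provenance.toml", "w") do io
        TOML.print(io, provenance)
    end
    return provenance
end
```

A real provenance system would capture this automatically from the running program rather than rely on the caller to pass it in; the sketch only shows the "stored alongside the data" part.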
A full metadata system for data provenance is out of scope for DataSets.jl — it's a big project in its own right. But I think we should arrange the data lifecycle so that provenance can be hooked in easily by providing:
Some interesting links about provenance metadata:
The Data Model is the abstraction which the dataset user interacts with. In general this can be provided by some arbitrary Julia code from an arbitrary module. We'll need a way to map the DataSet into the code which exposes the data model.
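One possible shape for such a mapping is a registry keyed by a type name found in the dataset's configuration. This is only a sketch: `MODEL_REGISTRY`, `register_data_model!`, `open_data_model`, and the `"type"` and `"path"` configuration keys are all hypothetical and do not describe how DataSets.jl actually does this.

```julia
# Hypothetical registry: data-model name => function that wraps a storage description.
const MODEL_REGISTRY = Dict{String,Function}()

function register_data_model!(name::AbstractString, wrap::Function)
    MODEL_REGISTRY[name] = wrap
    return nothing
end

# `config` stands in for a dataset's configuration; its keys are assumptions.
function open_data_model(config::AbstractDict)
    name = config["type"]
    wrap = get(MODEL_REGISTRY, name) do
        error("No data model registered for type \"$name\"")
    end
    return wrap(config)
end

# Arbitrary code from any module could register its own model in the same way.
register_data_model!("Blob", cfg -> read(cfg["path"]))          # raw bytes
register_data_model!("Text", cfg -> read(cfg["path"], String))  # decoded text
```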
Examples, including some example storage formats which the data model might overlay
serialize output)
For distributed or incremental processing of large data, it must be possible to load data lazily and in parallel: no single node in the computation should need the whole dataset to be locally accessible.
Not every data model can support efficient parallel processing. But for those that do, it seems that the following concepts are important:
To be clear, DataSets largely doesn't provide these things itself — these are up to implementations of particular data models. But the data lifecycle should be designed to efficiently support distributed computation.
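To make the lazy, parallel loading requirement concrete, here is a minimal sketch in which each task loads only its own partition and returns a small partial result. `load_chunk` is a hypothetical stand-in for reading one partition from storage, and the chunk size of 1000 is arbitrary; a real data model would define its own partitioning.

```julia
# Hypothetical lazy source: each call materializes only one partition of the data.
load_chunk(i::Integer) = collect(((i - 1) * 1000 + 1):(i * 1000))

# Each task loads and reduces its own chunk, so no task holds the whole dataset.
function chunked_sum(nchunks::Integer)
    tasks = [Threads.@spawn sum(load_chunk(i)) for i in 1:nchunks]
    return sum(fetch.(tasks))
end
```

The same structure carries over to multi-process or multi-node execution, where the tasks would run on separate workers and only the partial sums would travel back.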
This is one particular data model which I've tackled as a first use case, as a "hierarchical tree of data" is so common. Examples are
DataSets.FileTree
DataSets.GitTree
ZipFileTree
But no well-defined path tree abstraction already exists! So I've been prototyping some things in this package. (See also FileTrees.jl, a very recent package tackling similar things.)
What is a tree root object? It's a location for a data resource, including enough information to open that resource. It's the thing which handles the data lifecycle events on the whole tree.
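A minimal sketch of that idea, assuming a directory-backed tree: the root holds a location and owns the open and close events for the whole tree. `SketchTreeRoot` and its `events` log are hypothetical names used for illustration only and are not the types prototyped in this package.

```julia
# Hypothetical root object: a location plus a log of lifecycle events.
struct SketchTreeRoot
    location::String
    events::Vector{String}
end

# Scoped open: the root handles setup and teardown for the whole tree.
function Base.open(f::Function, root::SketchTreeRoot; write::Bool=false)
    write && mkpath(root.location)   # create the resource when opened for writing
    isdir(root.location) || error("No tree at $(root.location)")
    push!(root.events, "opened")
    try
        return f(root)
    finally
        # A real backend might commit, flush, or upload here.
        push!(root.events, "closed")
    end
end
```

With this in place, `open(root; write=true) do r ... end` scopes all work on the tree between the two events, even if the body throws.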
What is a relative path, in general? It's a key into a hierarchical tree-structured data store. This consists of several path components (an array of strings).
isdir(child) == true
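The idea of a relative path as an array of string components can be sketched with a nested `Dict` standing in for the tree-structured store. The names `SketchRelPath`, `lookup`, and `isdir_node` are hypothetical and chosen to avoid implying anything about the package's actual types.

```julia
# Hypothetical relative path: nothing more than an ordered list of components.
struct SketchRelPath
    components::Vector{String}
end

# Joining appends components; no filesystem access is involved.
Base.joinpath(p::SketchRelPath, parts::AbstractString...) =
    SketchRelPath(vcat(p.components, collect(String, parts)))

# Use the path as a key: descend one component at a time through nested Dicts.
function lookup(tree::AbstractDict, p::SketchRelPath)
    node = tree
    for c in p.components
        node = node[c]
    end
    return node
end

# In this toy store, a "directory" is any node that has children.
isdir_node(node) = node isa AbstractDict
```

Because the path is just data, the same key could index a filesystem, a git tree, or a zip archive, with only the root object knowing how to resolve it.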
This document was generated with Documenter.jl on Friday 16 February 2024. Using Julia version 1.6.7.