```julia
# Tail of a longer example; `git_tree` is a writable tree handle opened by an enclosing block not shown here.
    open(joinpath(git_tree, "some_blob.txt"), write=true) do io
        write(io, "hi")
    end
end
```

There are at least two quite different use patterns for versioning (a small sketch contrasting them follows this list):

* Batch update: the entire dataset is rewritten; a bit like `open(filename, write=true, read=false)`. Your classic batch-mode application would work in this mode. You'd also want this when applying updates to the algorithm.
* Incremental update: some data is incrementally added to or removed from the dataset; a bit like `open(filename, read=true, write=true)`. You'd want this pattern to support differential dataflow: the upstream input dataset(s) have a diff applied, the dataflow system infers how this propagates, and the resulting patch is applied to the output datasets.
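To make the file analogy concrete, here is a minimal sketch of the two patterns using plain files and Base `open`. The helpers `process` and `merge_diff` and the file names are placeholders invented for the sketch; a versioned dataset store would sit where the files are.

```julia
# Placeholder helpers standing in for real processing logic.
process(text) = uppercase.(split(text, '\n'; keepempty=false))
merge_diff(existing, diff) = existing * diff

# Batch update: the whole output is rewritten from scratch,
# analogous to open(filename, write=true, read=false).
open("output.txt", write=true) do io
    for record in process(read("input.txt", String))
        println(io, record)
    end
end

# Incremental update: read the existing output and patch it in place,
# analogous to open(filename, read=true, write=true).
open("output.txt", read=true, write=true) do io
    existing = read(io, String)
    patched = merge_diff(existing, read("input_diff.txt", String))
    seekstart(io)
    write(io, patched)
    truncate(io, position(io))  # drop leftover bytes if the patched data is shorter
end
```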
### Provenance: What is this data? What was I thinking?

Working with historical data can be confusing and error-prone because the origin of that data may look like this:

[![xkcd 1838](https://imgs.xkcd.com/comics/machine_learning.png)](https://xkcd.com/1838)

The solution is to systematically record how data came to be, including input parameters and code version. This *data provenance* information comes from your activity as encoded in a possibly-interactive program, but it must be stored alongside the data.

A full metadata system for data provenance is out of scope for DataSets.jl; it's a big project in its own right. But I think we should arrange the data lifecycle so that provenance can be hooked in easily by providing:

* *Data lifecycle events* which can be used to trigger the generation and storage of provenance metadata.
* A standard entry point to user code, which makes output datasets aware of input datasets.

Some interesting links about provenance metadata:

* The talk *Intro to PROV* by Nicholas Car: https://www.youtube.com/watch?v=elPcKqWoOPg
* The PROV primer: https://www.w3.org/TR/2013/NOTE-prov-primer-20130430/#introduction
* https://www.ands.org.au/working-with-data/publishing-and-reusing-data/data-provenance

## Data Models

The Data Model is the abstraction which the dataset user interacts with. In general it can be provided by arbitrary Julia code from an arbitrary module, so we'll need a way to map the `DataSet` into the code which exposes the data model.

Examples, including some storage formats which a data model might overlay:

* Path-indexed tree-like data (filesystem, Git, S3, Zip, HDF5)
* Arrays (raw, HDF5+path, .npy, many image formats, geospatial rasters on WMTS)
* Blobs (unstructured vectors of bytes)
* Tables (CSV, TSV, Parquet)
* Julia objects (JLD / JLD2 / `serialize` output)

### Distributed and incremental processing

For distributed or incremental processing of large data, it **must be possible to load data lazily and in parallel**: no single node in the computation should need the whole dataset to be locally accessible.

Not every data model can support efficient parallel processing, but for those that do the following concepts seem important (a rough sketch of how they might fit together follows this list):

* *keys* - the user works in terms of keys, e.g. the indices of an array, the elements of a set, etc.
* *indices* - allow data to be looked up via the keys, quickly.
* *partitions* - large datasets must be partitioned across machines (distributed processing) or across time (incremental processing with lazy loading). The user may not want to know about this, but the scheduler does.

To be clear, DataSets largely doesn't provide these things itself; they are up to implementations of particular data models. But the data lifecycle should be designed to efficiently support distributed computation.
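As a rough illustration only (none of these names come from DataSets.jl), here is one hypothetical way a data model could expose keys, an index, and partitions so that a scheduler can process each partition independently:

```julia
# Hypothetical partitioned data model, invented for this sketch.
struct PartitionedVectors
    partitions::Vector{Vector{Float64}}  # each partition could live on a different machine
end

# keys: what the user addresses data by (here, a global integer index).
Base.keys(d::PartitionedVectors) = 1:sum(length, d.partitions)

# index: map a key to (partition number, offset) without touching the bulk data.
function locate(d::PartitionedVectors, key::Integer)
    for (p, part) in enumerate(d.partitions)
        key <= length(part) && return (p, key)
        key -= length(part)
    end
    throw(KeyError(key))
end

# partitions: the unit of work a scheduler hands to each worker.
partition_ids(d::PartitionedVectors) = eachindex(d.partitions)
load_partition(d::PartitionedVectors, p::Integer) = d.partitions[p]

# A scheduler could then reduce over partitions independently:
d = PartitionedVectors([[1.0, 2.0], [3.0, 4.0, 5.0]])
partial_sums = [sum(load_partition(d, p)) for p in partition_ids(d)]
total = sum(partial_sums)  # 15.0
```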
### Tree-indexed data

This is one particular data model which I've tackled as a first use case, since a "hierarchical tree of data" is so common. Examples are:

* The filesystem - see `DataSets.FileTree`
* git - see `DataSets.GitTree`
* Zip files - see `ZipFileTree`
* S3
* HDF5

But a well-defined path tree abstraction doesn't already exist, so I've been prototyping some things in this package. (See also FileTrees.jl, a very recent package tackling similar things.)

#### Paths and Roots

What is a **tree root** object? It's a location for a data resource, including enough information to open that resource. It's the thing which handles the data lifecycle events for the whole tree.

What is a **relative path**, in general? It's a *key* into a hierarchical tree-structured data store, consisting of several path *components* (an array of strings); see the sketch after the iteration notes below.

#### Iteration

* Fundamentally about iteration over tree nodes.
* Iteration over a tree yields a list of children. Children may be:
  * Another tree; `isdir(child) == true`
  * Leaf data
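To make both ideas concrete, here is a minimal hypothetical sketch (not the DataSets.jl API; the type names are invented) of a tree whose keys are relative paths built from string components, and whose iteration yields children that are either subtrees or leaf data:

```julia
# Hypothetical toy types for illustration only.
struct RelPath
    components::Vector{String}  # a relative path is just a key, e.g. ["data", "blob.txt"]
end

struct ToyTree
    root::String  # the "tree root": enough information to open the resource
    children::Dict{String,Union{ToyTree,Vector{UInt8}}}  # subtree or leaf data (bytes)
end

is_tree(x) = x isa ToyTree  # analogous to isdir(child) == true

# Look up a node by its key (relative path), one component at a time.
function Base.getindex(t::ToyTree, p::RelPath)
    node = t
    for c in p.components
        node = node.children[c]
    end
    return node
end

# Iteration over a tree yields its children as name => node pairs.
Base.iterate(t::ToyTree, state...) = iterate(t.children, state...)
Base.length(t::ToyTree) = length(t.children)

# A tiny in-memory example.
tree = ToyTree("memory://example",
               Dict("README" => Vector{UInt8}("top-level leaf"),
                    "data" => ToyTree("memory://example/data",
                                      Dict("blob.txt" => Vector{UInt8}("hi")))))

tree[RelPath(["data", "blob.txt"])]                   # => the leaf bytes for "hi"
[name for (name, child) in tree if is_tree(child)]    # => ["data"]
```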
# Interesting related projects

* [Pkg.Artifacts](https://julialang.github.io/Pkg.jl/v1/artifacts/) solves the problem of downloading "artifacts": immutable containers of content-addressed tree data. Designed for the needs of distributing compiled libraries as dependencies of Julia projects, but usable for any tree-structured data.
* [DataDeps.jl](https://github.com/oxinabox/DataDeps.jl) solves the data downloading problem for static remote data.
* [RemoteFiles.jl](https://github.com/helgee/RemoteFiles.jl) downloads files from the internet and keeps them updated.
* [pyarrow.dataset](https://arrow.apache.org/docs/python/dataset.html) is restricted to tabular data, but seems similar in spirit to DataSets.jl.
* [FileTrees.jl](http://shashi.biz/FileTrees.jl) provides tools for representing and processing tree-structured data lazily and in parallel.