From 88d9028130aecf557a0b64c4652ce55daf0e8bbe Mon Sep 17 00:00:00 2001
From: Oliver Strickson
Date: Thu, 26 Jan 2023 14:27:56 +0000
Subject: [PATCH 1/3] Add notes for SCIP on data loading

---
 docs/scip/0004-data-loading.md | 92 ++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)
 create mode 100644 docs/scip/0004-data-loading.md

diff --git a/docs/scip/0004-data-loading.md b/docs/scip/0004-data-loading.md
new file mode 100644
index 00000000..b79cdcb3
--- /dev/null
+++ b/docs/scip/0004-data-loading.md
@@ -0,0 +1,92 @@
+# Template for new SCIPs
+
+## Metadata
+
+Editor:
+  ...
+
+Status (raw | draft | stable | deprecated | retired):
+  raw
+
+## Description
+
+...
+
+## Requirements
+
+### What is included in the catalog entry for a datasource?
+
+- A URL to a remote location (as given below)
+  - The URL should be a browsable location, structured according to one of the supported 'data-sharing patterns' (see below)
+  -
+
+- An indication of additional scivision plugins required to load the data, if not (?)
+
+### Image formats
+
+- Built-in support for any common format (via a library, such as skimage)
+- Built-in support for formats common across scientific domains, not included in the above
+  - Whether to support a given format should be considered against the cost of the additional dependencies it requires, and the burden of these (e.g. something that makes core scivision less portable, or adds an extra installation step might be rejected, but a single python-only dependency considered acceptable)
+
+- A 'plugin' system for extending to additional formats
+
+#### Additional image formats
+
+Below is a list of additional image formats to consider for built-in support
+
+-
+-
+
+
+### Supported data services
+
+#### Notes
+
+- 'Core' scivision (without additional) should maintain support for several remote data is commonly archived.
+
+- Often locations are specified using URLs with a http/https scheme, but e.g. directory browsing is not supported by http, which limits the generality or usefulness of this approach.
+
+- One possibility that is supported by plain http is a direct link to 'archive' file system (e.g. a zip file containing one of the patterns below).
+
+- Examples consisting of a single image are supported for the same reason, but might not be particularly interesting
+
+- A single file containing some metadata for Intake or a scivision plugin is another possibility
+
+#### Particular services to support
+
+- Automatic support for single image files and archives
+- The URL of an Intake catalog
+- The URL of some data-plugin metadata
+- Zenodo
+- GitHub
+- ...
+
+#### Pull requests accepted
+
+ - Improve, updating, maintaining the existing supported services (e.g. fixing the library to work after an API change
+
+ - Adding support for other common remote locations (a test might be: are there two or more independent data sources in the catalog that)
+
+
+### Native support for common data sharing patterns
+
+#### Directory of image files
+
+#### Image + csv labels
+
+#### An Intake yaml file catalog
+
+#### A yaml file, with metadata for a custom data plugin
+
+
+## High-level software design
+
+For Scivision.Py
+
+- Consider using fsspec for handling remote locations (get archive support, variety of URL schemes)
+
+- Abstract base class for a `DataService`
+
+## Remaining questions
+
+...
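
Note on the "High-level software design" section of the patch above: the abstract base class for a `DataService` could be sketched roughly as follows. This is only an illustration under stated assumptions, not scivision's actual design. Apart from the name `DataService`, everything here is hypothetical: the methods `handles` and `list_images`, the two subclasses, the `find_service` helper and the suffix list. It uses only the standard library; fsspec, which the notes propose considering, might take over the listing step for archives and other URL schemes.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Illustrative suffix list only; the SCIP leaves the supported formats open.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


class DataService(ABC):
    """Hypothetical interface: one subclass per supported remote location."""

    @classmethod
    @abstractmethod
    def handles(cls, url: str) -> bool:
        """Return True if this service recognises the given URL."""

    @abstractmethod
    def list_images(self, url: str) -> list[str]:
        """Return the URLs of the images found at the given location."""


class SingleImageService(DataService):
    """A direct link to one image file, which plain http can serve."""

    @classmethod
    def handles(cls, url: str) -> bool:
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        return suffix in IMAGE_SUFFIXES

    def list_images(self, url: str) -> list[str]:
        return [url]


class GitHubService(DataService):
    """Placeholder for a hosted service; listing is deliberately left out."""

    @classmethod
    def handles(cls, url: str) -> bool:
        return urlparse(url).netloc == "github.com"

    def list_images(self, url: str) -> list[str]:
        raise NotImplementedError("would query the hosting service here")


# Order matters: the first service whose handles() returns True wins.
SERVICES = [SingleImageService, GitHubService]


def find_service(url: str) -> DataService:
    """Pick the first registered service that recognises the URL."""
    for service_cls in SERVICES:
        if service_cls.handles(url):
            return service_cls()
    raise LookupError(f"no data service registered for {url!r}")
```

With this shape, supporting another remote location would mean adding one subclass and registering it in `SERVICES`, which seems to fit the "Pull requests accepted" criteria in the notes.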
From 8afa9aebbb0196e3b282137b2d6c6d9e3e0e4ea9 Mon Sep 17 00:00:00 2001
From: ots22
Date: Thu, 26 Jan 2023 17:11:57 +0000
Subject: [PATCH 2/3] Update docs/scip/0004-data-loading.md

Co-authored-by: Ed Chalstrey
---
 docs/scip/0004-data-loading.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/scip/0004-data-loading.md b/docs/scip/0004-data-loading.md
index b79cdcb3..56734e88 100644
--- a/docs/scip/0004-data-loading.md
+++ b/docs/scip/0004-data-loading.md
@@ -29,7 +29,10 @@ Status (raw | draft | stable | deprecated | retired):
   - Whether to support a given format should be considered against the cost of the additional dependencies it requires, and the burden of these (e.g. something that makes core scivision less portable, or adds an extra installation step might be rejected, but a single python-only dependency considered acceptable)
 
 - A 'plugin' system for extending to additional formats
+### Loaded data formats
 
+- Lazy loading with `dask`/`xarray`
+- Simpler format such as `numpy` for smaller datasets when lazy load/ parallel computing not required
 #### Additional image formats
 
 Below is a list of additional image formats to consider for built-in support

From 957efa187502d8806e4e61bbf4036b68dd858ee3 Mon Sep 17 00:00:00 2001
From: ots22
Date: Thu, 26 Jan 2023 17:14:45 +0000
Subject: [PATCH 3/3] Update docs/scip/0004-data-loading.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Alejandro ©
---
 docs/scip/0004-data-loading.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/scip/0004-data-loading.md b/docs/scip/0004-data-loading.md
index 56734e88..db862c74 100644
--- a/docs/scip/0004-data-loading.md
+++ b/docs/scip/0004-data-loading.md
@@ -62,7 +62,7 @@ Below is a list of additional image formats to consider for built-in support
 - The URL of some data-plugin metadata
 - Zenodo
 - GitHub
-- ...
+- Support layers which might be useful for labelling and inspection of model outputs e.g. Points, Shapes, Surface, Tracks and Vectors
 
 #### Pull requests accepted
 