diff --git a/docs/user-guide/generic-providers.ipynb b/docs/user-guide/generic-providers.ipynb new file mode 100644 index 00000000..80cd0685 --- /dev/null +++ b/docs/user-guide/generic-providers.ipynb @@ -0,0 +1,444 @@ +{ + "cells": [ + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Generic Providers\n", + "\n", + "## Overview\n", + "\n", + "Sometimes we may want to replicate parts of a workflow and apply them to a different input.\n", + "In [Parameter Tables](parameter-tables.ipynb) we saw how we can use a parameter table for this purpose.\n", + "Another approach is to use [generics](https://mypy.readthedocs.io/en/stable/generics.html) to define a generic provider.\n", + "\n", + "While parameter tables are well suited for \"homogeneous\" tasks such as a set of images, generics can be a better choice for \"heterogeneous\" tasks where the computation steps are (nearly) identical, but the purpose of the computation is different.\n", + "For example, we may have a \"data\" image and a \"background\" image (from a sensor background measurement).\n", + "We want to perform some initial homogeneous operations on both, but ultimately reach a point where we want to subtract the background from the data.\n", + "\n", + "Generally speaking, using Sciline with generics can have several advantages:\n", + "\n", + "- Intuitive syntax for requesting computation of intermediate results.\n", + "- Specialized providers can be used for intermediate steps, instead of the necessarily identical providers imposed when working with parameter tables.\n", + "- Reusability of providers is maintained and synchronization points in workflows are avoided by combining parameter tables and generic providers.\n", + "\n", + "Before moving on to the next sections, where we will elaborate on these points, consider how generic providers can be used straightforwardly in a pipeline.\n", + "For example, we can set up a pipeline computing a list of any type (provided that there is a parameter 
of the type) in a very compact manner:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import TypeVar, List\n", + "import sciline\n", + "\n", + "T = TypeVar(\"T\")\n", + "\n", + "\n", + "def duplicate(x: T) -> List[T]:\n", + " \"\"\"A generic provider that can make any list.\"\"\"\n", + " return [x, x]\n", + "\n", + "\n", + "pipeline = sciline.Pipeline([duplicate], params={int: 1, float: 2.0, str: \"3\"})\n", + "\n", + "print(pipeline.compute(List[int]))\n", + "print(pipeline.compute(List[float]))\n", + "print(pipeline.compute(List[str]))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Counter example: Naive approach\n", + "\n", + "Starting with the model workflow introduced in [Getting Started](getting-started.ipynb), consider an extension where we also need to subtract a background signal from the data.\n", + "Naively, we could extend the example as follows, which is very verbose and error prone due to the duplication:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import NewType\n", + "import sciline\n", + "\n", + "_fake_filesytem = {\n", + " 'file102.txt': [1, 2, float('nan'), 3],\n", + " 'file103.txt': [1, 2, 3, 4],\n", + " 'file104.txt': [1, 2, 3, 4, 5],\n", + " 'file105.txt': [1, 2, 3],\n", + " 'background.txt': [0.1, 0.1],\n", + "}\n", + "\n", + "# 1. 
Define domain types\n", + "\n", + "Filename = NewType('Filename', str)\n", + "BackgroundFilename = NewType('BackgroundFilename', str)\n", + "RawData = NewType('RawData', dict)\n", + "RawBackground = NewType('RawBackground', dict)\n", + "CleanedData = NewType('CleanedData', list)\n", + "CleanedBackground = NewType('CleanedBackground', list)\n", + "ScaleFactor = NewType('ScaleFactor', float)\n", + "Background = NewType('Background', list)\n", + "BackgroundSubtractedData = NewType('BackgroundSubtractedData', list)\n", + "Result = NewType('Result', float)\n", + "\n", + "\n", + "# 2. Define providers\n", + "\n", + "\n", + "def _load(filename: str) -> dict:\n", + " data = _fake_filesytem[filename]\n", + " return {'data': data, 'meta': {'filename': filename}}\n", + "\n", + "\n", + "def load(filename: Filename) -> RawData:\n", + " \"\"\"Load the data from the filename.\"\"\"\n", + "\n", + " return RawData(_load(filename))\n", + "\n", + "\n", + "def load_background(filename: BackgroundFilename) -> RawBackground:\n", + " \"\"\"Load the background from the filename.\"\"\"\n", + " return RawBackground(_load(filename))\n", + "\n", + "\n", + "def _clean(raw_data: dict) -> list:\n", + " import math\n", + "\n", + " return [x for x in raw_data['data'] if not math.isnan(x)]\n", + "\n", + "\n", + "def clean(raw_data: RawData) -> CleanedData:\n", + " \"\"\"Clean the data, removing NaNs.\"\"\"\n", + " return CleanedData(_clean(raw_data))\n", + "\n", + "\n", + "def clean_background(raw_data: RawBackground) -> CleanedBackground:\n", + " \"\"\"Clean the background, removing NaNs.\"\"\"\n", + " return CleanedBackground(_clean(raw_data))\n", + "\n", + "\n", + "def subtract_background(\n", + " data: CleanedData, background: CleanedBackground\n", + ") -> BackgroundSubtractedData:\n", + " return BackgroundSubtractedData([x - sum(background) for x in data])\n", + "\n", + "\n", + "def process(data: BackgroundSubtractedData, param: ScaleFactor) -> Result:\n", + " \"\"\"Process the data, 
multiplying the sum by the scale factor.\"\"\"\n", + "    return Result(sum(data) * param)\n", + "\n", + "\n", + "# 3. Create pipeline\n", + "\n", + "providers = [\n", + "    load,\n", + "    load_background,\n", + "    clean,\n", + "    clean_background,\n", + "    process,\n", + "    subtract_background,\n", + "]\n", + "params = {\n", + "    ScaleFactor: 2.0,\n", + "    Filename: 'file102.txt',\n", + "    BackgroundFilename: 'background.txt',\n", + "}\n", + "pipeline = sciline.Pipeline(providers, params=params)\n", + "\n", + "print(f'Result={pipeline.compute(Result)}')\n", + "pipeline.visualize(Result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We would like to reuse the `load` and `clean` functions for the background, without having to add wrappers and without duplicating all the involved domain types (`BackgroundFilename`, `RawBackground`, and `CleanedBackground`).\n", + "However, we cannot do this directly, since `Filename`, `RawData`, and `CleanedData` are unique identifiers specific to the non-background files, and this uniqueness forms the foundation of how Sciline works.\n", + "\n", + "Sciline seeks to address this conundrum by providing a mechanism for using *generic providers* and for defining generic type aliases, introduced in the next section." 
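The root of the duplication is that `typing.NewType` aliases are distinct identifiers even when they wrap the same runtime type, so a provider annotated with one alias can never serve the other. A minimal sketch of this, in plain Python without Sciline (the `registry` dict is only a conceptual stand-in for how providers are looked up by type):

```python
from typing import NewType

# Two distinct domain types over the same underlying runtime type (str).
Filename = NewType('Filename', str)
BackgroundFilename = NewType('BackgroundFilename', str)

# At runtime a NewType is just an identity callable ...
assert Filename('file102.txt') == 'file102.txt'

# ... but the aliases themselves are distinct objects. A registry keyed
# by type therefore needs a separate entry -- and hence a separate
# provider -- for each alias.
registry = {Filename: 'load', BackgroundFilename: 'load_background'}
assert len(registry) == 2  # the two aliases do not collapse into str
```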
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generic domain types and providers\n", + "\n", + "To avoid duplicates of domain types and providers, we may instead define generic domain types and generic providers.\n", + "The example is then written as:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import NewType, TypeVar\n", + "import sciline\n", + "\n", + "_fake_filesytem = {\n", + " 'file102.txt': [1, 2, float('nan'), 3],\n", + " 'file103.txt': [1, 2, 3, 4],\n", + " 'file104.txt': [1, 2, 3, 4, 5],\n", + " 'file105.txt': [1, 2, 3],\n", + " 'background.txt': [0.1, 0.1],\n", + "}\n", + "\n", + "# 1. Define domain types\n", + "\n", + "# 1.a Define generic domain types\n", + "RunType = TypeVar('RunType')\n", + "\n", + "\n", + "class Filename(sciline.Scope[RunType, str], str):\n", + " ...\n", + "\n", + "\n", + "class RawData(sciline.Scope[RunType, dict], dict):\n", + " ...\n", + "\n", + "\n", + "class CleanedData(sciline.Scope[RunType, list], list):\n", + " ...\n", + "\n", + "\n", + "# 1.b Define concrete RunType values we will use.\n", + "Sample = NewType('Sample', int)\n", + "Background = NewType('Background', int)\n", + "\n", + "# 1.c Define normal domain types\n", + "ScaleFactor = NewType('ScaleFactor', float)\n", + "BackgroundSubtractedData = NewType('BackgroundSubtractedData', list)\n", + "Result = NewType('Result', float)\n", + "\n", + "\n", + "# 2. 
Define providers\n", + "\n", + "# 2.a Define generic providers\n", + "\n", + "\n", + "def load(filename: Filename[RunType]) -> RawData[RunType]:\n", + " \"\"\"Load the data from the filename.\"\"\"\n", + "\n", + " data = _fake_filesytem[filename]\n", + " return RawData[RunType]({'data': data, 'meta': {'filename': filename}})\n", + "\n", + "\n", + "def clean(raw_data: RawData[RunType]) -> CleanedData[RunType]:\n", + " \"\"\"Clean the data, removing NaNs.\"\"\"\n", + " import math\n", + "\n", + " return CleanedData[RunType]([x for x in raw_data['data'] if not math.isnan(x)])\n", + "\n", + "\n", + "# 2.b Define normal providers\n", + "def subtract_background(\n", + " data: CleanedData[Sample], background: CleanedData[Background]\n", + ") -> BackgroundSubtractedData:\n", + " return BackgroundSubtractedData([x - sum(background) for x in data])\n", + "\n", + "\n", + "def process(data: BackgroundSubtractedData, param: ScaleFactor) -> Result:\n", + " \"\"\"Process the data, multiplying the sum by the scale factor.\"\"\"\n", + " return Result(sum(data) * param)\n", + "\n", + "\n", + "# 3. Create pipeline\n", + "\n", + "providers = [load, clean, process, subtract_background]\n", + "params = {\n", + " ScaleFactor: 2.0,\n", + " Filename[Sample]: 'file102.txt',\n", + " Filename[Background]: 'background.txt',\n", + "}\n", + "pipeline = sciline.Pipeline(providers, params=params)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Apart from updated type annotations for `load` and `clean`, the code is nearly identical to the original example without background subtraction.\n", + "\n", + "
\n", + "\n", + "Note\n", + "\n", + "We use a peculiar-looking syntax for defining \"generic type aliases\".\n", + "We would love to use [typing.NewType](https://docs.python.org/3/library/typing.html#typing.NewType) for this, but it does not allow for definition of generic aliases.\n", + "The syntax we use (subclassing [sciline.Scope](../generated/classes/sciline.Scope.rst)) is a workaround for defining generic aliases that work both at runtime and with [mypy](https://mypy-lang.org/):\n", + "\n", + "```python\n", + "class Filename(sciline.Scope[RunType, str], str):\n", + " ...\n", + "```\n", + "\n", + "
\n", + "\n", + "We can get or compute the result as usual:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "graph = pipeline.get(Result)\n", + "print(f'Result={graph.compute()}')\n", + "graph.visualize()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this case, we could have achieved something similar to the above computation graph using the [Parameter Tables](parameter-tables.ipynb) feature.\n", + "In the next section we will go through the advantages of using generic providers.\n", + "\n", + "## Advantages of using generic providers\n", + "\n", + "### Computing intermediate results\n", + "\n", + "Generic domain types with named scopes make it simple to request computation of intermediate results with a clear notation:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "pipeline.compute(CleanedData[Sample])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Specialized providers\n", + "\n", + "We may wish to specialize a provider for specific values of a generic's type parameters.\n", + "For example, we may need to use distinct cleaning functions for `Sample` and `Background`.\n", + "We can do so simply by defining a specialized provider for each type:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def clean_data(raw_data: RawData[Sample]) -> CleanedData[Sample]:\n", + " import math\n", + "\n", + " return CleanedData[Sample]([x for x in raw_data['data'] if not math.isnan(x)])\n", + "\n", + "\n", + "def clean_background(raw_data: RawData[Background]) -> CleanedData[Background]:\n", + " return CleanedData[Background]([x for x in raw_data['data'] if not x < 0])\n", + "\n", + "\n", + "providers = [load, clean_data, clean_background, process, subtract_background]\n", + "pipeline = sciline.Pipeline(providers, 
params=params)\n", + "pipeline.visualize(Result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Generic providers and parameter tables\n", + "\n", + "As a more complex example of where generic providers are useful, we may add a parameter table so that we can process multiple samples:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "RunID = NewType('RunID', int)\n", + "run_ids = [102, 103, 104, 105]\n", + "filenames = [f'file{i}.txt' for i in run_ids]\n", + "param_table = sciline.ParamTable(RunID, {Filename[Sample]: filenames}, index=run_ids)\n", + "param_table" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now create a parametrized pipeline:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "params = {\n", + "    ScaleFactor: 2.0,\n", + "    Filename[Background]: 'background.txt',\n", + "}\n", + "pipeline = sciline.Pipeline(providers, params=params)\n", + "pipeline.set_param_table(param_table)\n", + "graph = pipeline.get(sciline.Series[RunID, Result])\n", + "graph.visualize()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "graph.compute()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Instead of using the generics mechanism, we could have added a row for the background to the parameter table.\n", + "However, the `subtract_background` function would then have to be modified to accept a `Series[RunID, CleanedData]`.\n", + "More importantly, this would have resulted in a synchronization point in the computation graph, preventing efficient scheduling of the subsequent computation, with potentially disastrous effects on memory consumption." 
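As a sanity check on the per-run values that `graph.compute()` assembles, the same dataflow can be traced in plain Python. This is only a sketch: it reuses the fake file contents and mirrors the specialized providers defined above, but bypasses Sciline's graph construction and scheduling entirely.

```python
import math

# Same fake file contents as used throughout this notebook.
_fake_filesystem = {
    'file102.txt': [1, 2, float('nan'), 3],
    'file103.txt': [1, 2, 3, 4],
    'file104.txt': [1, 2, 3, 4, 5],
    'file105.txt': [1, 2, 3],
    'background.txt': [0.1, 0.1],
}

def load(filename):
    return {'data': _fake_filesystem[filename], 'meta': {'filename': filename}}

def clean_data(raw):
    # Sample cleaning: drop NaNs.
    return [x for x in raw['data'] if not math.isnan(x)]

def clean_background(raw):
    # Background cleaning: drop negative values.
    return [x for x in raw['data'] if not x < 0]

scale_factor = 2.0
background = clean_background(load('background.txt'))

# One Result per row of the parameter table.
results = {}
for run_id in [102, 103, 104, 105]:
    data = clean_data(load(f'file{run_id}.txt'))
    subtracted = [x - sum(background) for x in data]
    results[run_id] = sum(subtracted) * scale_factor
```

Note that the per-run branches here are independent up to the final loop body, which is what lets Sciline schedule them without the synchronization point discussed above.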
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "base", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/docs/user-guide/index.md b/docs/user-guide/index.md index 3baa2628..40df22dd 100644 --- a/docs/user-guide/index.md +++ b/docs/user-guide/index.md @@ -7,4 +7,5 @@ maxdepth: 2 getting-started parameter-tables +generic-providers ```