From fa65a57ddf66544286d5c39532dfb052c37ff9d2 Mon Sep 17 00:00:00 2001 From: Luuk van der Meer Date: Tue, 5 Mar 2024 17:56:21 +0100 Subject: [PATCH] docs: Add cache examples to processor notebook :books: --- demo/{ => extras}/cache_tests.ipynb | 9 +-- demo/files/logs.txt | 18 +++++ demo/processor.ipynb | 120 ++++++++++++++++++++++++---- 3 files changed, 125 insertions(+), 22 deletions(-) rename demo/{ => extras}/cache_tests.ipynb (99%) diff --git a/demo/cache_tests.ipynb b/demo/extras/cache_tests.ipynb similarity index 99% rename from demo/cache_tests.ipynb rename to demo/extras/cache_tests.ipynb index 3d94a747..bf790f9b 100644 --- a/demo/cache_tests.ipynb +++ b/demo/extras/cache_tests.ipynb @@ -168,7 +168,7 @@ "source": [ "As you can see, the preview run resolves the references to the data layers by looking up the entities' references in mapping.json. In the current case the result is not that interesting, though, since four different data layers are to be loaded, so there is nothing to cache during recipe execution. The QueryProcessor will therefore load all data layers from the referenced sources without storing any of them in the cache. \n", "\n", - "As a user, however, you can directly initiate the entire caching workflow (preview & full resolution recipe execution) by setting the context parameter when calling `recipe.execute(..., caching=True)`. " + "As a user, however, you can directly initiate the entire caching workflow (preview & full resolution recipe execution) by setting the context parameter when calling `recipe.execute(..., cache_data = True)`. 
" ] }, { @@ -178,7 +178,7 @@ "outputs": [], "source": [ "# same as above in a single step \n", - "result = recipe.execute(**{**context, \"caching\": True})" + "result = recipe.execute(**{**context, \"cache_data\": True})" ] }, { @@ -537,9 +537,8 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.1" - }, - "orig_nbformat": 4 + "version": "3.9.2" + } }, "nbformat": 4, "nbformat_minor": 2 diff --git a/demo/files/logs.txt b/demo/files/logs.txt index e878686a..675318fd 100644 --- a/demo/files/logs.txt +++ b/demo/files/logs.txt @@ -60,6 +60,9 @@ Attributes: _FillValue: 1.7976931348623157e+308 value_type: ordinal value_labels: {1: 'SVHNIR', 2: 'SVLNIR', 3: 'AVHNIR', 4: 'AVLNIR', 5: '... +Cache updated +Sequence of layers: [] +Currently cached layers: [] Applied verb evaluate: array([[[0., 0., 0., 0.], @@ -166,6 +169,9 @@ Attributes: _FillValue: 1.7976931348623157e+308 value_type: ordinal value_labels: {1: 'SVHNIR', 2: 'SVLNIR', 3: 'AVHNIR', 4: 'AVLNIR', 5: '... +Cache updated +Sequence of layers: [] +Currently cached layers: [] Applied verb evaluate: array([[[0., 0., 0., 0.], @@ -272,6 +278,9 @@ Attributes: _FillValue: 1.7976931348623157e+308 value_type: ordinal value_labels: {1: 'SVHNIR', 2: 'SVLNIR', 3: 'AVHNIR', 4: 'AVLNIR', 5: '... +Cache updated +Sequence of layers: [] +Currently cached layers: [] Applied verb evaluate: array([[[0., 0., 0., 0.], @@ -373,6 +382,9 @@ Attributes: _FillValue: 1.7976931348623157e+308 value_type: ordinal value_labels: {1: 'SVHNIR', 2: 'SVLNIR', 3: 'AVHNIR', 4: 'AVLNIR', 5: '... +Cache updated +Sequence of layers: [] +Currently cached layers: [] Applied verb evaluate: array([[[0., 0., 0., 0.], @@ -474,6 +486,9 @@ Attributes: _FillValue: 1.7976931348623157e+308 value_type: ordinal value_labels: {1: 'SVHNIR', 2: 'SVLNIR', 3: 'AVHNIR', 4: 'AVLNIR', 5: '... 
+Cache updated +Sequence of layers: [] +Currently cached layers: [] Applied verb evaluate: array([[[0., 0., 0., 0.], @@ -574,6 +589,9 @@ Attributes: _FillValue: 1.7976931348623157e+308 value_type: ordinal value_labels: {1: 'SVHNIR', 2: 'SVLNIR', 3: 'AVHNIR', 4: 'AVLNIR', 5: '... +Cache updated +Sequence of layers: [] +Currently cached layers: [] Applied verb evaluate: array([[[0., 0., 0., 0.], diff --git a/demo/processor.ipynb b/demo/processor.ipynb index fb29d6fb..1fc712db 100644 --- a/demo/processor.ipynb +++ b/demo/processor.ipynb @@ -65,7 +65,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Load a mapping.\n", + "# Load a recipe.\n", "with open(\"files/recipe.json\", \"r\") as file:\n", " recipe = sq.QueryRecipe(json.load(file))" ] @@ -591,7 +591,7 @@ " long_name: index\n", " _FillValue: nan\n", " value_type: nominal\n", - " value_labels: {1: 'feature_1'}
    [xarray HTML repr (collapsed): indexes y = [2696250.0, 2694750.0, 2693250.0, 2691750.0], x = [4530750.0, 4532250.0, 4533750.0, 4535250.0], time = [2019-01-01, 2020-12-31]; attrs name: index, long_name: index, _FillValue: nan, value_type: nominal, value_labels: {1: 'feature_1'}]
  • " ], "text/plain": [ "\n", @@ -1188,7 +1188,7 @@ " scale_factor: 1.0\n", " add_offset: 0.0\n", " _FillValue: 1.7976931348623157e+308\n", - " value_type: binary
  • " ], "text/plain": [ "\n", @@ -2192,7 +2192,7 @@ " scale_factor: 1.0\n", " add_offset: 0.0\n", " _FillValue: 1.7976931348623157e+308\n", - " value_type: binary
  • " ], "text/plain": [ "\n", @@ -2966,6 +2966,92 @@ "Semantique also allow to export an array to either a CSV file or a GeoTIFF file (requires spatial dimensions). To do so, call respectively the [to_csv](https://zgis.github.io/semantique/_generated/semantique.processor.arrays.Array.to_csv.html) or [to_geotiff](https://zgis.github.io/semantique/_generated/semantique.processor.arrays.Array.to_geotiff.html) methods through the [sq-accessor](#Data-structures) of the arrays." ] }, + { + "cell_type": "markdown", + "id": "d3481e84", + "metadata": {}, + "source": [ + "## Caching data layers\n", + "\n", + "The query processor allows to cache retrieved data layers to reduce RAM memory requirements if the same data layer is referenced multiple times in the query recipe or the mapping. RAM memory requirements are proportional to the number of data layers that are stored as intermediate results. Caching data layers in RAM should only be done for those that are needed again when evaluating downstream parts of the recipe. This requires foresight about the execution order of the recipe, which accordingly requires a preview run preceding the actual execution. This preview run is performed by loading the data with drastically reduced spatial resolution (5x5 pixel grid). It resolves the data references and fills a cache by creating a list of the data references in the order in which they are evaluated. This list is then used dynamically during the actual execution of the recipe as a basis for keeping data layers in the cache and reading them from there if they are needed again.\n", + "\n", + "Below the result of the preview run is shown first to demonstrate what the resolved data references look like. You will see that the same data layer is referenced multiple times. The resulting initialised cache can then be fed as an argument to the QueryProcessor in a second step for the actual recipe execution. 
" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "0591deae", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[['appearance', 'colortype'],\n", + " ['appearance', 'colortype'],\n", + " ['appearance', 'colortype'],\n", + " ['appearance', 'colortype'],\n", + " ['appearance', 'colortype'],\n", + " ['appearance', 'colortype']]" + ] + }, + "execution_count": 60, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Step I: preview run.\n", + "qp = QueryProcessor.parse(recipe, **{**context, \"preview\": True})\n", + "qp.optimize().execute()\n", + "qp.cache.seq" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "61f3b0dd", + "metadata": {}, + "outputs": [], + "source": [ + "# Step II: query processor execution.\n", + "qp = QueryProcessor.parse(recipe, **{**context, \"cache\": qp.cache})\n", + "response = qp.optimize().execute()" + ] + }, + { + "cell_type": "markdown", + "id": "02461c73", + "metadata": {}, + "source": [ + "When executing a query recipe you can directly initiate the entire caching workflow (preview & full resolution recipe execution) by setting the \"cache_data\" argument to `True`:" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "fa4eca40", + "metadata": {}, + "outputs": [], + "source": [ + "# Same as above in a single step.\n", + "response = recipe.execute(**{**context, \"cache_data\": True})" + ] + }, + { + "cell_type": "markdown", + "id": "aca485b4", + "metadata": {}, + "source": [ + "Caching does not always lead to a increase in performance. The effect depends on:\n", + "\n", + "* The resolution in which the query recipe is executed.\n", + "* The redundancy of the data references in the recipe, i.e. 
whether layers are referenced multiple times; if so, loading them from the cache reduces the overall time significantly.\n", "* The data source (EO data cube) from which they are retrieved.\n", "\n", "Note that our demos only analyse data loaded from locally stored GeoTIFF files. This is close to the worst case for demonstrating the benefits of caching, since the data is stored locally and is therefore quickly accessible. Also, GeoTIFF files that are not stored in a cloud-optimised format (COGs) require loading the whole dataset into memory, even when running in preview mode, just to evaluate the sequence of data layers. Keep in mind, however, that caching is designed for and particularly beneficial in the case of STACCubes when loading data over the internet." ] }, { "cell_type": "markdown", "id": "bc13ea19", "metadata": {}, @@ -2978,7 +3064,7 @@ }, { "cell_type": "code", - "execution_count": 60, + "execution_count": 63, "id": "0eda66ec", "metadata": {}, "outputs": [], @@ -2993,7 +3079,7 @@ }, { "cell_type": "code", - "execution_count": 61, + "execution_count": 64, "id": "6983f511", "metadata": {}, "outputs": [],
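The retention logic this patch documents (a preview run records the order of data layer references; during the full-resolution run a layer is kept in memory only if the same reference occurs again further down that sequence) can be sketched in plain Python. The class and names below are purely illustrative, not semantique's actual implementation or API:

```python
# Sketch of the two-pass caching strategy: `seq` is the reference order
# recorded by the preview run; during the real run a layer is cached only
# while it is still needed by a downstream part of the recipe.

class LayerCache:
    """Cache that uses a precomputed reference sequence to decide retention."""

    def __init__(self, seq):
        self.seq = [tuple(ref) for ref in seq]  # evaluation order from preview run
        self._pos = 0                           # number of references resolved so far
        self._store = {}                        # reference -> loaded data layer
        self.loads = 0                          # counts expensive source loads

    def get(self, ref, loader):
        key = tuple(ref)
        if key in self._store:
            data = self._store[key]             # cache hit: no source access needed
        else:
            data = loader(ref)                  # cache miss: load from the source
            self.loads += 1
        self._pos += 1
        if key in self.seq[self._pos:]:
            self._store[key] = data             # referenced again downstream: keep it
        else:
            self._store.pop(key, None)          # not needed again: free the memory
        return data

# The preview run resolved the same layer reference three times,
# mirroring the [['appearance', 'colortype'], ...] sequence shown above.
seq = [["appearance", "colortype"]] * 3
cache = LayerCache(seq)
for ref in seq:
    cache.get(ref, loader=lambda r: "layer data for %s" % "/".join(r))

print(cache.loads)  # -> 1: the layer was loaded from its source only once
```

Note how the last `get` call evicts the layer, since the sequence shows no further use of it; this is why RAM requirements stay proportional to the layers that are still needed downstream rather than to all layers retrieved.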