Skip to content

workload definition

gereon edited this page Jan 31, 2024 · 3 revisions

Workload Definition

We designed the workload definition such that it can be divided into four parts: It first describes the location folders of the datasets, secondly contains the structure of the query, next the parameters to set or change, and last the systems to test for. All four parts are wrapped into a single experiments obejct.

Dataset locations

Defining the dataset locations is straight forward: In the given field, simply supply one folder each for the raster and vector dataset. These folders then contain the dataset files themselves. While looking like an overhead for self-contained files like .geojson or .geotiff, this allows RaVeN to easily handle multi-file datasets as they occur in ESRI Shapefiles.

  data:
    raster: /data/glc/GlcShare_v10_Dominant
    vector: /data/us_state

Query

As RaVeN is capable to test a plethora of systems which do not support a common query language, we cannot simply define our query using SQL or a similar query language. Instead, we decided to define a benchmark-specific query language that can be translated effortlessly into other query languages. As the focus of RaVeN is to only combine one vector and raster dataset each, our domain-specific query language only needs to support a subset of possible queries. The user therefore is limited to defining attributes for the following query parameters: get including aggregations, where, group, and order. Finally, for the join parameter the user has to define for the two input datasets, whether they contain vector or raster datasets. They also have to define a join condition, which uses a subset of spatial relations as possible conditions.

It is possible to aggregate on both raster and vector attributes. If an aggregation occurs, all non-aggregated attributes need to again occur in the group condition.

The value of a raster pixel is always called sval. RaVeN currently can neither handle multiple raster values per pixel within one band, nor operations on multiple bands.

The sample query part, wrapped inside the workload object:

  workload:
    get: 
      vector:
        - AFFGEOID
      raster: 
        - sval
        - sval:
            aggregations:
              - count
    join:
      table1: vector
      table2: raster
      condition: intersect(raster, vector)
    group:
      vector: 
        - AFFGEOID
      raster: 
        - sval
    order:
      vector: 
        - AFFGEOID
      raster: 
        - sval

Here, the query is kept rather simple: If the system does not support a special join operator, we use an intersect join between the raster and vector dataset. For the vector file we then project on the IDs of the different vector features, while for the raster values we project and group by the values of the pixels and then count their frequency within the vector feature. Finally, we order the gathered results.

Spatial Relations for Joins

RaVeN currently supports three different join conditions.

intersect

When choosing intersect, RaVeN will perform the join based on the ST_Intersects predicate.

contains

When choosing contains, RaVeN will perform the join based on the ST_Contains predicate.

bestrasvecjoin

When choosing bestrasvecjoin, RaVeN will perform the join based on a set of chained spatial predicates. The goal is to create a match only if the interior of two spatial objects overlaps. This is done to avoid any inaccuracies where two spatial objects may only connect on their boundaries. The 9IM for this predicate is as follows:

$$ bestrasvecjoin =

\begin{pmatrix} T & * & * \

  • & * & * \
  • & * & * \end{pmatrix} $$

At the moment, this predicate can only be used with Sedona due to limitations within the systems or their interpretation of pixels.

Parameters

To facilitate easy benchmarking, RaVeN also supports setting parameters that are tested automatically. Most of these parameters change how the datasets are ingested and are therefore handled by the preprocessor. Most of these parameters take in arrays. RaVeN then automatically creates all possible combinations and executes them as benchmark runs within a benchmark set (see database).

All possible parameters as well as some sample values:

  parameters:
    raster_format:
      - geotiff
      - jp2
    rasterize_format:
      - geotiff
      - jp2
    raster_target_crs:
      - epsg:4326
      - epsg:25833
    raster_tile_size:
      - 1x100
      - 10x10
      - 100x1
      - auto
    raster_depth:
      - 1
      - 12
    raster_resolution:
      - 1
      - 0.5

    vector_format:
      - shp
      - geojson
    vectorize_format:
      - geojson
      - shp
    vectorize_type:
      - polygons
      - points
    vector_target_crs:
      - epsg:4326
    vector_resolution:
      - 0.8
      - 0.32

    align_to_crs: vector | raster | both # align vector crs to raster crs or vice versa or do 2 runs
    align_crs_at_stage:
      - preprocess
      - execution

  iterations: 2
  warm_starts: 3
  timeout: 10800
Parameter Value Range Default Description In use?
raster_format {geotiff, jp2} input file format The format of the raster file at the point of ingestion Yes
rasterize_format {geotiff,jp2} geotiff The format a vector file shall be rasterized to somewhat
raster_target_crs a valid EPSG code input file CRS The CRS to target Yes
raster_tile_size {$x \in \N^+$ x $y \in \N^+$, auto} auto The tile size within a system Yes
raster_depth $d \in \N^+$ Full depth The depth of a raster file in bits No
raster_resolution $r \in (0, 1]$ $1.0$ The resolution of the raster file as a factor No
vector_format {shp, geojson} input file format The format of the vector file at the point of ingestion Yes
vectorize_format {shp, geojson} shp The format a raster file shall be vectorized to No
vectorize_type {points, polygons} points The interpretation of raster pixels if vectorization takes place Yes
vector_target_crs a valid EPSG code input file CRS The CRS to target Yes
vector_resolution $r \in (0, 1]$ $1.0$ The resolution of the vector file as a factor No
align_to_crs {raster, vector,both} None or vector Which CRS of which dataset shall be aligned Yes
align_crs_at_stage {preprocess, execution} preprocess When to perform a CRS alignment Yes
iterations $i \in \N^+$ 1 The number of iterations for each complete benchmark set Yes
warm_starts $w \in \N$ 3 The number of warm starts Yes
timeout $t \in \N$ 10800 sec When a query shall time out during the execution phase

Benchmark Parameters

The user has the option to define a set of benchmark parameters. These parameters are used by the BenchmarkRunFactory to create consecutive benchmark runs that automatically test the specified parameter combinations. Modifiable parameters include those affecting file types the datasets should have prior to ingestion, when and how to modify the CRS of the datasets, selecting the type of vectorization the raster dataset should use if required by the system, and the tile size of the raster dataset. If a paramater is not defined explicitly, it will be set automatically using some default value.

For example, we can choose three different benchmark parameters: The raster_tile_size defines a list of tile sizes to test, align_to_crs defines that in this case the CRS of the vector dataset should be aligned to the CRS of the raster dataset, and align_crs_at_stage states, that the CRS should be aligned in the preprocess stage. Implicitly, RaVeN will e.g. set the vector_file_type and raster_file_type parameters to the file types the datasets are provided in, in this case SHP and GeoTIFF.

The parameters can also define options like the target file format for both raster and vector files, information on raster tile sizes, and the vectorization_type, which is especially important for vector-based systems. There also exist the raster_depth, raster_resolution, and vector_resolution that can modify the raster and vector datasets directly. Initially we intended to base dataset classes on only one dataset and then reduce resolution and depth of the datasets on-the-fly using GDAL for the raster transformations and a suitable algorithm like the Ramer-Douglas-Peucker algorithm for reducing the vertex-to-feature ratio in vector datasets. Because we found enough real-world datasets to base our dataset classes on, these parameters serve no function in the current version of RaVeN.

After the parameters have been loaded, they are checked for accuracy and consistency in the FileIO class to prevent any misconfigurations. Also, some final fixes are applied to shape the configs properly. Then, before the data is returned to the main routine, duplicates are eliminated to avoid running any benchmark twice without necessity.

Controlling, how the benchmarks are run is now the task of the controlling parameters. Currently, three of these parameters exist: warm_starts for defining, how many warm starts of the executions shall be performed per benchmark run with a default of 3, iterations for defining how often the complete benchmark run shall be repeated with a default of 1, and timeout for setting an upper limit on the execution duration with a default of 3 hours.

Systems

Finally, the user needs to define a set of systems to run the workload on. Here, they define the name of the system and the port at which they can be accessed. Internally, the systems are treated as a special kind of benchmark parameter. This also enables us to use a less complex data structure.

  systems:
    - name: postgis
      port: 25432
    - name: omnisci
      port: 6274
    - name: rasdaman
      port: 8080
    - name: sedona
    - name: beast