Automate adding an evenly-spaced numerical code for categorical datasets #48

Open
julietcohen opened this issue Apr 23, 2024 · 1 comment · May be fixed by #58

Comments

@julietcohen
Collaborator

Some datasets submitted to the PDG are categorical data rather than numeric continuous data. These categorical codes may come in the form of strings, such as in the permafrost and ground ice dataset that has 4 categories of permafrost coverage, each described with a letter, and they are ordered (least to most permafrost coverage). Other categorical codes come as numbers, but the number does not represent magnitude or order. Instead, each number represents a string, such as in the SACHI_v2 infrastructure dataset that has 7 different infrastructure types.

For ordered categorical datasets like the permafrost and ground ice coverage dataset, we want to use a palette that has 4 distinct shades of 1 or 2 colors, like light blue to dark blue, to show the rank of each category. But unlike a continuous dataset, there should only be 4 shades, nothing in between.

For unordered categorical datasets like the infrastructure dataset, we want to use a palette with a distinct color for each possible value, such as red, blue, green, orange, gray, etc., rather than a scale of shades of one color or a diverging palette of a couple of colors, because we want the categories to appear unrelated to each other.

To web tile and assign a palette in the first place, the values in the raster cells must be numbers and cannot be strings. So for a dataset like permafrost and ground ice, we can assign 1 to the category with the least coverage, 2 to the category with the second-least coverage, 3 to the category with the second-most coverage, and 4 to the category with the most coverage.

For the infrastructure dataset, the categorical values are already numbers, but instead of the evenly spaced 1, 2, 3, 4, 5, 6, 7 they are the unevenly spaced 11, 12, 13, 20, 30, 40, 50. As a result, even if we assign a palette with 7 distinct colors to this statistic, the web tiling step will fail to assign one color to one category. To assign one color per category, it seems that we need to translate the unevenly spaced code into an evenly spaced one: make a new attribute in the vector stage that codes every 11 as 1, 12 as 2, 13 as 3, 20 as 4, 30 as 5, etc. Then when we rasterize, we should make 2 bands per raster: one for the uneven code given in the dataset, so that the numbers in the raster cells match the metadata provided by the researcher, and one for the even code, just so we can web tile that and put it on the portal.
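
The translation described above could look roughly like the sketch below. It is only an illustration in plain Python: `even_code_lookup`, `feature_codes`, and `feature_even` are hypothetical names, not part of viz-staging, and in practice the lookup would be applied to a column of the vector data before staging.

```python
def even_code_lookup(codes):
    """Map each distinct code to its 1-based rank in sorted order."""
    return {code: rank for rank, code in enumerate(sorted(set(codes)), start=1)}


# The 7 infrastructure type codes listed in the issue.
infrastructure_codes = [11, 12, 13, 20, 30, 40, 50]
lookup = even_code_lookup(infrastructure_codes)

# Hypothetical per-polygon codes; the original value is kept alongside the
# even one so both can become raster bands.
feature_codes = [20, 11, 50, 13]
feature_even = [lookup[code] for code in feature_codes]
```

Keeping the original code next to the even one matches the two-band idea: one band preserves the researcher's values, the other exists only for web tiling.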

I tried this with the infrastructure dataset and got good results. See the issue comment here. But this dataset takes a while to process, so testing my theory can be done with fewer polygons.

Reproducible Example

To test my theory with a smaller data sample, I used a small sample of IWP polygons (6,000 total polygons, all near each other on Wrangel Island) and assigned 2 new attributes to the input data for the viz workflow: code_even and code_uneven.

  • The first 2,000 polygons of the IWP gpkg get a code of 1 for code_even and 16 for code_uneven.
  • The next 2,000 polygons get a code of 2 for code_even and 20 for code_uneven.
  • The last 2,000 polygons get a code of 3 for code_even and 40 for code_uneven.
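
The block assignment in the list above can be sketched as follows. This is a minimal illustration, not the exact code I ran: `block_codes` is a hypothetical helper, and the rows stand in for the polygons in the order they appear in the gpkg.

```python
def block_codes(n_rows, block_size, codes):
    """Give the first `block_size` rows codes[0], the next block codes[1], and so on."""
    if n_rows > block_size * len(codes):
        raise ValueError("not enough codes for the requested number of rows")
    return [codes[i // block_size] for i in range(n_rows)]


code_even = block_codes(6000, 2000, [1, 2, 3])
code_uneven = block_codes(6000, 2000, [16, 20, 40])

# With geopandas, these lists could then be attached as columns, e.g.
# gdf["code_even"] = code_even, before writing the file used as workflow input.
```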

So overall, the value range of code_even is [1,3] and the value range of code_uneven is [16,40]. Note that each value is assigned to the same number of polygons, which isolates this test from the separate palette issue described in issue #35 about datasets with skewed distributions of the attributes we want to visualize.

I ran the viz workflow with 2 stats, one for each code. Each stat has the same palette: red, blue, and green.

viz workflow
# filepaths
from pathlib import Path
import os

# visual checks & vector data wrangling
import geopandas as gpd

# staging
import pdgstaging
from pdgstaging import TileStager

# rasterization & web-tiling
import pdgraster
from pdgraster import RasterTiler

# logging
from datetime import datetime
import logging
import logging.handlers
from pdgstaging import logging_config

config = {
  "deduplicate_clip_to_footprint": False, 
  "dir_input": "/home/jcohen/test_categorical_webtiling/data/",
  "ext_input": ".gpkg",
  "dir_staged": "staged/",
  "dir_geotiff": "geotiff/", 
  "dir_web_tiles": "web_tiles/", 
  "filename_staging_summary": "staging_summary.csv",
  "filename_rasterization_events": "raster_events.csv",
  "filename_rasters_summary": "raster_summary.csv",
  "filename_config": "config",
  "simplify_tolerance": 0.1,
  "tms_id": "WGS1984Quad",
  "z_range": [
    0,
    12
  ],
  "geometricError": 57,
  "z_coord": 0,
  "statistics": [
    {
      "name": "code_uneven",
      "weight_by": "area",
      "property": "code_uneven",
      "aggregation_method": "max",
      "resampling_method": "nearest", 
      "val_range": [
        16,
        40
      ],
      "palette": [
        "#ff1f1f",
        "#4b1fff",
        "#1fff2a"
      ],
      "nodata_val": 0,
      "nodata_color": "#ffffff00"
    },
    {
      "name": "code_even",
      "weight_by": "area",
      "property": "code_even",
      "aggregation_method": "max",
      "resampling_method": "nearest", 
      "val_range": [
        1,
        3
      ],
      "palette": [
        "#ff1f1f",
        "#4b1fff",
        "#1fff2a"
      ],
      "nodata_val": 0,
      "nodata_color": "#ffffff00"
    },
  ],
  "deduplicate_at": [None],
  "deduplicate_method": None
}

stager = TileStager(config)
stager.stage_all()

RasterTiler(config).rasterize_all()

See the difference in the output web tiles:

code_even

The first 2,000 polygons are red, the next 2,000 are blue, and the last 2,000 are green. There are no colors in between.

[Image: code_even web tiles]

code_uneven

The first 2,000 polygons are red, the next 2,000 are not blue, but rather pink (because 20 is closer to 16 than it is to 40), and the last 2,000 polygons are green.

[Image: code_uneven web tiles]
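
To illustrate why 20 ends up between red and blue, here is a small sketch that places a value within val_range and interpolates along the palette. It assumes the palette colors are spread evenly across val_range and blended linearly in RGB; that is an assumption made for illustration, and the colormap logic that the web tiling step actually uses may differ. `hex_to_rgb` and `color_for_value` are hypothetical helpers.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def color_for_value(palette, val_range, value):
    """Linearly interpolate a color for `value` (assumed RGB blending)."""
    lo, hi = val_range
    position = (value - lo) / (hi - lo)        # 0.0 at lo, 1.0 at hi
    scaled = position * (len(palette) - 1)     # index along the palette stops
    idx = min(int(scaled), len(palette) - 2)   # lower stop of the segment
    t = scaled - idx                           # fraction toward the upper stop
    start, end = hex_to_rgb(palette[idx]), hex_to_rgb(palette[idx + 1])
    rgb = tuple(round(a + (b - a) * t) for a, b in zip(start, end))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


palette = ["#ff1f1f", "#4b1fff", "#1fff2a"]
even_middle = color_for_value(palette, (1, 3), 2)       # lands exactly on the blue stop
uneven_middle = color_for_value(palette, (16, 40), 20)  # lands between red and blue
```

Under these assumptions, code 2 sits exactly at the midpoint of [1,3] and gets the middle palette color, while code 20 sits only one sixth of the way through [16,40], which is one third of the way from red to blue.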
@katmatson

As a general approach, how does it sound to add to the config the ability to specify that a column is categorical and what the proper ordering of the values is?

For the permafrost and ground ice coverage, for example, I'm thinking that would look something like:

categorical_value: [
  {
    prop: 'EXTENT',
    values: ['I', 'S', 'D', 'C']
  },
]

viz-staging would then use this to create a new property, perhaps named something like EXTENT_normalized, where a vector with EXTENT 'I' has EXTENT_normalized set to 1, EXTENT 'S' gets EXTENT_normalized 2, etc.
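
A rough sketch of that step, assuming the config shape proposed above. Plain dicts stand in for vector rows here, and `add_normalized_properties` is a hypothetical function, not an existing viz-staging API.

```python
# Proposed config entry: property name plus its values in the desired order.
categorical_value = [
    {"prop": "EXTENT", "values": ["I", "S", "D", "C"]},
]


def add_normalized_properties(records, categorical_config):
    """Add `<prop>_normalized` to each record, using the 1-based position of its value."""
    for entry in categorical_config:
        prop = entry["prop"]
        rank = {value: i for i, value in enumerate(entry["values"], start=1)}
        for record in records:
            record[f"{prop}_normalized"] = rank[record[prop]]
    return records


# Hypothetical rows standing in for vectors with an EXTENT attribute.
features = [{"EXTENT": "C"}, {"EXTENT": "I"}, {"EXTENT": "S"}]
normalized = add_normalized_properties(features, categorical_value)
```

A value that is missing from the configured list raises a KeyError in this sketch, which may or may not be the behavior we want for unexpected categories.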

For properties with categorical numerical values, are the possible values known before running the viz-staging pipelines? If so, that one config would be sufficient; if not, there'd need to be something added to keep track of all seen values for that property and then add the normalized values only after going through all of the vectors the first time.
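
If the values are not known up front, the two-pass idea could be sketched like this. `CategoryTracker` is a hypothetical class for illustration only, and the batches stand in for whatever unit the pipeline reads vectors in.

```python
class CategoryTracker:
    """Collect every value seen for a property, then build an even-code lookup."""

    def __init__(self):
        self.seen = set()

    def update(self, values):
        """First pass: record the values found in one batch of vectors."""
        self.seen.update(values)

    def lookup(self):
        """After the first pass: map each seen value to its 1-based sorted rank."""
        return {value: i for i, value in enumerate(sorted(self.seen), start=1)}


tracker = CategoryTracker()
for batch in ([11, 12, 20], [13, 50, 11], [30, 40]):
    tracker.update(batch)

# Second pass would use this lookup to write the normalized property.
lookup = tracker.lookup()
```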
