load_stac: band order remark #488

Open
soxofaan opened this issue Dec 14, 2023 · 10 comments · May be fixed by #491

Comments

@soxofaan
Member

soxofaan commented Dec 14, 2023

"description": "Loads data from a static STAC catalog or a STAC API Collection and returns the data as a processable data cube. A batch job result can be loaded by providing a reference to it.\n\nIf supported by the underlying metadata and file format, the data that is added to the data cube can be restricted with the parameters `spatial_extent`, `temporal_extent` and `bands`. If no data is available for the given extents, a `NoDataAvailable` exception is thrown.\n\n**Remarks:**\n\n* The bands (and all dimensions that specify nominal dimension labels) are expected to be ordered as specified in the metadata if the `bands` parameter is set to `null`.\n* If no additional parameter is specified this would imply that the whole data set is expected to be loaded. Due to the large size of many data sets, this is not recommended and may be optimized by back-ends to only load the data that is actually required after evaluating subsequent processes such as filters. This means that the values should be processed only after the data has been limited to the required extent and as a consequence also to a manageable size.",

has this remark:

The bands (and all dimensions that specify nominal dimension labels) are expected to be ordered as specified in the metadata if the bands parameter is set to null.

What does "ordered as specified in the metadata" mean practically?
Also, doesn't this heavily depend on the STAC extensions in play (if any)?

For example, take a STAC Item like this (using the eo STAC extension):

{
  "type": "Feature",
  "assets": {
    "B04": {
      "eo:bands": [{"name": "B04"}],
      ...
    "B05": {
      "eo:bands": [{"name": "B05"}],  
      ...

I assume the spirit of the remark above is to take band order ["B04", "B05"], but this comes from the "assets" mapping, which technically does not imply an order.
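To make the ambiguity concrete, here is a minimal sketch (plain Python, standard library only). The helper `bands_in_asset_order` is hypothetical and only illustrates one naive way of deriving band order from "assets". Two items that are equal as JSON objects can still yield different band orders, because the result depends on the key order in the serialized document:

```python
import json

# Two serializations of the same assets mapping; only the key order differs.
item_a = json.loads(
    '{"assets": {"B04": {"eo:bands": [{"name": "B04"}]},'
    ' "B05": {"eo:bands": [{"name": "B05"}]}}}'
)
item_b = json.loads(
    '{"assets": {"B05": {"eo:bands": [{"name": "B05"}]},'
    ' "B04": {"eo:bands": [{"name": "B04"}]}}}'
)


def bands_in_asset_order(item):
    """Collect band names by walking the assets mapping in parse order."""
    return [
        band["name"]
        for asset in item["assets"].values()
        for band in asset.get("eo:bands", [])
    ]


same_content = item_a == item_b          # dict equality ignores key order
order_a = bands_in_asset_order(item_a)   # follows the key order of document A
order_b = bands_in_asset_order(item_b)   # follows the key order of document B
```

Here `same_content` is true while `order_a` and `order_b` differ, which is exactly why relying on the "assets" mapping for band order seems fragile.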

@soxofaan
Member Author

cc @bossie

@bossie

bossie commented Dec 14, 2023

FYI, my interpretation was that this "metadata" refers to the list of bands in an Item's properties or a Collection's summaries.

@m-mohr
Member

m-mohr commented Dec 20, 2023

Yeah, this needs clarification.

Proposal:

  1. If cube:dimensions is present, use the order of values for the corresponding dimension if available.
  2. If eo:bands, raster:bands or soon bands is available in Item Properties or Collection Summaries, use the order from the array.
  3. If eo:bands, raster:bands or soon bands is available in assets:
    1. For a single data asset with bands: Use the order from the array.
    2. For multiple assets with bands: Sort the band names (following the sort process).
  4. If nothing is present, assign zero-based indices as band names (as we do in other processes such as apply_dimension) - the user probably has to experiment in this case as the order of the actual bands is not clear.

For categorical, non-band dimensions (i.e. type other) 1 and 4 apply.
x,y,z,t should be clear as they usually have an implicit order.
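The four steps above could be sketched roughly as follows. This is only an illustration of the proposed precedence, not an actual back-end implementation. The function names, the assumption that the band dimension is called "bands", and the `band_count` argument (the number of bands found in the data when no metadata exists) are all hypothetical, and Python's `sorted` stands in for the openEO `sort` process:

```python
BAND_FIELDS = ("eo:bands", "raster:bands", "bands")


def names_from(container):
    """Return band names from the first usable bands array in `container`."""
    for field in BAND_FIELDS:
        names = [band.get("name") for band in container.get(field, [])]
        if names and all(names):
            return names
    return None


def resolve_band_order(item, collection=None, dimension="bands", band_count=0):
    """Apply the proposed precedence to STAC Item/Collection dicts."""
    properties = item.get("properties", {})
    collection = collection or {}

    # 1. cube:dimensions: use the order of `values` for the band dimension.
    for source in (properties, collection):
        values = source.get("cube:dimensions", {}).get(dimension, {}).get("values")
        if values:
            return list(values)

    # 2. Bands array in Item properties or Collection summaries.
    for source in (properties, collection.get("summaries", {})):
        names = names_from(source)
        if names:
            return names

    # 3. Bands arrays in assets.
    per_asset = [names_from(asset) for asset in item.get("assets", {}).values()]
    per_asset = [names for names in per_asset if names]
    if len(per_asset) == 1:
        return per_asset[0]  # 3.1: single asset, keep the array order
    if per_asset:
        # 3.2: multiple assets, sort the (deduplicated) band names
        return sorted({name for names in per_asset for name in names})

    # 4. No metadata at all: zero-based indices as band labels.
    return list(range(band_count))


item = {
    "assets": {
        "B05": {"eo:bands": [{"name": "B05"}]},
        "B04": {"eo:bands": [{"name": "B04"}]},
    }
}
order = resolve_band_order(item)
```

For the example Item from the first comment (with the assets listed in either order), step 3.2 applies and the result is the sorted list of band names.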

@m-mohr
Member

m-mohr commented Dec 20, 2023

See also #489 for a related issue.

@soxofaan
Member Author

To play devil's advocate: that proposal looks quite complicated (a lot of ifs and branches), which on its own is not very user-friendly. It also practically means that the effective band order (and even the band names) might suddenly change if the data provider changes or improves their STAC metadata (e.g. adds eo:bands under the item properties).

I wonder if it wouldn't be better to keep it simpler (including allowing the actual band order and names to be undefined if metadata is missing) and just add a strong recommendation for users to explicitly specify the bands argument in load_stac to avoid surprises.
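As an illustration of that recommendation, here is a small sketch that builds a `load_stac` process graph node as a plain dict with an explicit `bands` argument. The helper `load_stac_node`, the node key `loadstac1` and the URL are made up for the example; only the process name and the `url` and `bands` parameters come from the process definition:

```python
import json


def load_stac_node(url, bands=None):
    """Build a load_stac process graph node as a plain dict."""
    arguments = {"url": url}
    if bands is not None:
        # An explicit list pins both the band selection and the band order.
        arguments["bands"] = list(bands)
    return {"process_id": "load_stac", "arguments": arguments, "result": True}


# Hypothetical STAC URL, for illustration only.
node = load_stac_node(
    "https://stac.example/collections/demo",
    bands=["B04", "B05"],
)
print(json.dumps({"loadstac1": node}, indent=2))
```

With `bands` given explicitly, the resulting order no longer depends on how the data provider structures the STAC metadata.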

@m-mohr
Member

m-mohr commented Dec 20, 2023

That would be another, simpler option, indeed. The list above only applies if the bands array is not specified anyway.

@m-mohr
Member

m-mohr commented Jan 3, 2024

Let's try to simplify and still give the user something to work with:

  1. If bands is provided in load_collection / load_stac: Use the order as provided by the user.
  2. Use the order as provided in the values for the corresponding dimension, if available
  3. Fall back to the order in the file format (the bands arrays mirror what is in the file anyway)

For load_collection it's simpler: only 1 and 2 apply there.
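The simplified precedence can be sketched as a first-match chain. This is a hedged illustration only: `pick_band_order` and its argument names are hypothetical, and how a back-end actually obtains the dimension values or the order in the source files is left out:

```python
def pick_band_order(user_bands=None, dimension_values=None, file_bands=None):
    """Return (source, bands) for the first rule that yields a band list."""
    candidates = (
        ("user", user_bands),                   # 1. `bands` parameter
        ("cube:dimensions", dimension_values),  # 2. `values` of the dimension
        ("file", file_bands),                   # 3. order in the source data
    )
    for source, bands in candidates:
        if bands:
            return source, list(bands)
    return "undefined", None


# load_stac: all three rules may apply.
stac_result = pick_band_order(
    dimension_values=["B02", "B03"],
    file_bands=["B03", "B02"],
)

# load_collection: only rules 1 and 2 apply, so no file order is passed.
collection_result = pick_band_order(user_bands=["B08", "B04"])
```

Returning the source alongside the bands is a design choice for this sketch; it makes it easy to see (or log) which rule decided the order.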

@soxofaan
Copy link
Member Author

soxofaan commented Jan 4, 2024

  2. Use the order as provided in the values for the corresponding dimension, if available

Based on #491, I guess that by "values" you mean the "values" field of "cube:dimensions" from the datacube STAC extension?

  3. Fall back to the order in the file format (the bands arrays mirror what is in the file anyway)

I'm not completely sure what you mean by "file" or "the file" in this context, but as I mentioned in #491, I think this might not be as trivial as it sounds: there might be multiple "files" in play with inconsistent band sets or band order, and the "file" aspect of the data to load might be an implementation detail of the data provider and subject to change.

@m-mohr
Member

m-mohr commented Jan 4, 2024

Based on #491, I guess that by "values" you mean the "values" field of "cube:dimensions" from the datacube STAC extension?

Yes.

I'm not completely sure what you mean by "file" or "the file" in this context, but as I mentioned in #491, I think this might not be as trivial as it sounds: there might be multiple "files" in play with inconsistent band sets or band order, and the "file" aspect of the data to load might be an implementation detail of the data provider and subject to change.

Source data, whatever that might be. Usually COGs, netCDF, ... For me this is a fallback and, as such, a best-effort way to give the user at least some clue, e.g. similar to how GDAL does it, maybe. I assume consistent STAC catalogs here. If this doesn't work, we are back to "undefined" anyway, so I'm not sure whether having the file reference really hurts.

@m-mohr
Member

m-mohr commented Aug 7, 2024

dev telco:

  1. cube:dimensions if applicable
  2. band names
  3. band indices (if multiple bands per file)
  4. asset names (if one band per file) - sorted in alphabetical order?
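One possible reading of the last two fallbacks, as a rough sketch: the function `fallback_labels` and its input (a mapping from asset name to the number of bands in that file) are hypothetical, and the alphabetical sorting is kept optional because it is still an open question above:

```python
def fallback_labels(band_counts, sort_assets=True):
    """Derive band labels when neither cube:dimensions nor band names exist.

    `band_counts` maps asset name -> number of bands in that file.
    """
    if not band_counts:
        return None
    if all(count == 1 for count in band_counts.values()):
        # One band per file: use the asset names as labels.
        names = list(band_counts)
        return sorted(names) if sort_assets else names
    if len(band_counts) == 1:
        # A single multi-band file: use zero-based band indices.
        (count,) = band_counts.values()
        return list(range(count))
    # Several files with differing band counts: not covered by the list above.
    return None


one_band_per_file = fallback_labels({"B05": 1, "B04": 1})
multi_band_file = fallback_labels({"data": 3})
```

The mixed case (several assets, some of them multi-band) returns `None` here because the list above does not say how to handle it.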
