Occurrence cube downloads #1978

MortenHofft · 2024-11-04T15:03:03Z

I've started the work of adding a new download option

And an SQL UI for downloads

Could you please help evaluate if this is functionally what you had in mind?
Functionally there are 2 things that aren't working

- Downloads always fail despite being accepted by the SQL validator. Perhaps related: "SQL validation: UAT adds iceberg to the response" (occurrence#360).
- The WHERE clause isn't filled. Waiting for "feat: internal service to translate predicates to SQL WHERE" (occurrence#356).

If the functions are as you expected, then we could think about usability (bearing in mind that we will have to reimplement this in the near future)

Do we need articles or help pages to refer to?
More help texts in the UI?
Different labels?
Extend the general download API with a description field, which we could auto-generate (from the form) for a more human-readable context.
Allow comments in the SQL?
What is a good name for the download type? And what title?
How to make this meaningful with the table we have for other formats, to show what you can expect?
Does an about page for the SQL editor make sense? It is a placeholder for now. If so, someone should write it.
...?

peterdesmet · 2024-11-06T12:37:52Z

Nice!! Here's some feedback.

Download page

Download type name: Cube rather than SQL cube
Coordinates column can be ✔ (if selected). Currently this uses a symbol that is different than others (✔ not ✓)

Modal

Title Download cube is fine
Modal help text:

This download format allows you to aggregate occurrences by their taxonomic, temporal and/or spatial properties. For example, a data cube can be configured to aggregate occurrences by family, month and grid cell of the European Environment Agency reference grid (three dimensions) and count the number of occurrences (a measure) per combination. The result is a CSV file.

Once configured, a SQL query will be created to generate the data cube. For more advanced use, it is possible to further customize the query by editing the created SQL.

You can read more about species occurrence cubes here.

Dimensions help text:

A dimension represents an aspect along which data can be aggregated. Selecting a higher resolution (e.g. species over family, date over year, 100 m over 10 km) will result in more categories and therefore more records.

Taxonomic dimension help text:

This dimension aggregates occurrences by their taxonomic rank.

Temporal dimension help text:

This dimension aggregates occurrences by time.

Spatial dimension help text:

This dimension aggregates occurrences in a spatial grid.

Spatial resolution help text:

The size of each grid cell.

I would update the values and order for Spatial resolution + add sections:

Global
- Military grid reference systems (MGRS)
- Extended quarter degree grid (QDGC)
- ISEA3H grid
Europe
- EEA reference grid

SQL editor

Comments in SQL are fine with me; I'm not sure whether they are retained in the query string, though.
Update help text at bottom to (no then):

The easiest way to download and explore data is via the occurrence search user interface. But for complex queries and aggregations, the SQL editor provides more freedom.
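To make the modal help text concrete, here is a minimal sketch of how selected dimensions could be turned into an aggregating query with an occurrence count as the measure. It is not the UI's actual implementation: `buildCubeSql`, the `Dimension` type, the table name `occurrence`, the column names and especially `GRID_CELL_CODE(...)` are assumptions made up for illustration.

```typescript
// One cube dimension: a SQL expression plus an optional output column name.
// (Hypothetical type, not taken from the UI code.)
type Dimension = { expression: string; alias?: string };

// Build an aggregating query: one output row per combination of dimension
// values, with the number of occurrences as the measure.
function buildCubeSql(dimensions: Dimension[], table = "occurrence"): string {
  if (dimensions.length === 0) {
    throw new Error("Select at least one dimension");
  }
  const columns = dimensions.map((d) =>
    d.alias ? `${d.expression} AS ${d.alias}` : d.expression
  );
  columns.push("COUNT(*) AS occurrences");
  // Group by the expressions themselves rather than by their aliases.
  const groupBy = dimensions.map((d) => d.expression);
  return [
    `SELECT ${columns.join(", ")}`,
    `FROM ${table}`,
    `GROUP BY ${groupBy.join(", ")}`,
  ].join("\n");
}

// Family, month and grid cell, as in the help text example.
// GRID_CELL_CODE(...) is a placeholder, not a real function name.
const exampleSql = buildCubeSql([
  { expression: "familyKey" },
  { expression: '"month"' },
  {
    expression: "GRID_CELL_CODE(decimalLatitude, decimalLongitude)",
    alias: "cellCode",
  },
]);
console.log(exampleSql);
```

The generated string is what a user would then be free to customize further in the SQL editor.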

MortenHofft · 2024-11-07T07:49:29Z

Thanks @peterdesmet

On the SQL editor, I would include the link to the occurrence search in the text, rather than a button:

I agree it is nicer; it is a button only because that makes life easier for translators. Having them write markdown with variables has caused issues in the past.

MortenHofft · 2024-11-11T09:59:46Z

The values for Spatial resolution should probably be updated slightly. For one, I think that the spatial resolution for MGRS is in meters.

Yeah I know those are wrong. I'm waiting for you or Matt to tell me what they should be please. I've changed the MGRS as you specified above

EXTENDED_QUARTER_DEGREE_GRID should be?
ISEA3H_GRID should be?

MortenHofft · 2024-11-11T13:13:50Z

timrobertson100 · 2024-11-11T15:41:33Z

Thanks. I think the text helps in guiding the user.

I think adding the ability to give it a human-readable type / description would be good. Alternatively, we could introduce a cube download format in the API itself, which takes the form parameters but does the SQL conversion behind the backend API. The reason to do that would be to display the parameters used on the download page, which is shown from the DOI. A user could still "open this in the SQL builder" before submitting to do more complicated queries, but it'd hide SQL completely for anyone who didn't. I don't know which would be the more scalable option.

peterdesmet · 2024-12-17T14:05:14Z

@MortenHofft

1. I have reviewed the modal and the help text. See my updated Occurrence cube downloads #1978 (comment). Two inputs from @MattBlissett needed.
2. Is the functionality ready for testing?
3. Do we need a separate (stable) help page at https://www.gbif-uat.org/occurrence-cubes describing the functionality, or is it sufficient to refer to https://techdocs.gbif.org/en/data-use/data-cubes? To be assessed by the communications team.
4. @timrobertson100 having an endpoint "which takes the form parameters" would indeed be better documentation of the dimensions in the recorded metadata, which in turn would help conversion to e.g. EBV Cubes. Just having the SQL statement doesn't tell us anything about the dimensions that were selected, since the columns can be named however they want.

timrobertson100 · 2024-12-17T16:21:00Z

On 4. please see our proposed approach here. A JSON object (the context) would hold the submitted form parameters.

MortenHofft · 2025-02-06T10:35:35Z

In agreement with Matt we will add a field called machineDescription. When creating a download it can be provided as

{
  "machineDescription": {"any": "thing"}, <====
  "creator": "gbif_username",
  "sendNotification": true,
  "notification_address": [
    "[email protected]"
  ],
  "format": "DWCA",
  "predicate": {
    "type": "and",
    "predicates": [
      {
        "type": "equals",
        "key": "COUNTRY",
        "value": "FR"
      },
      {
        "type": "equals",
        "key": "YEAR",
        "value": "2017"
      }
    ]
  },
  "verbatimExtensions": [
    "http://rs.tdwg.org/ac/terms/Multimedia"
  ]
}

And in the response for `occurrence/download/[key]`:

{
  "key": "0009286-190918142434337",
  "doi": "10.15468/dl.merqrl",
  "license": "http://creativecommons.org/licenses/by-nc/4.0/legalcode",
  "request": {
    "machineDescription": {"any": "thing"}, <====
    "predicate": {
      "type": "and",
      "predicates": [
        {
          "type": "equals",
          "key": "BASIS_OF_RECORD",
          "value": "PRESERVED_SPECIMEN",
          "matchCase": false
        }
      ]
    },
    "sendNotification": true,
    "format": "SIMPLE_CSV",
    "type": "OCCURRENCE",
    "verbatimExtensions": []
  },
  "created": "2019-10-03T08:36:21.458+00:00",
  "modified": "2025-02-04T13:54:12.090+00:00",
  "eraseAfter": "2026-02-04T13:54:12.080+00:00",
  "status": "SUCCEEDED",
  "downloadLink": "https://api.gbif.org/v1/occurrence/download/request/0009286-190918142434337.zip",
  "size": 103768055,
  "totalRecords": 845705,
  "numberDatasets": 167
}

The API doesn't exist yet, but I can start implementing a UI based on the above.
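As a rough illustration of how the UI could use this field, the sketch below attaches the submitted form parameters as `machineDescription` when building a request body and reads them back from a download response. Only the field names shown in the JSON above come from this thread. The helper names, the `Json` type and the example values (`FAMILY`, `YEAR`) are assumptions, and the API itself does not exist yet.

```typescript
// Any JSON-serialisable value; machineDescription is described as free-form.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Subset of the request fields shown in the example above.
interface DownloadRequest {
  creator: string;
  sendNotification: boolean;
  format: string;
  predicate?: Json;
  machineDescription?: Json;
}

// Return a copy of the request with the form parameters attached.
function attachMachineDescription(
  request: DownloadRequest,
  form: { [key: string]: Json }
): DownloadRequest {
  return { ...request, machineDescription: form };
}

// Read the description back from a download response, if one was stored.
function readMachineDescription(response: {
  request: { machineDescription?: Json };
}): Json | undefined {
  return response.request.machineDescription;
}

const baseRequest: DownloadRequest = {
  creator: "gbif_username",
  sendNotification: true,
  format: "SIMPLE_CSV",
};

// Illustrative form values, not the actual UI model.
const body = attachMachineDescription(baseRequest, {
  taxonomicDimension: "FAMILY",
  temporalDimension: "YEAR",
  spatialDimension: "EEA_REFERENCE_GRID",
});
console.log(JSON.stringify(body, null, 2));
```

Keeping the description as opaque JSON means the download page can re-populate the form from it without parsing the SQL.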

MattBlissett · 2025-02-16T17:13:07Z

While checking the specification, I found these differences in the UI. At least this one is important to fix:

The UI needs to exclude occurrences where a value is wider than a dimension, e.g. add AND speciesKey IS NOT NULL where the measure is on speciesKey; add AND year IS NOT NULL where the dimension is on year.

For measure → occurrence count at higher taxonomic level — the difference from the specification might be intended:

"family SHOULD be selected by default for cubes with a taxonomic dimension at taxon level (acceptedKey, taxonKey), species level (speciesKey) or genus level (genusKey). The direct higher rank SHOULD be selected by default for other cubes with a higher taxonomic dimension."
"It SHOULD NOT be possible to select more than one rank. Note that it is theoretically possible to provide this measure for all (higher) ranks."

I'll do the tasks assigned to me above next week.
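The null-exclusion rule in the first point could be sketched as a small helper that appends one `IS NOT NULL` condition per dimension column to an existing WHERE clause. `withNotNullFilters` and the `countryCode` filter are hypothetical names used only for illustration; `speciesKey` and `year` are written as they appear above, and a real query may need to quote some column names.

```typescript
// Append "<column> IS NOT NULL" for every dimension column to a WHERE clause.
// The existing clause is parenthesised so that any OR inside it keeps its meaning.
function withNotNullFilters(whereClause: string, columns: string[]): string {
  const conditions = columns.map((column) => `${column} IS NOT NULL`);
  const existing = whereClause.trim();
  const parts = existing ? [`(${existing})`, ...conditions] : conditions;
  return parts.join(" AND ");
}

// A cube with a species dimension and a year dimension.
console.log(withNotNullFilters("countryCode = 'FR'", ["speciesKey", "year"]));
```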

MortenHofft · 2025-02-17T12:34:20Z

From specs provided in word document

Measures

Occurrence count at higher taxonomic level (multiple select, default: all above, limit options based on chosen taxonomic dimension)

@peterdesmet what to do here? Above quote from our word document is different from bullet 2+3 above

dnoesgaard · 2025-02-17T15:59:51Z

I have now written a short page about cubes: https://www.gbif.org/occurrence-cubes (visuals pending @javiesgm) and an About page for the SQL downloads editor: https://www.gbif.org/occurrence/download/sql#about

Feedback appreciated!

peterdesmet · 2025-02-17T19:30:01Z

Regarding bullet 2 & 3. The current implementation is:

0 (none selected): (option hidden)
1 exact taxon:    kingdom, phylum, class, order, family, genus
2 accepted taxon: kingdom, phylum, class, order, family, genus
3 species:        kingdom, phylum, class, order, family, genus
4 genus:          kingdom, phylum, class, order, family
5 family:         kingdom, phylum, class, order
6 order:          kingdom, phylum, class
7 class:          kingdom, phylum
8 phylum:         kingdom
9 kingdom:        (option hidden)

The specs are:

family SHOULD be selected by default for cubes with a taxonomic dimension at taxon level (acceptedKey, taxonKey), species level (speciesKey) or genus level (genusKey).

That is the case: family is selected by default for all four levels (1, 2, 3, 4). Other (higher) ranks are also selected by default, but that is fine: they are always higher than the selected level + the user can turn this off.

The direct higher rank SHOULD be selected by default for other cubes with a higher taxonomic dimension.

That is also the case (5, 6, 7, 8). Other (higher) ranks are also selected by default, but that is fine.

I think the UI has an easy to follow and sensible approach, that follows the specs.

Let me know if you need input on @MattBlissett's bullet point 1.
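For reference, the table and the two spec sentences above can be expressed as a short sketch. This is not the UI code: `higherRankOptions` reproduces the options listed per taxonomic dimension, and `specDefaultRank` returns the single rank the specification asks to be selected by default. The function names and the lower-case dimension labels are assumptions.

```typescript
// Higher ranks from highest to lowest, as listed in the table above.
const RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"] as const;
type Rank = (typeof RANKS)[number];
type TaxonomicDimension = "exact taxon" | "accepted taxon" | "species" | Rank;

// Ranks offered for "occurrence count at higher taxonomic level".
// An empty array means the option is hidden.
function higherRankOptions(dimension?: TaxonomicDimension): Rank[] {
  if (dimension === undefined) return [];
  const index = RANKS.indexOf(dimension as Rank);
  // Exact taxon, accepted taxon and species sit below genus: offer every rank.
  // Otherwise offer only the ranks strictly above the selected one.
  return index === -1 ? [...RANKS] : RANKS.slice(0, index);
}

// Default per the spec: family where it is available, else the direct higher rank.
function specDefaultRank(dimension?: TaxonomicDimension): Rank | undefined {
  const options = higherRankOptions(dimension);
  if (options.indexOf("family") !== -1) return "family";
  return options[options.length - 1];
}

console.log(higherRankOptions("genus"), specDefaultRank("genus"));
console.log(higherRankOptions("order"), specDefaultRank("order"));
```

Since the UI selects every offered rank by default, the rank returned by `specDefaultRank` is always part of that selection.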

MortenHofft · 2025-02-18T08:15:09Z

Thanks @peterdesmet
You do not address bullet 3 explicitly

"It SHOULD NOT be possible to select more than one rank. Note that it is theoretically possible to provide this measure for all (higher) ranks."

But, it sounds like you believe the added option to select more ranks is fine?

MortenHofft · 2025-02-18T09:13:00Z

For bullet 1 I've added additional restrictions via the predicate.

taxonomicDimension
taxonomicDimension = KINGDOM { type: 'isNotNull', parameter: 'KINGDOM_KEY' }
taxonomicDimension = PHYLUM { type: 'isNotNull', parameter: 'PHYLUM_KEY' }
...
taxonomicDimension = EXACT_TAXON { type: 'isNotNull', parameter: 'TAXON_KEY' }
taxonomicDimension = ACCEPTED_TAXON { type: 'isNotNull', parameter: 'ACCEPTED_TAXON_KEY' }

temporalDimension

YEAR isNotNull
YEAR and MONTH isNotNull
YEAR, MONTH and DAY isNotNull

spatialDimension
I add the same predicate whenever any spatial dimension is selected, so there is no filtering on Europe for EEA_REFERENCE_GRID:

{
  type: 'equals',
  key: 'HAS_COORDINATE',
  value: 'true',
}
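Put together, the restrictions listed above could be derived from the form roughly as follows. This is a sketch under assumptions, not the committed code: `cubeRestrictions`, the `CubeForm` shape and the temporal dimension values (`YEAR`, `YEAR_MONTH`, `DATE`) are hypothetical, and the taxonomic mapping contains only the four entries spelled out above (the elided ones are left out).

```typescript
// Predicate shapes as written in this comment.
type Predicate =
  | { type: "isNotNull"; parameter: string }
  | { type: "equals"; key: string; value: string };

// Hypothetical form state; the temporal values are illustrative names.
interface CubeForm {
  taxonomicDimension?: string;
  temporalDimension?: "YEAR" | "YEAR_MONTH" | "DATE";
  spatialDimension?: string;
}

// Only the mappings listed explicitly above; the others are elided there too.
const TAXON_PARAMETER: Record<string, string> = {
  KINGDOM: "KINGDOM_KEY",
  PHYLUM: "PHYLUM_KEY",
  EXACT_TAXON: "TAXON_KEY",
  ACCEPTED_TAXON: "ACCEPTED_TAXON_KEY",
};

const TEMPORAL_PARAMETERS = {
  YEAR: ["YEAR"],
  YEAR_MONTH: ["YEAR", "MONTH"],
  DATE: ["YEAR", "MONTH", "DAY"],
};

// Collect the extra predicates implied by the selected dimensions.
function cubeRestrictions(form: CubeForm): Predicate[] {
  const predicates: Predicate[] = [];
  const taxonParameter = form.taxonomicDimension
    ? TAXON_PARAMETER[form.taxonomicDimension]
    : undefined;
  if (taxonParameter) {
    predicates.push({ type: "isNotNull", parameter: taxonParameter });
  }
  if (form.temporalDimension) {
    for (const parameter of TEMPORAL_PARAMETERS[form.temporalDimension]) {
      predicates.push({ type: "isNotNull", parameter });
    }
  }
  // Same restriction for every grid, so no Europe filter for the EEA grid.
  if (form.spatialDimension) {
    predicates.push({ type: "equals", key: "HAS_COORDINATE", value: "true" });
  }
  return predicates;
}

console.log(
  cubeRestrictions({
    taxonomicDimension: "KINGDOM",
    temporalDimension: "YEAR_MONTH",
    spatialDimension: "EEA_REFERENCE_GRID",
  })
);
```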

peterdesmet · 2025-02-18T19:07:56Z

@MortenHofft

You do not address bullet 3 explicitly.

Oh, indeed. But it seems that was a misguided requirement. 😄 I don't see why we should limit it to one. I'll take a note of that when I do a minor revision of the requirements.

@dnoesgaard thanks for the pages, I'll try to review those next week.

MortenHofft added the Needs clarification label Nov 4, 2024

MortenHofft mentioned this issue Dec 17, 2024

Add a context property to downloads gbif/occurrence#371

Open

MortenHofft added a commit that referenced this issue Feb 5, 2025

attach machine description to sql downloads. #1978

c9deb7f

MortenHofft added a commit that referenced this issue Feb 6, 2025

show machineDescriptions on SQL downloads. See #1978 (comment)

6ebac24
