Occurrence cube downloads #1978

Open
MortenHofft opened this issue Nov 4, 2024 · 15 comments


@MortenHofft
Member

MortenHofft commented Nov 4, 2024

ping @peterdesmet and @MattBlissett

I've started the work of adding a new download option

And an SQL ui for downloads

Could you please help evaluate if this is functionally what you had in mind?
Functionally there are 2 things that aren't working

If the functions are as you expected, then we could think about usability (bearing in mind that we will have to reimplement this in the near future)

  • Do we need articles or help pages to refer to?
  • More help texts in the UI?
  • Different labels?
  • Extend the general download API with a description field, which we could auto-generate (from the form) for a more human-readable context.
  • Allow comments in the SQL?
  • What is a good name for the download type? And what title?
  • How to make this meaningful with the table we have for other formats to show what you can expect?
  • Does an about page for the SQL editor make sense? There is a placeholder for now. If so, someone should write it.
  • ...?
@peterdesmet
Member

peterdesmet commented Nov 6, 2024

Nice!! Here's some feedback.

Download page

  • Download type name: Cube rather than SQL cube
  • Coordinates column can be ✔ (if selected). Currently this uses a symbol that is different than others ( not )

Modal

  • Title Download cube is fine

  • Modal help text:

    This download format allows you to aggregate occurrences by their taxonomic, temporal and/or spatial properties. For example, a data cube can be configured to aggregate occurrences by family, month and grid cell of the European Environment Agency reference grid (three dimensions) and count the number of occurrences (a measure) per combination. The result is a CSV file.

    Once configured, a SQL query will be created to generate the data cube. For more advanced use, it is possible to further customize the query by editing the created SQL.

    You can read more about species occurrence cubes here.

  • Dimensions help text:

    A dimension represents an aspect along which data can be aggregated. Selecting a higher resolution (e.g. species over family, date over year, 100 m over 10 km) will result in more categories and therefore more records.

  • Taxonomic dimension help text:

    This dimension aggregates occurrences by their taxonomic rank.

  • Temporal dimension help text:

    This dimension aggregates occurrences by time.

  • Spatial dimension help text:

    This dimension aggregates occurrences in a spatial grid.

  • Spatial resolution help text:

    The size of each grid cell.

  • I would update the values and order for Spatial resolution + add sections:

Global
- Military grid reference systems (MGRS)
- Extended quarter degree grid (QDGC)
- ISEA3H grid
Europe
- EEA reference grid
  • The values for Spatial resolution should probably be updated slightly. For one, I think that the spatial resolution for MGRS is in meters.

  • EEA spatial resolution: add a space between value and unit (1 km not 1km etc.)

  • MGRS spatial resolution: move finest resolution to top of dropdown, so values are in order. I don't know what this value entails though (I thought 1 m was the finest)

  • Can we have better spatial resolution labels than Level 0-6 for EXTENDED_QUARTER_DEGREE_GRID? E.g. name them after how big they are in degrees. Input from @MattBlissett needed.

  • Can we have better spatial resolution labels than Level 0-22 for ISEA3H_GRID? E.g. name them after how big they are (in meters?). Input from @MattBlissett needed.

  • Randomize points within uncertainty circle help text:

    For occurrence records with a coordinate uncertainty that covers more than one grid cell, should a random cell be chosen? If no is chosen, then the cell containing the centroid of the record is used.

  • Labels are good, would use regular case for Randomize points within uncertainty circle

  • Rename Measurements to Measures

  • Measures help text:

    A calculated quantitative value for each combination of dimensions.

  • Currently unclear that occurrence count is always included as a measurement. How to best indicate this?

  • Occurrence count (always included) help text:

    The number of occurrences.

  • Occurrence count at higher taxonomic level help text:

    Additional higher taxonomic ranks for which the number of occurrences should also be included. Useful to assess sampling bias.

  • Include minimum coordinate uncertainty help text:

    The lowest recorded coordinate uncertainty (in meters). Useful to assess the spatial precision of the data.

  • Include minimum temporal uncertainty help text:

    The lowest recorded temporal uncertainty (in seconds). Useful to assess the temporal precision of the data.
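
To make the help texts above concrete, here is a minimal in-memory sketch of what a cube with three dimensions (family, year-month, grid cell) and two measures (occurrence count, minimum coordinate uncertainty) computes. The `Occurrence` shape, the pre-assigned `cellCode` field and the `buildCube` function are illustrative assumptions; the real cube is generated by an SQL query on the server, not by code like this.

```typescript
// Hypothetical, simplified occurrence record; all field names are assumptions.
interface Occurrence {
  family: string;
  year: number;
  month: number;
  cellCode: string; // grid cell assumed to be assigned already
  coordinateUncertaintyInMeters: number;
}

// One row per combination of dimension values.
interface CubeRow {
  family: string;
  yearMonth: string;
  cellCode: string;
  occurrences: number; // measure: occurrence count
  minCoordinateUncertaintyInMeters: number; // measure: lowest recorded uncertainty
}

// Group records by (family, year-month, cell) and compute both measures.
function buildCube(records: Occurrence[]): CubeRow[] {
  const index: Record<string, CubeRow> = {};
  const rows: CubeRow[] = [];
  for (const r of records) {
    const yearMonth = r.year + "-" + (r.month < 10 ? "0" : "") + r.month;
    const key = [r.family, yearMonth, r.cellCode].join("|");
    const existing = index[key];
    if (existing === undefined) {
      const row: CubeRow = {
        family: r.family,
        yearMonth: yearMonth,
        cellCode: r.cellCode,
        occurrences: 1,
        minCoordinateUncertaintyInMeters: r.coordinateUncertaintyInMeters,
      };
      index[key] = row;
      rows.push(row);
    } else {
      existing.occurrences += 1;
      existing.minCoordinateUncertaintyInMeters = Math.min(
        existing.minCoordinateUncertaintyInMeters,
        r.coordinateUncertaintyInMeters
      );
    }
  }
  return rows;
}
```

Choosing a finer dimension (species instead of family, or a smaller cell) produces more distinct keys and therefore more rows, which is the effect the Dimensions help text describes.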

SQL editor

  • Comments in SQL are fine by me, though I'm not sure whether the query string retains them
  • Update help text at bottom to (no then):

The easiest way to download and explore data is via the occurrence search user interface. But for complex queries and aggregations, the SQL editor provides more freedom.

@MortenHofft
Member Author

Thanks @peterdesmet

On the SQL editor, I would include the link to the occurrence search in the text, rather than a button:

I agree it is nicer. It is a button only because that makes life easier for translators. Having them write markdown with variables has caused issues in the past.

@MortenHofft
Member Author

MortenHofft commented Nov 11, 2024

The values for Spatial resolution should probably be updated slightly. For one, I think that the spatial resolution for MGRS is in meters.

Yeah I know those are wrong. I'm waiting for you or Matt to tell me what they should be please. I've changed the MGRS as you specified above

EXTENDED_QUARTER_DEGREE_GRID should be?
ISEA3H_GRID should be?

@MortenHofft
Member Author

MortenHofft commented Nov 11, 2024

I've added mock help texts to all fields and added 2 mock articles (one for sql download and one for cubes).

Help texts
If someone with better English skills and understanding could correct the help texts, that would be great. Alternatively, I can try my best; it is just the type of thing that takes me forever. If you believe some fields are self-explanatory, let me know and I can remove the help text.

  • review help texts
  • Style the help texts a bit. The number of help texts makes it all a bit bland, I find. More spacing would probably help.

Articles
For the articles: someone needs to write them if we still want them.
https://www.gbif-uat.org/occurrence-cubes
https://www.gbif-uat.org/occurrence/download/sql#about

  • write tool text
  • write cube article

Known API bugs

  • the field naming for order is different between environments
  • downloads do not work in UAT, not sure about other env

Download pages
Arriving at a download page is confusing if you come from a cube download format. You configured a cube via a UI, and then arrive at an SQL string. It is a requirement to display this better. One way to go about it could be to add a new feature to downloads generally.

  • API option to attach a human readable description of a download when doing a download.
  • Auto generate a human readable description for cube downloads.
  • Always show available descriptions on download pages.

That is just one idea. Other ideas for how to make the transition easier for users are welcome
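
As a rough illustration of the "auto generate a human readable description" idea, the sketch below turns a cube form into one sentence. The `CubeForm` shape, its field names and the wording of the output are all assumptions made for this example; the actual form model may look different.

```typescript
// Hypothetical shape of the cube form; every field name is an assumption.
interface CubeForm {
  taxonomicDimension?: string; // e.g. "FAMILY"
  temporalDimension?: string; // e.g. "YEAR"
  spatialDimension?: string; // e.g. "EEA_REFERENCE_GRID"
  spatialResolution?: string; // e.g. "1 km"
}

// Build a short description listing only the dimensions the user selected.
function describeCube(form: CubeForm): string {
  const dimensions: string[] = [];
  if (form.taxonomicDimension) {
    dimensions.push("taxon: " + form.taxonomicDimension);
  }
  if (form.temporalDimension) {
    dimensions.push("time: " + form.temporalDimension);
  }
  if (form.spatialDimension) {
    const resolution = form.spatialResolution ? " at " + form.spatialResolution : "";
    dimensions.push("grid: " + form.spatialDimension + resolution);
  }
  if (dimensions.length === 0) {
    return "Occurrence count without dimensions";
  }
  return "Occurrence cube aggregated by " + dimensions.join("; ");
}
```

A real implementation would need translated labels rather than raw identifiers, which is one reason a structured description may be easier to maintain than a stored sentence.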

Other

  •  review grid resolution translation names
  • Ask coms and data products to provide feedback, refine styling and text.

@timrobertson100
Member

Thanks. I think the text helps in guiding the user.

I think adding the ability to give it a human-readable type / description would be good. Alternatively, we could introduce a cube download format in the API itself, which takes the form parameters but does the SQL conversion in the backend. The reason to do that would be to display the parameters used on the download page, which is shown from the DOI. A user could still "open this in the SQL builder" before submitting to do more complicated queries, but it'd hide SQL completely for anyone who didn't. I don't know what would be the more scalable option.

@peterdesmet
Member

  1. @MortenHofft I have reviewed the modal and the help text. See my updated Occurrence cube downloads #1978 (comment). Two inputs from @MattBlissett needed.

  2. Is the functionality ready for testing?

  3. Do we need a separate (stable) help page at https://www.gbif-uat.org/occurrence-cubes describing the functionality or is it sufficient to refer to https://techdocs.gbif.org/en/data-use/data-cubes? To be assessed by the communication team.

  4. @timrobertson100 having an endpoint "which takes the form parameters" would indeed be better documentation of the dimensions in the recorded metadata, which in turn would help conversion to e.g. EBV Cubes. Just having the SQL statement doesn't tell us anything about the dimensions that were selected, since the columns can be named however they want.

@timrobertson100
Member

On 4. please see our proposed approach here. A JSON object (the context) would hold the submitted form parameters.

@MortenHofft
Member Author

In agreement with Matt we will add a field called machineDescription. When creating a download it can be provided as

{
  "machineDescription": {"any": "thing"}, <====
  "creator": "gbif_username",
  "sendNotification": true,
  "notification_address": [
    "[email protected]"
  ],
  "format": "DWCA",
  "predicate": {
    "type": "and",
    "predicates": [
      {
        "type": "equals",
        "key": "COUNTRY",
        "value": "FR"
      },
      {
        "type": "equals",
        "key": "YEAR",
        "value": "2017"
      }
    ]
  },
  "verbatimExtensions": [
    "http://rs.tdwg.org/ac/terms/Multimedia"
  ]
}
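
A client could attach the proposed field as sketched below. This assumes `machineDescription` is accepted as free-form JSON, as described above; the `DownloadRequest` type only lists a few of the fields from the example, and `withMachineDescription` is a hypothetical helper, not part of any existing client.

```typescript
// Predicate shapes limited to the two used in the example request above.
type RequestPredicate =
  | { type: "equals"; key: string; value: string }
  | { type: "and"; predicates: RequestPredicate[] };

// Subset of the request body shown above.
interface DownloadRequest {
  creator: string;
  sendNotification: boolean;
  format: string;
  predicate: RequestPredicate;
  // Proposed field (not yet in the API): arbitrary JSON describing the request.
  machineDescription?: unknown;
}

// Return a copy of the request with the description attached; the input is not mutated.
function withMachineDescription(
  request: DownloadRequest,
  description: unknown
): DownloadRequest {
  return { ...request, machineDescription: description };
}
```

The result can be serialized with `JSON.stringify` and sent as the request body once the API supports the field.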

And in the response for `occurrence/download/[key]`:

{
  "key": "0009286-190918142434337",
  "doi": "10.15468/dl.merqrl",
  "license": "http://creativecommons.org/licenses/by-nc/4.0/legalcode",
  "request": {
    "machineDescription": {"any": "thing"}, <====
    "predicate": {
      "type": "and",
      "predicates": [
        {
          "type": "equals",
          "key": "BASIS_OF_RECORD",
          "value": "PRESERVED_SPECIMEN",
          "matchCase": false
        }
      ]
    },
    "sendNotification": true,
    "format": "SIMPLE_CSV",
    "type": "OCCURRENCE",
    "verbatimExtensions": []
  },
  "created": "2019-10-03T08:36:21.458+00:00",
  "modified": "2025-02-04T13:54:12.090+00:00",
  "eraseAfter": "2026-02-04T13:54:12.080+00:00",
  "status": "SUCCEEDED",
  "downloadLink": "https://api.gbif.org/v1/occurrence/download/request/0009286-190918142434337.zip",
  "size": 103768055,
  "totalRecords": 845705,
  "numberDatasets": 167
}

The API doesn't exist yet, but I can start implementing a UI based on above.
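
On the UI side, reading the field back could look like the sketch below. It assumes the response nests `machineDescription` under `request`, as in the example above; the `DownloadResponse` type lists only the fields used here, and `machineDescriptionText` is a hypothetical helper name.

```typescript
// Only the parts of the download response that this sketch reads.
interface DownloadResponse {
  key: string;
  request: {
    format: string;
    machineDescription?: unknown;
  };
}

// Return the description as pretty-printed JSON, or null when the download
// has none (for example, downloads created before the field existed).
function machineDescriptionText(download: DownloadResponse): string | null {
  const description = download.request.machineDescription;
  if (description === undefined || description === null) {
    return null;
  }
  return JSON.stringify(description, null, 2);
}
```

A download page could show this text when it is available and fall back to the predicate or SQL display when the function returns `null`.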

@MattBlissett
Member

While checking the specification, I found these differences in the UI. At least this one is important to fix:

  • The UI needs to exclude occurrences where a value is wider than a dimension, e.g. add AND speciesKey IS NOT NULL where the dimension is on speciesKey; add AND year IS NOT NULL where the dimension is on year.

For measure → occurrence count at higher taxonomic level — the difference from the specification might be intended:

  • "family SHOULD be selected by default for cubes with a taxonomic dimension at taxon level (acceptedKey, taxonKey), species level (speciesKey) or genus level (genusKey). The direct higher rank SHOULD be selected by default for other cubes with a higher taxonomic dimension."

  • "It SHOULD NOT be possible to select more than one rank. Note that it is theoretically possible to provide this measure for all (higher) ranks."

I'll do the tasks assigned to me above next week.
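
The default described in the first quoted rule above can be written as a small lookup. This is a sketch of the specification text only, not of the current UI; the dimension identifiers other than those quoted in this thread are assumptions, as is the function name `specDefaultRank`.

```typescript
// Taxonomic dimensions; identifier spellings are partly assumed.
type TaxonomicDimension =
  | "EXACT_TAXON"
  | "ACCEPTED_TAXON"
  | "SPECIES"
  | "GENUS"
  | "FAMILY"
  | "ORDER"
  | "CLASS"
  | "PHYLUM"
  | "KINGDOM";

type HigherRank = "KINGDOM" | "PHYLUM" | "CLASS" | "ORDER" | "FAMILY";

// Default rank for "occurrence count at higher taxonomic level" per the spec:
// family for taxon, species and genus level; otherwise the direct higher rank.
function specDefaultRank(dimension: TaxonomicDimension): HigherRank | null {
  switch (dimension) {
    case "EXACT_TAXON":
    case "ACCEPTED_TAXON":
    case "SPECIES":
    case "GENUS":
      return "FAMILY";
    case "FAMILY":
      return "ORDER";
    case "ORDER":
      return "CLASS";
    case "CLASS":
      return "PHYLUM";
    case "PHYLUM":
      return "KINGDOM";
    default:
      // KINGDOM has no higher rank, so there is nothing to select.
      return null;
  }
}
```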

@MortenHofft
Member Author

From the specs provided in the Word document:

Measures

Occurrence count at higher taxonomic level (multiple select, default: all above, limit options based on chosen taxonomic dimension)

@peterdesmet what to do here? The above quote from our Word document differs from bullets 2 and 3 above.

@dnoesgaard
Member

I have now written a short page about cubes: https://www.gbif.org/occurrence-cubes (visuals pending @javiesgm) and an About page for the SQL downloads editor: https://www.gbif.org/occurrence/download/sql#about

Feedback appreciated!

@peterdesmet
Member

Regarding bullet 2 & 3. The current implementation is:

0 (none selected): (option hidden)
1 exact taxon:    kingdom, phylum, class, order, family, genus
2 accepted taxon: kingdom, phylum, class, order, family, genus
3 species:        kingdom, phylum, class, order, family, genus
4 genus:          kingdom, phylum, class, order, family
5 family:         kingdom, phylum, class, order
6 order:          kingdom, phylum, class
7 class:          kingdom, phylum
8 phylum:         kingdom
9 kingdom:        (option hidden)

The specs are:

family SHOULD be selected by default for cubes with a taxonomic dimension at taxon level (acceptedKey, taxonKey), species level (speciesKey) or genus level (genusKey).

That is the case: family is selected by default for all four levels (1, 2, 3, 4). Other (higher) ranks are also selected by default, but that is fine: they are always higher than the selected level + the user can turn this off.

The direct higher rank SHOULD be selected by default for other cubes with a higher taxonomic dimension.

That is also the case (5, 6, 7, 8). Other (higher) ranks are also selected by default, but that is fine.

I think the UI has an easy to follow and sensible approach, that follows the specs.
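
The option table above follows a simple rule: offer every rank above the selected dimension, up to genus, and select all of them by default. A minimal sketch of that rule follows; the identifiers and the function name `higherRankOptions` are assumptions and do not come from the UI code.

```typescript
// Ranks from highest to lowest, matching the order used in the table above.
const HIGHER_RANKS: string[] = ["KINGDOM", "PHYLUM", "CLASS", "ORDER", "FAMILY", "GENUS"];

// Ranks offered for "occurrence count at higher taxonomic level".
// An empty list means the option is hidden.
function higherRankOptions(dimension: string | null): string[] {
  if (dimension === null) {
    return []; // no taxonomic dimension selected
  }
  if (dimension === "EXACT_TAXON" || dimension === "ACCEPTED_TAXON" || dimension === "SPECIES") {
    return HIGHER_RANKS.slice(); // everything from kingdom to genus
  }
  const position = HIGHER_RANKS.indexOf(dimension);
  // Unknown identifiers yield no options; KINGDOM sits at position 0, so its list is empty.
  return position < 0 ? [] : HIGHER_RANKS.slice(0, position);
}

// In the implementation described above, every offered rank starts out selected.
function defaultSelection(dimension: string | null): string[] {
  return higherRankOptions(dimension);
}
```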

Let me know if you need input on @MattBlissett's bullet point 1.

@MortenHofft
Member Author

Thanks @peterdesmet
You do not address bullet 3 explicitly

"It SHOULD NOT be possible to select more than one rank. Note that it is theoretically possible to provide this measure for all (higher) ranks."

But, it sounds like you believe the added option to select more ranks is fine?

@MortenHofft
Member Author

For bullet 1 I've added additional restrictions via the predicate.

taxonomicDimension
taxonomicDimension = KINGDOM { type: 'isNotNull', parameter: 'KINGDOM_KEY' }
taxonomicDimension = PHYLUM { type: 'isNotNull', parameter: 'PHYLUM_KEY' }
...
taxonomicDimension = EXACT_TAXON { type: 'isNotNull', parameter: 'TAXON_KEY' }
taxonomicDimension = ACCEPTED_TAXON { type: 'isNotNull', parameter: 'ACCEPTED_TAXON_KEY' }

temporalDimension

  • YEAR isNotNull
  • YEAR and MONTH isNotNull
  • YEAR, MONTH and DAY isNotNull

spatialDimension
I just add the same predicate if any spatial dimension is selected, so there is no filtering on Europe for EEA_REFERENCE_GRID.

{
  type: 'equals',
  key: 'HAS_COORDINATE',
  value: 'true',
}
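
Put together, the restrictions above could be derived from the selected dimensions roughly as follows. This is a sketch under assumptions: the taxonomic map contains only the entries spelled out above (the elided ranks are deliberately left out), and the temporal identifiers `YEAR`, `YEAR_MONTH` and `DATE` as well as the `CubeDimensions` shape are hypothetical.

```typescript
// The two predicate shapes used in the comment above.
type DimensionPredicate =
  | { type: "isNotNull"; parameter: string }
  | { type: "equals"; key: string; value: string };

// Only the mappings listed above; the elided ranks are not filled in here.
const TAXON_PARAMETER: Record<string, string> = {
  KINGDOM: "KINGDOM_KEY",
  PHYLUM: "PHYLUM_KEY",
  EXACT_TAXON: "TAXON_KEY",
  ACCEPTED_TAXON: "ACCEPTED_TAXON_KEY",
};

// Hypothetical identifiers for the three temporal options.
const TEMPORAL_PARAMETERS: Record<string, string[]> = {
  YEAR: ["YEAR"],
  YEAR_MONTH: ["YEAR", "MONTH"],
  DATE: ["YEAR", "MONTH", "DAY"],
};

// Hypothetical form state: one optional selection per dimension.
interface CubeDimensions {
  taxonomic?: string;
  temporal?: string;
  spatial?: string;
}

// Collect the predicates that drop records too coarse for the chosen dimensions.
function dimensionPredicates(dimensions: CubeDimensions): DimensionPredicate[] {
  const predicates: DimensionPredicate[] = [];
  const taxonParameter = dimensions.taxonomic
    ? TAXON_PARAMETER[dimensions.taxonomic]
    : undefined;
  if (taxonParameter) {
    predicates.push({ type: "isNotNull", parameter: taxonParameter });
  }
  const temporalParameters = dimensions.temporal
    ? TEMPORAL_PARAMETERS[dimensions.temporal]
    : undefined;
  for (const parameter of temporalParameters || []) {
    predicates.push({ type: "isNotNull", parameter: parameter });
  }
  if (dimensions.spatial) {
    // Same predicate for every grid, with no extra filtering for the EEA grid.
    predicates.push({ type: "equals", key: "HAS_COORDINATE", value: "true" });
  }
  return predicates;
}
```

The resulting list would typically be combined with the user's own filters in an `and` predicate.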

@peterdesmet
Member

@MortenHofft

You do not address bullet 3 explicitly.

Oh, indeed. But it seems that was a misguided requirement. 😄 I don't see why we should limit it to one. I'll take a note of that when I do a minor revision of the requirements.

@dnoesgaard thanks for the pages, I'll try to review those next week.
