[traits.build workflow] Add field for specimen identifiers #167

Open
ehwenk opened this issue Aug 1, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

ehwenk commented Aug 1, 2024

Add field(s) for mapping in specimen identifiers, for example for trait data linked to herbarium vouchers, or for trait data where the same specimen/individual is measured in multiple datasets.

We need to consult with ALA / GBIF to ensure we include the field(s) that are most widely used across global biodiversity databases. We may need two fields: one for the more generic case of "same individual measured in different datasets" and a second, more formal one for herbarium vouchers.

@ehwenk ehwenk added the enhancement New feature or request label Aug 1, 2024
@ehwenk ehwenk added this to AusTraits Aug 1, 2024
@ehwenk ehwenk moved this to Backlog in AusTraits Aug 1, 2024
@ehwenk ehwenk moved this from Backlog to In Progress in AusTraits Aug 8, 2024
@ehwenk ehwenk self-assigned this Aug 8, 2024

ehwenk commented Aug 8, 2024

One of our immediate aims is to add column(s) to the traits.build database structure that allow trait observations to be linked to herbarium records, or to each other when a dataset collector has a unique record number that links trait observations across multiple datasets.

We want to be fully compliant with the DwC standard, but also minimise the number of additional fields we add to traits.build, especially as these fields will be blank for the majority of datasets.

Looking through DwC, it seems there are two distinct types of "identifiers" that probably need to be added:

  1.  A record number for casual links between observations. These are record numbers that link across datasets but aren’t GUIDs. We’d most likely use [dwc:recordNumber](http://rs.tdwg.org/dwc/terms/recordNumber), defined as “An identifier given to the dwc:Occurrence at the time it was recorded. Often serves as a link between field notes and a dwc:Occurrence record, such as a specimen collector's number.”
    
  2.  An identifier that links to *ALL* herbarium vouchers, GBIF, etc. This will be either [dwc:occurrenceID](https://dwc.tdwg.org/list/#dwc_occurrenceID) or [dwc:catalogNumber](https://dwc.tdwg.org/list/#dwc_catalogNumber). I think occurrenceID should already incorporate codes for the specific herbarium/collection, while catalogNumber would require that we also have a column for the herbarium/institution ([dwc:institutionCode](https://dwc.tdwg.org/list/#dwc_institutionCode)) and maybe other details. On the other hand, within ALA, the occurrenceID is part of the URL but isn’t actually reported on the page.
    
  • dwc:occurrenceID: “An identifier for the dwc:Occurrence (as opposed to a particular digital record of the dwc:Occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the dwc:occurrenceID globally unique.”

  • dwc:catalogNumber: “An identifier (preferably unique) for the record within the data set or collection.”

I don't think the two identifier categories can be merged, or we'd be diverging from the DwC meaning of each.
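To make the occurrenceID definition concrete ("construct one from a combination of identifiers in the record"), here is a minimal Python sketch of one way a fallback identifier could be assembled. It is an illustration only, not something traits.build does: the function name `fallback_occurrence_id`, the `urn:catalog:` prefix, and the institution, collection and catalog values are all assumptions chosen for the example.

```python
from typing import Optional


def fallback_occurrence_id(
    institution_code: str,
    catalog_number: str,
    collection_code: Optional[str] = None,
) -> str:
    """Combine identifiers already in a record into a best-effort occurrenceID.

    Hypothetical helper: the "urn:catalog:" prefix and the colon-separated
    layout are assumptions for this sketch, not a traits.build convention.
    """
    if not (institution_code and institution_code.strip()):
        raise ValueError("institution_code is required")
    if not (catalog_number and catalog_number.strip()):
        raise ValueError("catalog_number is required")
    parts = [institution_code, collection_code, catalog_number]
    # Skip the optional collection code when it is missing or blank.
    cleaned = [p.strip() for p in parts if p and p.strip()]
    return "urn:catalog:" + ":".join(cleaned)


# Made-up values, for illustration only.
print(fallback_occurrence_id("NSW", "123456"))
# urn:catalog:NSW:123456
print(fallback_occurrence_id("NSW", "123456", collection_code="Herbarium"))
# urn:catalog:NSW:Herbarium:123456
```

The sketch also shows why catalogNumber on its own is weaker: it only becomes close to globally unique once it is combined with an institution (and possibly collection) code.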

As examples, see these records in ALA and GBIF:

https://biocache.ala.org.au/occurrences/60455440-c777-43d9-9cc0-19354cbc8403

https://www.gbif.org/occurrence/2430993462

The AusTraits team set out with the goal of changing traits.build as little as possible, but before we do this I think we should consider whether there are any other “occurrence” metadata fields we should add at the same time. At the moment we don’t explicitly include the concept of an “occurrence” in the traits.build structure. It is implicit via observationID and an observation's geographic location (latitude/longitude): to observe an organism in a location, on a date, it must have occurred there.

A few relevant references:

Nelson G, Sweeney P, Gilbert E (2018) Use of globally unique identifiers (GUIDs) to link herbarium specimen records to physical specimens. Applications in Plant Sciences 6, e1027. doi:10.1002/aps3.1027.

Folk RA, Siniscalchi CM (2021) Biodiversity at the global scale: the synthesis continues. American Journal of Botany 108, 912–924. doi:10.1002/ajb2.1694.

ehwenk commented Aug 27, 2024

Further research suggests dwc:institutionCode will also be required to uniquely link to observations/collections in the ALA, GBIF, and other collections. For instance, for the Australian Museum, the catalog number does not include the institution code.

ehwenk commented Sep 2, 2024

Darwin Core also has a field, dwc:associatedSequences, which allows one to link to one or more identifiers for genetic sequence information. This is a new Darwin Core addition as part of its MaterialEntity class.
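Because dwc:associatedSequences can hold more than one identifier, whatever we add needs to cope with list values. Darwin Core's recommended practice for list-type terms is to separate values with a vertical bar (` | `). A minimal Python sketch of splitting such a value follows; the function name `split_dwc_list` and the accession strings are placeholders invented for the example.

```python
from typing import List


def split_dwc_list(value: str) -> List[str]:
    """Split a pipe-delimited Darwin Core list value into clean items."""
    # Tolerate missing spaces around the bar and drop empty entries.
    return [item.strip() for item in value.split("|") if item.strip()]


# Placeholder accession strings, not real sequence identifiers.
raw_value = "ACC0001 | ACC0002|ACC0003"
print(split_dwc_list(raw_value))
# ['ACC0001', 'ACC0002', 'ACC0003']
```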

ehwenk commented Sep 5, 2024

Further thoughts with @dfalster:

  • Identifiers will be in a separate relational table, linked back to the traits table via observation_id
  • The identifiers table will be in long format with columns observation_id, identifier, identifier_value and identifier_comments
  • All identifiers used will come from a controlled vocabulary (specified in schema), but this can include many of the various identifiers used by other biodiversity databases, genomics aggregators, etc.
  • All identifiers will be terms that are formally defined, generally in DarwinCore, but perhaps on occasion in other vocabularies

This will be easy to implement and has the advantages that:

  1. We won't be adding many columns to the traits table
  2. We can implement this enhancement now, without worrying that we have missed a necessary identifier or not yet consulted every traits.build user or biodiversity portal manager, and without having to change the traits.build structure repeatedly later.
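The long-format design described above can be sketched roughly as follows. This is a minimal illustration in Python using an in-memory SQLite database, not the traits.build implementation. The column names for the identifiers table come from the list above; everything else is an assumption made for the example, including the `ALLOWED_IDENTIFIERS` set, the simplified `traits` columns, the helper names, and all of the data values.

```python
import sqlite3

# Hypothetical controlled vocabulary; in practice this would be specified in the schema.
ALLOWED_IDENTIFIERS = {
    "dwc:recordNumber",
    "dwc:occurrenceID",
    "dwc:catalogNumber",
    "dwc:institutionCode",
    "dwc:associatedSequences",
}


def build_demo_db() -> sqlite3.Connection:
    """Create a simplified traits table plus the long-format identifiers table."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE traits (observation_id TEXT, trait_name TEXT, value TEXT);
        CREATE TABLE identifiers (
            observation_id TEXT,
            identifier TEXT,
            identifier_value TEXT,
            identifier_comments TEXT
        );
        """
    )
    return con


def add_identifier(con, observation_id, identifier, value, comments=None):
    """Insert one identifier row, rejecting terms outside the vocabulary."""
    if identifier not in ALLOWED_IDENTIFIERS:
        raise ValueError(f"{identifier!r} is not in the controlled vocabulary")
    con.execute(
        "INSERT INTO identifiers VALUES (?, ?, ?, ?)",
        (observation_id, identifier, value, comments),
    )


def traits_with_identifiers(con):
    """Link identifiers back to trait rows via observation_id."""
    return con.execute(
        """
        SELECT t.observation_id, t.trait_name, i.identifier, i.identifier_value
        FROM traits AS t
        LEFT JOIN identifiers AS i ON i.observation_id = t.observation_id
        ORDER BY t.observation_id, i.identifier
        """
    ).fetchall()


con = build_demo_db()
# Made-up observations: only obs_01 has a voucher.
con.executemany(
    "INSERT INTO traits VALUES (?, ?, ?)",
    [("obs_01", "leaf_area", "12.3"), ("obs_02", "leaf_area", "9.8")],
)
add_identifier(con, "obs_01", "dwc:institutionCode", "NSW")
add_identifier(con, "obs_01", "dwc:catalogNumber", "123456")

for row in traits_with_identifiers(con):
    print(row)
```

Observations without any identifier (the majority of datasets) simply have no rows in the identifiers table, and adding a new identifier type later only means extending the vocabulary, not altering the table structure.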

ehwenk commented Dec 5, 2024

Comments copied from issue #169 (same text as the Aug 8, 2024 comment above).
