Enforce uniqueness for all GUID fields #6259

Open
grantfitzsimmons opened this issue Feb 20, 2025 · 4 comments
Labels
1 - Enhancement Improvements or extensions to existing behavior

grantfitzsimmons commented Feb 20, 2025

Is your feature request related to a problem? Please describe.
GUID (Globally Unique Identifier) fields in Specify are not guaranteed to be globally unique. This creates potential data integrity issues: duplicate GUIDs can exist across different tables or collections, causing confusion and errors in data retrieval and management. Allowing duplicates contradicts the very definition of a GUID, complicates data handling, and can lead to incorrect associations between records.

https://guid.one/guid

A GUID is an acronym that stands for Globally Unique Identifier, they are also referred to as UUIDs or Universally Unique Identifiers - there is no real difference between the two. Technically they are 128-bit unique reference numbers used in computing which are highly unlikely to repeat when generated despite there being no central GUID authority to ensure uniqueness.

Describe the solution you'd like
I propose that all GUID fields in Specify be made globally unique as a strict requirement. Per @melton-jason, this could be implemented by modifying the existing guid_rules business rule to enforce an implicit uniqueness rule for all GUID fields across the application.

This rule should be non-configurable to ensure consistency and reliability in data management.

We need to implement a mechanism to identify and manage cases of duplicated GUIDs, raising appropriate business rule exceptions when necessary. This could involve using a save blocker when editing affected records or generating a report for an administrator that indicates instances of duplication, similar to the process when configuring a new uniqueness rule for the first time.
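
As a rough illustration of the administrator report described above, the following sketch groups records by GUID across tables and returns only the GUIDs that appear more than once. The function name `find_duplicate_guids` and the `(table_name, record_id, guid)` tuple layout are hypothetical and not part of Specify; comparing GUIDs case-insensitively is also an assumption.

```python
from collections import defaultdict


def find_duplicate_guids(records):
    """Return {guid: [(table_name, record_id), ...]} for GUIDs used more than once.

    `records` is an iterable of (table_name, record_id, guid) tuples.
    This layout is a hypothetical stand-in for rows read from the database.
    """
    owners_by_guid = defaultdict(list)
    for table_name, record_id, guid in records:
        # Records without a GUID cannot collide, so skip them.
        if guid is None or not guid.strip():
            continue
        # Assumption: GUIDs are compared case-insensitively, ignoring
        # surrounding whitespace.
        key = guid.strip().lower()
        owners_by_guid[key].append((table_name, record_id))
    # Keep only GUIDs that are shared by two or more records.
    return {
        guid: owners
        for guid, owners in owners_by_guid.items()
        if len(owners) > 1
    }
```

A save blocker could reuse the same lookup: if the GUID being saved already maps to a different record, the save would be rejected with a business rule exception.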

Describe alternatives you've considered
One alternative is to keep the current system unchanged and rely on users to manage GUID uniqueness manually, but this approach is prone to human error and does not provide any checks to ensure data integrity. Another alternative is to only enforce uniqueness for certain critical tables, but this could lead to inconsistent data management practices across the application. We should really follow the definition as written.

Reported By
Andy Bentley at KU Ichthyology & Specify

Additional context
The current uniqueness rule only applies to CollectionObject -> guid and Storage -> guid, which are modifiable by the user. This inconsistency can lead to issues as highlighted in the comment by @melton-jason. If we establish a comprehensive uniqueness requirement for all GUID fields, we can improve data integrity.

For reference, please see the discussion here between @melton-jason and @acbentley: GitHub Comment.

List of Tables with GUIDs

Table Name Auto-generate GUID? Uniqueness Rule Exists?
Agent
Attachment
CollectingEvent
Collection
CollectionObject
CollectionObjectProperty
Determination
Geography
GeologicTimePeriod
Institution
Journal
LithoStrat
Locality
MaterialSample
Preparation
PreparationProperty
ReferenceWork
Taxon
Storage
@acbentley

I still believe that a true GUID field should not be user-enterable even if uniqueness is enforced. Otherwise, someone could enter 12345 as a GUID and, as long as it is unique for that table, it would be valid. However, that entry is by no means globally unique. If we really want a user-enterable unique identifier, then call it that rather than a GUID. I recommend having both a GUID field that is populated by the system with a UUID, as with other tables, and a user-enterable unique identifier field to satisfy both requirements if needed.
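
To make the 12345 example concrete, here is a minimal sketch using Python's standard `uuid` module to tell a well-formed UUID from an arbitrary unique string. The helper name `parse_uuid` is hypothetical, and requiring the canonical hyphenated form is an assumption rather than existing Specify behavior.

```python
import uuid


def parse_uuid(value):
    """Return a uuid.UUID if `value` is a canonical hyphenated UUID string, else None."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        # Not parseable as a UUID at all (for example "12345", None, or an int).
        return None
    # uuid.UUID also accepts braces, "urn:uuid:" prefixes, and bare hex.
    # Assumption: only the canonical 8-4-4-4-12 form should count.
    if str(parsed) != value.lower():
        return None
    return parsed
```

A value such as `12345` yields `None`, while a stored UUID yields an object whose `.version` attribute reports which UUID version it follows.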

@acbentley

I suspect, given the push by DiSSCo and others for DOIs, that we may need to support those at some point too.

@mpitblado

@acbentley, I think I agree with you in principle, and don't really have a horse in this race. However, I would like to express that I do not think the field should be completely off limits to users, as old data may come with existing GUIDs that do follow the UUID standard. For instance, when migrating existing data, 5 of our collections used the UUID v1 standard, and those values needed to be assigned to the GUID field on import. We use these fields as the dwc::occurrenceID, so although technically they could be changed, that requires coordination with GBIF to construct the redirects. Under the current system, old GUIDs can be kept while new records get a new UUID v4 generated.

Having both a system-generated and a user-generated one would work, but adds complexity, especially because in the case above it would fracture the field in two (old would be in one field, new would be in another). I am of the opinion that the current setup for the collection object GUID, in which new collection objects have a UUID v4 generated by default, and are read-only by default, but can still be accessed by the user, makes sense, and would make sense for other tables as well. While an institution that chooses to use 12345 as a GUID may be incorrect (from our standpoint), that represents their choice to do so, and requires effort to do so such that they are unlikely to do it accidentally. If an institution chose to use an approach other than UUID v4, then they could technically fulfill the requirements for something globally unique (maybe they use some wizardry to construct something that for all intents and purposes is complex enough to be globally unique), and I'm not sure the Specify system needs to be opinionated in this regard.
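
The behavior described here (keep an imported GUID, otherwise generate a UUID v4) could be sketched roughly as follows. The function name `guid_for_save` is hypothetical and this is not Specify's actual implementation.

```python
import uuid


def guid_for_save(existing_guid):
    """Keep a GUID that is already set; otherwise generate a new UUID v4 string."""
    # An imported value (for example a legacy UUID v1) is preserved untouched.
    if existing_guid is not None and existing_guid.strip():
        return existing_guid
    # New records, or records with a blank GUID, get a random UUID v4.
    return str(uuid.uuid4())
```

With this approach, legacy and new identifiers stay in a single field, which avoids the split between a system-generated and a user-entered field.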

@acbentley

@mpitblado Thanks. We just discussed that exact scenario in a meeting. I agree that there are some scenarios where you may want to copy and paste a GUID into a field.
