Synolib2 #14

Draft
wants to merge 62 commits into
base: main

Conversation

nleanba
Collaborator

@nleanba nleanba commented Oct 27, 2024

Full architecture rework.

Breaking Changes all-around.

Open TODOs:

  • COL-taxa without authority
  • Check if the behaviour is the same on all supported endpoints (those listed on synospecies)
  • Optional: Ability to have the deprecations link for COL-taxa (currently we have "is deprecated by" but not "deprecates")
  • Ability to resolve one of the URIs to a Name:
    i.e. a way to await synoGroup.getExistingName(uri), which resolves with the respective Name (or authorizedName, if applicable) as soon as the URI is found to be a synonym, or with some indication that it is not a synonym. (await synoGroup.getExistingName(uri) does not make any requests!)
    This would allow the treatment details to show the actual names of the things, not just the uri.
  • Optional: Add reason to justifications ("is deprecated by" vs "deprecates")
  • Optional: Handle seeAlso: coluri triples in the treatmentdata linking col taxa to taxon-concepts
  • Optional: Get the "tree" of a taxon, that is all higher taxa (for display purposes)
    This is currently not possible for COL-taxa on lindas, as many are missing there.
  • Handle TN/TC with missing Kingdom
  • Force use of (recent) cached entry if present (like stale-while-revalidate example from here: https://developer.mozilla.org/en-US/docs/Web/API/Request/cache#examples)
  • Consider not using all the COALESCEs in the queries and instead sorting out partial homonyms client-side, possibly splitting the queries into two parts (TN/TC and CoL), as this might speed up the query times.
  • When using the "subtaxa" option, results are missing if it doesn't find anything for the search term. Either consider everything of lower rank that has the search term as its genus directly, or be more lenient in which taxon-names show up (currently many genera without a treatment don't show up, and so neither do their subtaxa)
  • See the review comments below:
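
For the item above about sorting out partial homonyms client-side, a minimal sketch of what such a comparison might look like. The `RankFields` shape and the function names are hypothetical and not taken from SynoLib; the only idea shown is that two records match when every optional rank is either present and equal on both sides or missing on both.

```typescript
// Hypothetical record shape; the field names are assumptions for illustration.
interface RankFields {
  genus: string;
  species?: string;
  infraspecies?: string;
}

// Two optional values match if both are missing, or both are present and equal.
function sameOptional(a?: string, b?: string): boolean {
  return (a ?? null) === (b ?? null);
}

// A genus-only record must not match a species-level record of the same genus,
// which is what would filter out partial homonyms after the query returns.
function sameRanks(a: RankFields, b: RankFields): boolean {
  return (
    a.genus === b.genus &&
    sameOptional(a.species, b.species) &&
    sameOptional(a.infraspecies, b.infraspecies)
  );
}
```

Whether moving this check out of the query actually speeds things up would still need to be measured per endpoint.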

SynonymGroup.ts Outdated
OPTIONAL { ?col dwc:scientificNameAuthorship ?colAuth . } BIND(COALESCE(?colAuth, "") as ?authority)
?col dwc:scientificName ?name . # Note: contains authority
?col dwc:genericName ?genus .
# TODO # ?col dwc:parent* ?p . ?p dwc:rank "kingdom" ; dwc:taxonName ?kingdom .
Collaborator Author

This (and the other similar "# TODO #" lines) is a hack to work around lindas not having all CoL-taxa, for some reason.

It should be uncommented once all CoL data are on lindas; this check prevents some weird edge cases, for example with "Quercus".

SynonymGroup.ts Outdated
FILTER NOT EXISTS { ?col dwc:specificEpithet ?species . }
FILTER NOT EXISTS { ?tn dwc:species ?species . }
}
}
Collaborator Author

In general, a problem is that this query with the UNIONs is much slower on lindas than on our server; however, making them optional and FILTERing with


    OPTIONAL {
      ?col dwc:specificEpithet ?species .
      ?tn dwc:species ?tnspecies .
      OPTIONAL {
        ?col dwc:infraspecificEpithet ?infrasp .
        ?tn dwc:subspecies|dwc:variety ?tninfrasp .
      }
    }

    FILTER (COALESCE(?tnspecies, "N A") = COALESCE(?species, "N A"))
    FILTER (COALESCE(?tninfrasp, "N A") = COALESCE(?infrasp, "N A"))

does not work at all on our server, but is quite fast on lindas.

I am considering having both query variants and selecting which one to use depending on the endpoint. This should be overridable by the SynoLib user, but have hardcoded defaults for known servers (i.e. for those in the synospecies settings).
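
A rough sketch of how that selection could work. Everything here is hypothetical: the `QueryVariant` names, the `chooseVariant` function and the two placeholder URLs are made up for illustration and are neither SynoLib's API nor the real endpoint addresses.

```typescript
// The two query shapes discussed above.
type QueryVariant = "union" | "optional-filter";

// Hardcoded defaults for known servers. The URLs are placeholders only.
const DEFAULT_VARIANTS = new Map<string, QueryVariant>([
  // UNION variant: fast on our server, much slower on lindas.
  ["https://plazi.example/sparql", "union"],
  // OPTIONAL + FILTER variant: fast on lindas, does not work on our server.
  ["https://lindas.example/query", "optional-filter"],
]);

// An explicit user override wins, then the per-endpoint default, then a fallback.
function chooseVariant(
  endpoint: string,
  override?: QueryVariant,
  fallback: QueryVariant = "union",
): QueryVariant {
  return override ?? DEFAULT_VARIANTS.get(endpoint) ?? fallback;
}
```

With this shape, `chooseVariant(endpoint)` picks the default for a known server, and `chooseVariant(endpoint, "union")` forces a variant regardless of the server. Which variant makes the better fallback for unknown endpoints is an open question.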

@nleanba

This comment was marked as duplicate.

@nleanba
Collaborator Author

nleanba commented Oct 31, 2024

Endpoint compatibility:

Compare e.g. for Tyrannosaurus rex:

  • Plazi Endpoint: Found 42 names with 162 treatments. This took 28.069 seconds.
  • Lindas: Found 42 names with 162 treatments. This took 173.008 seconds.

Update after 9663f68:

Times are now down to ~25s (plazi endpoint) vs ~31s (lindas) in my very unscientific testing.

I think this is good enough for now.

The missing CoL taxa on lindas are still a problem. However, regardless of the CoL taxa, it should never be the case that the current synolib/synospecies finds something that the new synolib does not.

@nleanba
Collaborator Author

nleanba commented Nov 8, 2024

  • Investigate duplication of results, e.g. for 'Mini' or 'Ficus'
    • For Ficus, this seems to be because it is a homonym appearing in multiple kingdoms; need to fix the missing kingdoms on lindas.
      1. add kingdom field to Name
      2. allow for missing kingdom and treat it as a separate one
      3. therefore, handle it similarly to species etc.: either both exist and are equal, or both are missing
  • Deduplicate Authorities/Collapse authorized names where appropriate:
    • i.e. check if a name is a subset of another (et al. and first names), or a reordering
    • combine into one authorizedName with multiple uris
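
The subset/reordering check for authorities could be sketched like this. It is a heuristic under stated assumptions, not SynoLib code: the function names are hypothetical, authorities are reduced to lower-cased word tokens, and "et al.", "and" and single-letter initials are dropped before comparing.

```typescript
// Reduce an authority string to a set of lower-cased tokens (surnames, years).
// Splits on anything that is not a digit or a basic/accented Latin letter.
function authorityTokens(authority: string): Set<string> {
  const stop = new Set(["et", "al", "and"]);
  const tokens = authority
    .toLowerCase()
    .split(/[^a-z0-9\u00c0-\u024f]+/)
    .filter((t) => t.length > 1 && !stop.has(t)); // drops initials and connectives
  return new Set(tokens);
}

function isSubset(a: Set<string>, b: Set<string>): boolean {
  return Array.from(a).every((t) => b.has(t));
}

// True if one authority's tokens are contained in the other's. Equal token
// sets cover reorderings; proper subsets cover "et al." and dropped first names.
function authoritiesCompatible(x: string, y: string): boolean {
  const a = authorityTokens(x);
  const b = authorityTokens(y);
  if (a.size === 0 || b.size === 0) return false;
  return isSubset(a, b) || isSubset(b, a);
}
```

Note that a plain subset test also treats "Smith, 2001" and "Smith & Jones, 2001" as compatible, which may or may not be wanted; the "where appropriate" part still needs a decision.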

@nleanba
Collaborator Author

nleanba commented Nov 8, 2024

  • Investigate the number of requests made; it seems like a lot. Maybe hold off on requests where, heuristically, a request for the same Name is already in flight
    • Consider not loading treatment details for cite-only treatments, or hold off until the user requests it (i.e. the UI only shows "Cited by ## treatments" and an expand button)
    • Can we collapse the figure query into the other?
  • Check for inconsistencies between getNameFrom_-queries
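
One way to avoid duplicate requests for the same Name is to share the pending promise between callers. A minimal sketch, assuming requests can be identified by a string key; the `InFlight` class and its `load` callback are hypothetical and not part of SynoLib.

```typescript
// Shares one pending promise among all callers that ask for the same key
// while the first request is still in flight.
class InFlight<T> {
  private pending = new Map<string, Promise<T>>();
  private load: (key: string) => Promise<T>;

  constructor(load: (key: string) => Promise<T>) {
    this.load = load;
  }

  get(key: string): Promise<T> {
    const existing = this.pending.get(key);
    if (existing) return existing; // reuse the request already under way

    // Forget the entry once the request settles, whether it succeeded or failed.
    const p = this.load(key).then(
      (value) => {
        this.pending.delete(key);
        return value;
      },
      (error) => {
        this.pending.delete(key);
        throw error;
      },
    );
    this.pending.set(key, p);
    return p;
  }
}
```

This only collapses concurrent requests; once a request has settled, the next call for the same key loads again. Keeping results around longer would be a separate caching decision.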

@nleanba
Collaborator Author

nleanba commented Dec 18, 2024

  • Material Citations can have multiple GBIF Occurrence IDs, deduplicate this
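
One possible reading of this item: if the query returns one row per (material citation, occurrence ID) pair, the rows can be grouped by citation so that each citation appears once with a list of IDs. The sketch below assumes that row shape; `Row`, `MaterialCitation` and `groupCitations` are hypothetical names.

```typescript
// Hypothetical shapes: one query row per (citation, occurrence ID) pair.
interface Row {
  citationUri: string;
  gbifOccurrenceId?: string;
}

interface MaterialCitation {
  uri: string;
  gbifOccurrenceIds: string[];
}

// Collapse rows into one entry per citation URI, collecting distinct IDs
// in the order they first appear.
function groupCitations(rows: Row[]): MaterialCitation[] {
  const byUri = new Map<string, MaterialCitation>();
  for (const row of rows) {
    let mc = byUri.get(row.citationUri);
    if (!mc) {
      mc = { uri: row.citationUri, gbifOccurrenceIds: [] };
      byUri.set(row.citationUri, mc);
    }
    const id = row.gbifOccurrenceId;
    if (id !== undefined && mc.gbifOccurrenceIds.indexOf(id) === -1) {
      mc.gbifOccurrenceIds.push(id);
    }
  }
  return Array.from(byUri.values());
}
```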

Successfully merging this pull request may close these issues.

Integrate CoL-Data Remove need for queries in SynoSpecies
1 participant