Misinterpretation of country - GBIF datasets #45

Open
ManonGros opened this issue Oct 1, 2020 · 10 comments
Labels
fix request: A fix requested for a specific paper or treatment

Comments

@ManonGros

Hi, I am trying to sort through some of the issues in the GBIF feedback repository, and I thought I would forward the following ones to you:

Let me know if you need anything!

@mguidoti
Collaborator

mguidoti commented Oct 1, 2020

Hi @ManonGros,

Thanks for this list. I'll check with @myrmoteras how he wants to handle them, and go from there.

But just so I understand: I think gbif/portal-feedback#2420 and gbif/portal-feedback#2380 are related, as the former cites the latter?

All the best,

@ManonGros
Author

Yes, I think so (but I am not sure).

@ManonGros
Author

It seems like this new issue could also be related: gbif/portal-feedback#3030

@mguidoti
Collaborator

mguidoti commented Oct 2, 2020

I'm checking with @myrmoteras and Guido whether I'll act on these issues, or Guido will do it.

Thanks for reporting them!

@gsautter

gsautter commented Oct 2, 2020

It's a similar problem to gbif/portal-feedback#2380, yes ... with "Wales" as in "New South Wales" being the culprit this time ... that, together with an erroneous materials citation split between "N.S." and "Wales" (in what is supposed to read "N. S. Wales"), elevates "Wales" to a standalone country level, which ISO normalization then maps to the UK.

Geographical homonyms (if infix-based ones in this instance) ... we're doing a lot more to sort these out than we did back in 2016, when this article was processed, but these legacy errors remain.
The best way of going about this might well be using the stats to get all the materials citations containing "Island" (the German name of Iceland, and thus all the trouble), "Wales", and "Georgia" (plain top-level ambiguity in English) to get an overview, and then devise a viable cleanup strategy ... I manually corrected a lot of the "Georgia" mistakes earlier this year, but the other two might affect far more treatments, so some kind of automated first pass might help to at least reduce the cleanup effort.
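
The overview step described above could be sketched roughly as follows. This is only an illustration, not the actual TB Stats query: the record layout (`id`, `text`) and the function name `group_by_ambiguous_term` are hypothetical, and the sample citations are made up for the example.

```python
import re
from collections import defaultdict

# Terms named in this thread as ambiguous. The record layout used below
# ({"id": ..., "text": ...}) is a hypothetical stand-in for stats output.
AMBIGUOUS_TERMS = ("Island", "Wales", "Georgia")


def group_by_ambiguous_term(citations):
    """Map each ambiguous term to the IDs of citations containing it as a whole word."""
    patterns = {
        term: re.compile(rf"\b{re.escape(term)}\b") for term in AMBIGUOUS_TERMS
    }
    hits = defaultdict(list)
    for citation in citations:
        for term, pattern in patterns.items():
            if pattern.search(citation["text"]):
                hits[term].append(citation["id"])
    return dict(hits)


# Made-up sample citations, for illustration only.
sample = [
    {"id": "mc1", "text": "Australia, N. S. Wales, Sydney, 12.iii.1901"},
    {"id": "mc2", "text": "Island, Reykjavik, 1950"},
    {"id": "mc3", "text": "USA, Georgia, Athens"},
    {"id": "mc4", "text": "Brazil, Rio de Janeiro"},
]
overview = group_by_ambiguous_term(sample)
```

Whole-word matching keeps "Islands" from being counted as "Island"; the per-term ID lists give a rough sense of how many citations each homonym touches before any cleanup strategy is chosen.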

@gsautter

gsautter commented Oct 2, 2020

That said, maybe we should add a "problematic/ambiguous country name" check in the QC?
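
A check of this kind might look something like the sketch below. It is not the existing QC code: the name list, the function `check_country`, and the warning wording are all assumptions made for illustration.

```python
# Hypothetical list of country values that warrant a second look; the three
# entries are the ones mentioned in this thread.
AMBIGUOUS_COUNTRY_NAMES = {"island", "wales", "georgia"}


def check_country(country):
    """Return a warning string if the country value is ambiguous, else None."""
    if country is None:
        return None
    name = country.strip()
    if name.lower() in AMBIGUOUS_COUNTRY_NAMES:
        return f"ambiguous country name: {name!r}; verify against the source text"
    return None
```

For example, `check_country("Wales")` returns a warning, while `check_country("Australia")` returns `None`. The check only flags a value for review; it does not try to decide which country is meant.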

@mguidoti
Collaborator

mguidoti commented Oct 2, 2020

Hi @gsautter ,

I think an additional QC rule to cover this might be very interesting. Better false positives than data issues, in my opinion.

Let us cover this, then. I'll query TB Stats and divide the task among people. Sooner rather than later we will have this covered.

What do you think?

@gsautter

gsautter commented Oct 2, 2020

Let's get the stats and analyze this first ... maybe we can devise some automated filter to deal with the lion's share of the obvious cases ... no need to have the office do more tedious and repetitive work than necessary ...

@mguidoti
Collaborator

mguidoti commented Oct 4, 2020

@gsautter,

Found 175 articles that might contain the Wales/New South Wales issue... but you might be able to filter them a bit further using your own API.

Please, tell me what you think.

Thanks!

@gsautter

gsautter commented Oct 4, 2020

Well, I don't really have an "own API" or anything ... just using the stats like anyone else ... but sure, there might be ways of filtering further, if somewhat downstream, using the whole treatments, e.g. looking for other UK constituents like "Scotland" and "England", or for the presence or absence of Australia or other Australian territories ... we'll see, the data will tell.
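
The context-based filtering idea above can be sketched as below, assuming the full treatment text is available as a plain string. The hint lists, the `classify_wales` function, and the three result labels are hypothetical and far from exhaustive; a real filter would be tuned against the data.

```python
import re

# Hypothetical hint lists; both are illustrative, not exhaustive.
UK_HINTS = ("England", "Scotland", "Northern Ireland", "United Kingdom")
AUSTRALIA_HINTS = ("Australia", "New South Wales", "Queensland", "Tasmania")
# Abbreviated forms such as "N. S. Wales" or "N.S. Wales".
NSW_ABBREVIATION = re.compile(r"\bN\.\s*S\.\s*Wales\b")


def _mentions_any(text, hints):
    """True if any hint occurs in the text as a whole word or phrase."""
    return any(re.search(rf"\b{re.escape(hint)}\b", text) for hint in hints)


def classify_wales(treatment_text):
    """Guess what a "Wales" mention refers to, based on the surrounding treatment."""
    australia = (
        _mentions_any(treatment_text, AUSTRALIA_HINTS)
        or NSW_ABBREVIATION.search(treatment_text) is not None
    )
    uk = _mentions_any(treatment_text, UK_HINTS)
    if australia and not uk:
        return "australia"
    if uk and not australia:
        return "uk"
    # No context, or conflicting context: leave for manual review.
    return "review"
```

Treatments with no context or with conflicting context fall through to `"review"`, so the automated pass only settles the obvious cases and leaves the rest for a person to check.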

@flsimoes added the "fix request" label on Jun 29, 2021