290 Plazi datasets that are very likely to have classification issues ACC-ACC species (different authors) #380

Open
camiplata opened this issue Jan 29, 2025 · 15 comments
Assignees
Labels
fix request A fix requested for a specific paper or treatment

Comments

@camiplata

Hi Plazi team, this issue is similar to #362.

While reviewing duplicate data on the extended release, I found that Plazi was involved in at least 616 duplicate pairs. I did several checks and found that many are due to:

  • A completely wrong classification: all names placed under a classification different from the one in the original paper
  • A partially wrong classification: one or two names under the wrong kingdom, order, class, etc.
  • Incomplete classification or missing ranks
  • Duplicate names within the checklist
  • Inconsistencies in authorship: missing or doubled authorships

Here you can find those names, which are represented in 290 datasets. I recommend reviewing each dataset as a whole, since in most cases the problems affect multiple names, not only the ones listed.

These datasets are very likely to have a problem, though a couple of them may turn out to be fine.
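
A minimal sketch of how the flagged names could be grouped so that each dataset is reviewed as a whole, as recommended above. The column names `datasetKey`, `scientificName` and `authorship`, the sample rows, and both function names are assumptions for illustration; the actual layout of the shared CSV is not described in this thread.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real CSV's columns are not given in this thread.
SAMPLE = """datasetKey,scientificName,authorship
1001,Aus bus,Smith 1900
1001,Aus cus,Jones 1950
1002,Xus yus,Lee 2001
"""


def names_per_dataset(csv_text):
    """Map each dataset key to the list of flagged names it contains."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["datasetKey"]].append(row["scientificName"])
    return dict(grouped)


def review_order(grouped):
    """Return dataset keys sorted by number of flagged names, most first."""
    return sorted(grouped, key=lambda key: (-len(grouped[key]), key))


if __name__ == "__main__":
    grouped = names_per_dataset(SAMPLE)
    for key in review_order(grouped):
        print(key, len(grouped[key]), grouped[key])
```

Sorting by the number of flagged names is only one possible way to prioritise the review; a dataset with a single flagged name may still have a wrong classification throughout.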

Thank you as always for these "clean ups", these have a very positive impact on Catalogue of Life and all our users.

@flsimoes self-assigned this Jan 29, 2025
@flsimoes added the "fix request" label (A fix requested for a specific paper or treatment) Jan 29, 2025
@flsimoes

@camiplata Is there a spreadsheet for this?

@camiplata
Author

Oops, I forgot to upload it:

PLAZI-NAMES-ACC-ACC-SP-diffauthors-XR2025-01-18.csv

@flsimoes

Thanks!

@flsimoes

@camiplata
Author

I'll check!

@flsimoes

@camiplata most of these have now been fixed

The following file contains datasets where we didn't find any clear errors, so we'll need further feedback from you as to what exactly needs to be fixed.
CamilaFeedback.xlsx

Additionally, these 4 datasets are not done yet

https://www.checklistbank.org/dataset/300248/classification
https://www.checklistbank.org/dataset/34549/classification
https://www.checklistbank.org/dataset/35256/classification
https://www.checklistbank.org/dataset/7639/classification

@camiplata
Author

That is great, Felipe. I'll have a look at the shared file and come back with comments, if any.

@camiplata
Author

camiplata commented Feb 21, 2025

Hi Felipe, some had no problems and others you had already partly fixed. I will list the remaining ones here; they are mainly issues due to duplicate names within the same dataset.

Thank you as always!!!

@flsimoes

@camiplata

Regarding

  1. A31C8F6DFFDCFFCE4C394B044D497F0B
    https://www.checklistbank.org/dataset/217510/duplicates?limit=50&offset=0
    There are duplicates due to the existence of "species-group" treatments. So these are correct, though I'm not sure what to do to clear the data for you.
  2. AAF5FF0F61E032BFAD683FA15F94AF54
    https://www.checklistbank.org/dataset/304774/duplicates?limit=50&offset=0
    These taxa appear in multiple treatments and are indeed correct

@flsimoes

flsimoes commented Feb 26, 2025

@camiplata
Regarding AF21FF86FFCCFFC78D09FFF5FFBABA77, https://www.checklistbank.org/dataset/35256/classification?taxonKey=x3T
I had to fix a lot of things, so I decided to re-mark the taxa.
Let's wait and check once that gets updated in ClB.

@camiplata
Author

Hi Felipe, thanks for your help. For 1 and 2, I think we can handle this kind of duplicate in the merge, as they are the same. The errors I'm getting with those probably need to be fixed more on our side than on yours. I just checked and there are other sources involved, so I'll have to make some editorial decisions.

So I believe we are almost done with this issue package!!!!!

@flsimoes

flsimoes commented Mar 3, 2025

@camiplata
Can you check whether these 2 need fixing? And if so, what needs to be fixed?
30677137FFF670544970FFBCFFB2FFB8
3352135F4439CD52FFFF6C4F1707B424

@camiplata
Author

Hi!

On the first one there are inconsistent duplicates:

The first duplicate pair has the same name and authorship, but one record is accepted and the other is a synonym.
The second duplicate pair is the opposite: the same name with two different authorships, and both names are accepted, which is unlikely.

https://www.checklistbank.org/dataset/286008/duplicates?limit=50&offset=0

The second dataset looks good, I don't see any issue
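
The two inconsistent-duplicate patterns described above for the first dataset can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout `(name, authorship, status)`, the sample names, and the function `find_inconsistent_duplicates` are hypothetical, and this is not how ChecklistBank's duplicates view is implemented.

```python
from collections import defaultdict

# Hypothetical records: (scientific name, authorship, taxonomic status).
RECORDS = [
    ("Aus bus", "Smith, 1900", "accepted"),
    ("Aus bus", "Smith, 1900", "synonym"),
    ("Cus dus", "Lee, 2001", "accepted"),
    ("Cus dus", "Kim, 2010", "accepted"),
    ("Eus fus", "Roe, 1999", "accepted"),
]


def find_inconsistent_duplicates(records):
    """Flag names that match either of the two patterns from the comment."""
    by_name = defaultdict(list)
    for name, authorship, status in records:
        by_name[name].append((authorship, status))

    issues = []
    for name, entries in sorted(by_name.items()):
        if len(entries) < 2:
            continue  # not a duplicate at all

        # Pattern 1: same name and authorship, but different statuses.
        statuses_by_author = defaultdict(set)
        for authorship, status in entries:
            statuses_by_author[authorship].add(status)
        if any(len(statuses) > 1 for statuses in statuses_by_author.values()):
            issues.append((name, "same authorship, mixed status"))

        # Pattern 2: same name accepted under more than one authorship.
        accepted_authors = {a for a, s in entries if s == "accepted"}
        if len(accepted_authors) > 1:
            issues.append((name, "different authorships, all accepted"))
    return issues


if __name__ == "__main__":
    for name, problem in find_inconsistent_duplicates(RECORDS):
        print(f"{name}: {problem}")
```

The comparison here is an exact string match on authorship; real data would likely need normalisation (punctuation, abbreviations, year formats) before two authorships can be treated as the same or different.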
