Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include column for Wikidata identifier, suggestion #53

Open
ppKrauss opened this issue Sep 27, 2017 · 8 comments
Open

Include column for Wikidata identifier, suggestion #53

ppKrauss opened this issue Sep 27, 2017 · 8 comments

Comments

@ppKrauss
Copy link
Contributor

ppKrauss commented Sep 27, 2017

Wikipedia have stable pages for all countries, and Wikidata supply an ID for it. Today Wikidata IDs are playing important role as "concept identifier", for Web Semantic in general and for open projects like OpenStreetMaps, etc.

Example: BR is https://www.wikidata.org/wiki/Q155 , so the column wd_id of line BR is Q155. With Wikidata API we can fill automatically the wd_id column.

@ewheeler
Copy link
Contributor

Good suggestion @ppKrauss

Is there any listing of these stable country pages on wikidata? I've not found a listing/category for these or a way to crawl/fetch them all programmatically

@ppKrauss
Copy link
Contributor Author

Hi @ewheeler, thanks (!), I will check best strategy next week. There are two ways,

  1. Use a list of countries at Wikipedia as source, parsing it by a little adaptation in this wikitext2CSV script. Audit advantages: is human readable and audited by English-Wikipedia community.

  2. Use SparQL and trust only in Wikidata, looking for all instances of Q6256... Or use some trusted DBpedia (as Wikidata curators) algorithm to get it.

The item 2 is the ideal solution and generates an automatic CSV.

@ppKrauss
Copy link
Contributor Author

ppKrauss commented Oct 24, 2017

Testing solution of item 2,

SELECT ?item ?itemLabel 
WHERE {
  ?item wdt:P31 wd:Q6256.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

run this query here and download as CSV to check JOIN.


Perhaps better! A CSV with only Wikidata-ID and 2-letter-country-code columns:

SELECT * 
WHERE {
  ?item wdt:P297 ?code
} ORDER BY ?code

here.

@ppKrauss
Copy link
Contributor Author

Migration problem

Hi @ewheeler , can you help to check cause of errors at https://github.com/ppKrauss/country-codes ?
The dataset is good, but terminal goodtables datapackage.json say that no.

Wikidata minor problem

I am using SQL to check and JOIN... The JOIN is:

  SELECT  c.*, w.item as "wdId" 
  FROM dataset.vw_country_codes c LEFT JOIN wikidata_country w 
    ON w.code=c.iso3166_1_alpha_2 AND c.iso3166_1_alpha_2 IS NOT NULL 
    AND w.item NOT IN ('Q165783', 'Q2895', 'Q1249802', 'Q29999', 'Q407199', 'Q838261')

The wdId nulls are for Namibia and Sark only.

item code action
Q165783 BQ delete
Q27561 BQ preserve
Q2895 BY delete
Q184 BY preserve
Q1249802 FK delete
Q9648 FK preserve
Q29999 NL delete
Q55 NL preserve
Q407199 PS delete
Q219060 PS preserve
Q838261 YU delete
Q83286 YU preserve

The duplicated pairs are about Wikidata's records on "grouping nations" as "Kingdom of the Netherlands" in the NL pair.

ppKrauss added a commit to ppKrauss/country-codes that referenced this issue Aug 3, 2018
@ppKrauss
Copy link
Contributor Author

ppKrauss commented Aug 3, 2018

Hi @ewheeler, sorry for coming back so late ... Now the problems are solved, all be automatic.

Submiting pull request 65 to add sh wd_countries.sh in your makefile.

Supposing that you prefer to adapt your Python scripts to the join, a new column wd_id. You can join the tables on iso2_code=ISO3166-1-Alpha-2.

Only Sark is not there, because have no iso2_code, but you can add as Q3405693.

Wikidata have persistent IDs (it's safe!), so the rule of the thumb is to preserve the older Wikidata ID (wd_id) of a country when somebody try to duplicate it editing Wikidata. For "future new nations" the rule is to check Wikidata Item at the stable English Wikipedia page. The "manual filter" is the grep line at wd_countries.sh, and is cumulative.

@valerio-bozzolan
Copy link

What is the blockage at the moment? Is any help needed on this? :) Thank you so much!

@rufuspollock
Copy link
Member

@valerio-bozzolan PR is welcome to add this.

@anuveyatsu
Copy link
Member

Cool stuff - I'm only seeing this now 👍🏼 we have this old PR that we should merge #65

This was referenced Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants