Multilingual ontology from Wikidata

This package is linked to the paper Multilingual enrichment of disease biomedical ontologies to appear at the 2nd workshop on MultilingualBIO: Multilingual Biomedical Text Processing.

Wikidata is a free and open knowledge base that contains entities (e.g. Q2013, the entity for Wikidata itself). For each entity, Wikidata provides labels in different languages and a list of properties. Among these properties, some identify the entity in external data sources, for example ontologies (P699, which refers to the Disease Ontology) or databases (P628, which refers to the food additives approved in Europe).

When working with such external sources, one can face resources that are monolingual or unavailable in some languages. The idea here is to exploit the multilinguality of Wikidata's labels to extract translations of the external source's entities from Wikidata. The method can also be used to extract synonyms of the entities in some languages.

This package offers multiple approaches to perform this extraction, described in the sections below.

WARNING: Extensive use of this package might lead to a 24-hour IP ban from the Wikidata SPARQL server.

Set up the User-Agent

The first step when using this package is to set up a user agent that complies with the Wikimedia User-Agent policy.

This is done with the following commands.

from wikidata_property_extraction import header

header.initialize_user_agent('Example User Agent')

Please follow the rules of the Wikimedia User-Agent policy when choosing your user agent.
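
For instance, the policy asks for a descriptive client string with a version and a way to contact you. Below is a sketch with placeholder values only; the project name, URL and e-mail address are not required by the package:

from wikidata_property_extraction import header

# Placeholder values: replace the project name, URL and e-mail address with your own.
header.initialize_user_agent('MyProject/1.0 (https://example.org/myproject; myproject@example.org)')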

First-order Wikidata extraction

Full ontology extraction

The easiest way is to extract all the entities in Wikidata that are linked to a specific property. For example, to extract the labels and alternative labels in Spanish for all the entities linked to Orphanet (property P1550), the command is:

from wikidata_property_extraction import translation
translate = translation.Translator('P1550', ['es'])
result_df = translate.translate()

result_df is a pandas DataFrame with the columns ['entity', 'value_property', 'labelEs', 'altEs'], where 'entity' is the Wikidata ID of the entity, 'value_property' is its Orphanet ID, 'labelEs' is the main label in Spanish, and 'altEs' contains the alternative labels in Spanish separated by '|'. 'labelEs' can also contain several labels separated by '|' when the external-source ID is linked to more than one element in Wikidata.

Thus, result_df may have fewer rows than the number of Wikidata entities carrying the property: only the entities that have at least one label in one of the requested languages are returned.

Example of the first row of the resulting DataFrame:

   entity                                 value_property  labelEs  altEs
0  https://www.wikidata.org/wiki/Q51993   319218          ébola    fiebre hemorrágica del Ébola|EVE|[...]
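
Since 'labelEs' and 'altEs' can pack several labels into one cell, a small pandas post-processing step is often useful. This is only a sketch based on the column layout shown above; the 'altEs_list' column name is purely illustrative:

# Split the '|'-separated alternative labels into Python lists
# (missing values are replaced by an empty string first).
result_df['altEs_list'] = result_df['altEs'].fillna('').str.split('|')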

Extraction on a list of IDs

As it is sometimes unnecessary to translate all the entities of an ontology, the translation can also take as input a list of IDs, together with a property and a list of languages. The query then only covers the requested IDs, which is faster for large ontologies.

The command is:

from wikidata_property_extraction import translation
translate = translation.Translator('P2586', ['pl', 'fr'])
result_df = translate.translate(id_list=['01', '02', '03', '04'])

This request queries the information of the French departments with IDs 01, 02, 03 and 04. result_df is a pandas DataFrame with the columns ['entity', 'value_property', 'labelPl', 'altPl', 'labelFr', 'altFr'].

   entity                                 value_property  labelPl  altPl  labelFr  altFr
0  http://www.wikidata.org/entity/Q3083   01              Ain             Ain      01|département de l'Ain
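
Because only entities with at least one label in a requested language are returned, it can be useful to check which requested IDs are missing from the result. A minimal sketch, assuming the 'value_property' column layout shown above:

requested_ids = ['01', '02', '03', '04']
# IDs that were requested but do not appear in the result
missing_ids = sorted(set(requested_ids) - set(result_df['value_property']))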

Second-order extraction

Another feature is the possibility to extract elements through second-order relations, in other words going through intermediate ontologies to extend even further the set of elements extracted from Wikidata.

Example:

In this example, we want to extend the entity 1551 of Orphanet. This entity is not directly linked to anything in Wikidata, but it is linked to the entity 121270 of OMIM, which is itself linked to the Wikidata entity Q1495005.

In the module second_order, the class SecondOrder handles this. It takes as input:

  • main_property_id: the ID of the main property in Wikidata (in the example: 'P699')
  • links_df: a DataFrame with the links between the main ontology and the external ones; it has three columns with the following names:
    • value_property: the value of the main property (in the example: 1551)
    • id_auxiliary: the value of the external property (in the example: 121270)
    • name_auxiliary: the name of the external ontology (in the example: OMIM)
  • dict_properties: the dictionary of properties, in the format 'Wikidata_property': 'external_ontology_name'. Example: {'P492': 'OMIM'}
  • languages_list: the list of language codes, for example ['cs', 'pl'] for Czech and Polish.
  • limit: the maximum number of elements in one query.
  • all_elem: a flag indicating whether the translation covers all the entities of the main property or only the elements present in links_df.
  • url: the URL of the Wikidata SPARQL endpoint.

In the code:

import pandas as pd

from wikidata_property_extraction import second_order

links_df = pd.DataFrame([['1551', '121270', 'OMIM']], columns=['value_property', 'id_auxiliary', 'name_auxiliary'])  # The column names do not matter here, only their order does.
dict_properties = {'P492': 'OMIM'}

translator = second_order.SecondOrder('P699', links_df, dict_properties, ['cs', 'fr'], all_elem=False)
results_df = translator.translate()

The results_df dataframe is:

   entity                                    value_property  labelCs   altCs  labelFr            altFr  id_auxiliary  name_auxiliary  source_degree
0  http://www.wikidata.org/entity/Q1495005   1551            Q1495005         carence en cuivre         121270        OMIM            Second

labelCs contains the ID of the Wikidata entity because this entity has no label in Czech.
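
If only actual translations are wanted, rows whose label fell back to the Wikidata ID can be filtered out. This is a sketch based on the fallback behaviour described above (labels such as 'Q1495005'), not an API of the package:

# Keep only rows whose Czech label is not a bare Wikidata ID (e.g. 'Q1495005').
has_czech_label = ~results_df['labelCs'].str.fullmatch(r'Q\d+', na=False)
translated_df = results_df[has_czech_label]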

Citation

If you use this package, please cite this paper:

@inproceedings{bouscarrat:hal-02531140,
  TITLE = {{Multilingual enrichment of disease biomedical ontologies}},
  AUTHOR = {Bouscarrat, L{\'e}o and Bonnefoy, Antoine and Capponi, C{\'e}cile and Ramisch, Carlos},
  URL = {https://hal.archives-ouvertes.fr/hal-02531140},
  BOOKTITLE = {{2nd workshop on MultilingualBIO: Multilingual Biomedical Text Processing}},
  ADDRESS = {Marseille, France},
  YEAR = {2020},
  MONTH = May,
  KEYWORDS = {wikidata ; ontology ; translation ; biomedical},
  PDF = {https://hal.archives-ouvertes.fr/hal-02531140/file/enrichment_ontology%20%281%29.pdf},
  HAL_ID = {hal-02531140},
  HAL_VERSION = {v1},
}
