SeMi (SEmantic Modeling machIne) is a tool to semi-automatically build large-scale Knowledge Graphs from structured sources such as CSV, JSON, and XML files. To achieve this goal, SeMi builds the semantic models of the data sources, in terms of concepts and relations within a domain ontology. Most research contributions on automatic semantic modeling are focused on the detection of the semantic types of source attributes. However, the inference of the correct semantic relations between these attributes is critical to reconstruct the precise meaning of the data. SeMi covers the entire process of semantic modeling:
- it provides a semi-automatic step to detect semantic types;
- it exploits a novel approach to infer semantic relations, based on a graph neural network trained on background linked data.
Semantic models can be formalized as graphs, where leaf nodes represent the attributes of the data source and the other nodes and relationships are defined by the ontology.
Consider the following JSON file in the public procurement domain:
{
  "contract_id": "Z4ADEA9DE4",
  "contract_object": "Excavations",
  "proponent_struct": {
    "business_id": "80004990927",
    "business_name": "municipality01"
  },
  "participants": [
    {
      "business_id": "08106710158",
      "business_name": "company01"
    }
  ]
}
Consider also the following domain ontology related to public procurement:
The resulting semantic model is:
Before installing SeMi, you need to check the following requirements.
To download SeMi, you can run the commands available here.
To install SeMi, you can use the following instructions.
Using the following scripts, you can generate a semantic model starting from a target source and a domain ontology.
Semantic types (or semantic labels) consist of a combination of an ontology class and an ontology data property. To perform the semantic type detection process, you need to execute two different scripts. The first script is the following:
$ node run/semantic_label_indexer.js pc data/pc/input/
- `pc` is the Elasticsearch index name.
- `data/pc/input/` is the input folder containing the files that have to be indexed.
This step is necessary to create the Elasticsearch index used as a reference to detect the semantic types. The second script is the following:
$ node run/semantic_label.js pc data/pc/input/Z4ADEA9DE4.json data/pc/semantic_types/Z4ADEA9DE4_st_auto.json
- `pc` is the Elasticsearch index name.
- `data/pc/input/Z4ADEA9DE4.json` is the input file.
- `data/pc/semantic_types/Z4ADEA9DE4_st_auto.json` is the automatically generated semantic type file.
In SeMi, we consider semantic type detection as a semi-automatic task. For this reason, a manually refined version of the semantic types is available in the file `data/pc/semantic_types/Z4ADEA9DE4_st.json`.
Below is an image that represents the semantic types.
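To give an intuition of how this lookup works, here is a minimal sketch of semantic labeling via full-text search, assuming a local Elasticsearch instance: the values of attributes from already-modeled sources are indexed together with their semantic type, and the values of an unlabeled attribute are used as a query to retrieve the most similar indexed attribute. The document schema and field names below are illustrative assumptions, not SeMi's actual index layout.

```python
# Illustrative sketch of semantic labeling with Elasticsearch
# (document schema and field names are assumptions, not SeMi's actual index).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index the values of an attribute from an already-modeled source, tagged
# with its semantic type (ontology class + data property).
es.index(index="pc", document={
    "values": "80004990927 03382820920",
    "class": "gr:BusinessEntity",
    "property": "dcterms:identifier",
})

# To label a new attribute, search with its values: the top hit suggests
# the most plausible semantic type for that attribute.
res = es.search(index="pc", query={"match": {"values": "08106710158"}})
top = res["hits"]["hits"][0]["_source"]
print(top["class"], top["property"])
```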
The Multi-edge and Weighted Graph (MEWG) includes all plausible semantic models of a data source based on a domain ontology. To create such a graph, you can run the following command:
$ node run/graph.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/ontology/ontology.ttl rdfs:domain rdfs:range owl:Class data/pc/semantic_models/Z4ADEA9DE4
- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the input semantic type file.
- `data/pc/ontology/ontology.ttl` is the domain ontology file.
- `rdfs:domain` is the domain property in the ontology.
- `rdfs:range` is the range property in the ontology.
- `owl:Class` is the property in the ontology used to identify classes.
- `data/pc/semantic_models/Z4ADEA9DE4` is used as the output path for the generation of the graph in different formats.
This script generates the graph in two formats:
- `data/pc/semantic_models/Z4ADEA9DE4.graph` is the multi-edge and weighted graph.
- `data/pc/semantic_models/Z4ADEA9DE4_graph.json` is a beautified representation of the weighted graph.
Below is an image that represents the MEWG:
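To give a rough idea of how such a graph can be derived from a domain ontology, the sketch below adds a candidate edge between two classes for every object property that declares them as its rdfs:domain and rdfs:range. This only conveys the general intuition; SeMi's actual construction and edge weighting may differ, and the uniform weight below is a placeholder.

```python
# Sketch: derive candidate class-to-class edges from an ontology
# (general intuition only; SeMi's construction and weighting may differ).
import networkx as nx
from rdflib import Graph, RDFS

onto = Graph()
onto.parse("data/pc/ontology/ontology.ttl", format="turtle")

mewg = nx.MultiDiGraph()  # multiple edges allowed between the same classes
for prop, _, domain in onto.triples((None, RDFS.domain, None)):
    for _, _, rng in onto.triples((prop, RDFS.range, None)):
        # Each (domain, range) pair of a property is a plausible relation;
        # the uniform weight of 1 is a placeholder for SeMi's weighting.
        mewg.add_edge(domain, rng, key=prop, weight=1)

print(mewg.number_of_edges(), "candidate edges")
```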
To create the Steiner tree on the MEWG, you can run the following command:
$ node run/steiner_tree.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_graph.json data/pc/semantic_models/Z4ADEA9DE4
- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the semantic type file.
- `data/pc/semantic_models/Z4ADEA9DE4_graph.json` is the beautified representation of the weighted graph.
- `data/pc/semantic_models/Z4ADEA9DE4` is used as the output path for the generation of the Steiner tree in different formats.
This script generates the Steiner tree in two formats:
- `data/pc/semantic_models/Z4ADEA9DE4.steiner` is the Steiner tree.
- `data/pc/semantic_models/Z4ADEA9DE4_steiner.json` is a beautified representation of the Steiner tree.
Below is an image that represents the Steiner tree.
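For intuition about this step: a minimum Steiner tree connects a given set of terminal nodes (here, the ontology classes of the detected semantic types) at minimum total edge weight. The sketch below uses the standard metric-closure approximation shipped with networkx on a made-up toy graph; SeMi's own algorithm and edge weights may differ.

```python
# Sketch: approximate a minimum Steiner tree over a weighted graph
# (standard metric-closure heuristic; the toy graph and weights are made up).
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

g = nx.Graph()
g.add_weighted_edges_from([
    ("pc:Contract", "gr:BusinessEntity", 1),
    ("pc:Contract", "pc:Tender", 2),
    ("pc:Tender", "gr:BusinessEntity", 2),
])

# Terminal nodes: the ontology classes of the detected semantic types.
terminals = ["pc:Contract", "gr:BusinessEntity"]
tree = steiner_tree(g, terminals, weight="weight")
print(sorted(tree.edges()))
```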
For the automatic generation of the semantic model, you can run the following command:
$ node run/jarql.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_steiner.json data/pc/ontology/classes.json data/pc/semantic_models/Z4ADEA9DE4
- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the semantic type file.
- `data/pc/semantic_models/Z4ADEA9DE4_steiner.json` is the beautified representation of the Steiner tree.
- `data/pc/ontology/classes.json` is the list of all classes in the ontology.
- `data/pc/semantic_models/Z4ADEA9DE4.query` is the output JARQL semantic model.
Below is an example of the semantic model serialized using the SPARQL/JARQL syntax:
CONSTRUCT {
  ?Contract0 dcterms:identifier ?cig.
  ?Contract0 rdf:type pc:Contract.
  ?Contract0 rdfs:description ?oggetto.
  ?Contract0 rdf:type pc:Contract.
  ?BusinessEntity0 dcterms:identifier ?strutturaProponente__codiceFiscaleProp.
  ?BusinessEntity0 rdf:type gr:BusinessEntity.
  ?BusinessEntity1 dcterms:identifier ?partecipanti__identificativo.
  ?BusinessEntity1 rdf:type gr:BusinessEntity.
  ?BusinessEntity1 rdfs:label ?partecipanti__ragioneSociale.
  ?BusinessEntity1 rdf:type gr:BusinessEntity.
  ?BusinessEntity1 dcterms:identifier ?aggiudicatari__identificativo.
  ?BusinessEntity1 rdf:type gr:BusinessEntity.
  ?BusinessEntity1 rdfs:label ?aggiudicatari__ragioneSociale.
  ?BusinessEntity1 rdf:type gr:BusinessEntity.
  ?Contract0 pc:contractingAuthority ?BusinessEntity0.
  ?Contract0 pc:contractingAuthority ?BusinessEntity1.
}
WHERE {
  ?root a jarql:Root.
  OPTIONAL { ?root jarql:cig ?cig. }
  OPTIONAL { ?root jarql:oggetto ?oggetto. }
  OPTIONAL { ?root jarql:strutturaProponente ?strutturaProponente. }
  OPTIONAL { ?strutturaProponente jarql:codiceFiscaleProp ?strutturaProponente__codiceFiscaleProp. }
  OPTIONAL { ?root jarql:partecipanti ?partecipanti. }
  OPTIONAL { ?partecipanti jarql:identificativo ?partecipanti__identificativo. }
  OPTIONAL { ?root jarql:partecipanti ?partecipanti. }
  OPTIONAL { ?partecipanti jarql:ragioneSociale ?partecipanti__ragioneSociale. }
  OPTIONAL { ?root jarql:aggiudicatari ?aggiudicatari. }
  OPTIONAL { ?aggiudicatari jarql:identificativo ?aggiudicatari__identificativo. }
  OPTIONAL { ?root jarql:aggiudicatari ?aggiudicatari. }
  OPTIONAL { ?aggiudicatari jarql:ragioneSociale ?aggiudicatari__ragioneSociale. }
  BIND (URI(CONCAT('http://purl.org/procurement/public-contracts/contract/',?cig)) as ?Contract0)
  BIND (URI(CONCAT('http://purl.org/goodrelations/v1/businessentity/',?strutturaProponente__codiceFiscaleProp)) as ?BusinessEntity0)
  BIND (URI(CONCAT('http://purl.org/goodrelations/v1/businessentity/',?partecipanti__identificativo)) as ?BusinessEntity1)
}
In order to create the KG resulting from the initial semantic model, you have to run the JARQL tool with the following command:
$ ./jarql.sh data/pc/input/Z4ADEA9DE4.json data/pc/semantic_models/Z4ADEA9DE4.query > data/pc/output/Z4ADEA9DE4.ttl
- `data/pc/input/Z4ADEA9DE4.json` is the input file.
- `data/pc/semantic_models/Z4ADEA9DE4.query` is the semantic model in the JARQL format.
- `data/pc/output/Z4ADEA9DE4.ttl` is the output RDF file serialized in Turtle.
Below is an example of the generated RDF file:
<http://purl.org/procurement/public-contracts/contract/Z4ADEA9DE4>
  <http://purl.org/dc/terms/identifier>
    "Z4ADEA9DE4"^^<http://www.w3.org/2001/XMLSchema#string> ;
  <http://purl.org/procurement/public-contracts#contractingAuthority>
    <http://purl.org/goodrelations/v1/businessentity/03382820920> ,
    <http://purl.org/goodrelations/v1/businessentity/80004990927> ;
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
    <http://purl.org/procurement/public-contracts#Contract> ;
  <http://www.w3.org/2000/01/rdf-schema#description>
    "C.E. 23 Targa E9688 ( RIP.OFF.PRIVATE ) MANUTENZIONE ORDINARIA MEZZI DI TRASPORTO"^^<http://www.w3.org/2001/XMLSchema#string> .

<http://purl.org/goodrelations/v1/businessentity/03382820920>
  <http://purl.org/dc/terms/identifier>
    "03382820920"^^<http://www.w3.org/2001/XMLSchema#string> ;
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
    <http://purl.org/goodrelations/v1#BusinessEntity> ;
  <http://www.w3.org/2000/01/rdf-schema#label>
    "CAR WASH CARALIS DI PUSCEDDU GRAZIANO C S N C"^^<http://www.w3.org/2001/XMLSchema#string> .

<http://purl.org/goodrelations/v1/businessentity/80004990927>
  <http://purl.org/dc/terms/identifier>
    "80004990927"^^<http://www.w3.org/2001/XMLSchema#string> ;
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
    <http://purl.org/goodrelations/v1#BusinessEntity> .
The approach for generating the initial semantic model has a main limitation: the Steiner tree connects the semantic type classes through the shortest paths in the graph, but the cheapest connection does not necessarily express the correct semantic description of the target source. For this reason, a refinement process is required to identify a more accurate semantic model.
The semantic model refinement requires preparing the training, validation, and test datasets used as input to the deep learning model. This model is a graph neural network whose main goal is to reconstruct linked data edges using the latent representations of entities and properties. The architecture of the graph neural network is an auto-encoder composed of:
- an encoder based on Relational Graph Convolutional Networks (R-GCNs);
- a decoder based on DistMult (see the sketch after this list).
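For reference, DistMult scores a candidate edge (head, relation, tail) as the bilinear product e_h^T diag(w_r) e_t, i.e. the sum of the element-wise product of the three embeddings. A minimal numpy sketch with made-up embedding values:

```python
# DistMult scoring: s(h, r, t) = e_h^T diag(w_r) e_t = sum(e_h * w_r * e_t).
# The embedding values below are made-up toy numbers.
import numpy as np

e_h = np.array([0.2, -0.5, 0.7])   # subject (head) entity embedding
w_r = np.array([0.9, 0.1, -0.3])   # relation (edge type) embedding
e_t = np.array([0.4, 0.6, -0.2])   # object (tail) entity embedding

score = float(np.sum(e_h * w_r * e_t))  # higher = more plausible edge
print(score)
```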
The training, validation, and test datasets are built by splitting a linked data repository (the background knowledge), which is in turn built through the semantic models defined by domain experts on various sources similar to the target source.
In our example, the input sources are available in the `data/pc/input` folder and the ground-truth semantic model is available in the `semi/data/learning_datasets/pc.query` file.
The background linked data is available in the `data/pc/learning_datasets/complete.ttl` file. This background knowledge is then split into the following datasets (a sketch of such a split follows the list):
- the training dataset, available in the `data/pc/learning_datasets/training.ttl` file;
- the validation dataset, available in the `data/pc/learning_datasets/valid.ttl` file;
- the test dataset, available in the `data/pc/learning_datasets/test.ttl` file.
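As an illustration of such a split, the sketch below shuffles the background triples and writes them to the three files, assuming one triple per line and an 80/10/10 ratio; both are assumptions made for illustration, not necessarily SeMi's exact setup.

```python
# Sketch: randomly split background triples into train/valid/test files.
# Assumes one triple per line; the 80/10/10 ratio is illustrative.
import random

with open("data/pc/learning_datasets/complete.ttl") as f:
    triples = [line for line in f if line.strip()]

random.seed(42)
random.shuffle(triples)

n = len(triples)
splits = {
    "training.ttl": triples[: int(0.8 * n)],
    "valid.ttl": triples[int(0.8 * n): int(0.9 * n)],
    "test.ttl": triples[int(0.9 * n):],
}
for name, lines in splits.items():
    with open(f"data/pc/learning_datasets/{name}", "w") as out:
        out.writelines(lines)
```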
For the graph neural network training, you can launch the following script:
$ python src/link_prediction/link_predict.py --directory data/pc/learning_datasets/ --train data/pc/learning_datasets/training.ttl --valid data/pc/learning_datasets/valid.ttl --test data/pc/learning_datasets/test.ttl --score pc --parser PC --gpu 0 --graph-batch-size 1000 --n-hidden 100 --graph-split-size 1
- `--directory data/pc/learning_datasets/` is the directory in which the entity and property dictionaries are stored. This directory also stores the trained model with its related outputs.
- `--train data/pc/learning_datasets/training.ttl` is the file containing the training facts.
- `--valid data/pc/learning_datasets/valid.ttl` is the file containing the validation facts.
- `--test data/pc/learning_datasets/test.ttl` is the file containing the test facts.
- `--score pc` is the subdirectory in which the scores resulting from the training and evaluation process will be stored.
- `--parser PC` drives the construction of the dictionaries of entities and relationships.
- `--gpu 0` establishes how many GPUs (if available) can be used to train the model.
- `--graph-batch-size 1000` indicates the number of edges extracted at each step by the graph sampling process (see the sketch after this list).
- `--n-hidden 100` is a hyperparameter of the model defining the number of neurons (and consequently the dimension of the embeddings) at each network layer.
- `--graph-split-size 1` establishes the portion of edges used as positive examples.
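To make the training objective concrete, here is a compact sketch of the standard link-prediction recipe behind an R-GCN encoder with a DistMult decoder: sample a batch of positive edges, corrupt their tails to obtain negatives, and optimize a binary cross-entropy loss over the scores. This illustrates the general recipe, not the exact code of link_predict.py.

```python
# Sketch: one step of link-prediction training with negative sampling
# (standard R-GCN + DistMult recipe; details differ from link_predict.py).
import torch
import torch.nn.functional as F

def training_step(entity_emb, rel_emb, pos_triples, num_entities):
    # pos_triples: LongTensor of shape (batch, 3) holding (head, rel, tail)
    # ids, e.g. a sampled batch of --graph-batch-size edges.
    heads, rels, tails = pos_triples.T

    # Negative examples: corrupt the tail of each positive triple.
    neg_tails = torch.randint(num_entities, tails.shape)

    def score(h, r, t):  # DistMult score of a batch of triples
        return (entity_emb[h] * rel_emb[r] * entity_emb[t]).sum(dim=-1)

    scores = torch.cat([score(heads, rels, tails),
                        score(heads, rels, neg_tails)])
    labels = torch.cat([torch.ones_like(tails, dtype=torch.float),
                        torch.zeros_like(tails, dtype=torch.float)])
    return F.binary_cross_entropy_with_logits(scores, labels)
```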
The outputs of the training stage are the following:
- `entities.dict`: dictionary that maps ids to entity URIs.
- `relations.dict`: dictionary that maps ids to property URIs.
- `model_state.pth`: the state of the trained model serialized by PyTorch.
- `train.npy`: numpy representation of the training dataset.
- `valid.npy`: numpy representation of the validation dataset.
- `test.npy`: numpy representation of the test dataset.
- `emb_nodes.json`: JSON file with the entity embeddings.
- `emb_rels.json`: JSON file with the object property embeddings.
- `score.json`: fact scores obtained on the test dataset.
The goal of this stage is to refine the edge weights of the MEWG by exploiting the embeddings obtained from the graph neural network training. In this way, we incorporate the information from the background knowledge in order to improve the accuracy of the semantic model (see the sketch below).
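Conceptually, each candidate edge of the MEWG can be re-scored with the trained decoder: look up the embeddings of its source class, property, and target class, compute the DistMult score, and fold the score into the original edge weight. The sketch below illustrates this idea; the JSON layout of the embedding files and the blending rule are assumptions, not SeMi's exact procedure.

```python
# Sketch: re-weight a candidate MEWG edge using the trained embeddings.
# The JSON layout and the blending rule are illustrative assumptions.
import json
import numpy as np

with open("data/pc/learning_datasets/emb_nodes.json") as f:
    nodes = json.load(f)          # assumed layout: {URI: [embedding values]}
with open("data/pc/learning_datasets/emb_rels.json") as f:
    rels = json.load(f)           # assumed layout: {URI: [embedding values]}

def distmult(subject_uri, property_uri, object_uri):
    e_h = np.array(nodes[subject_uri])
    w_r = np.array(rels[property_uri])
    e_t = np.array(nodes[object_uri])
    return float(np.sum(e_h * w_r * e_t))

# Higher plausibility -> lower edge cost for the tree computation
# (the sigmoid-based blend below is a hypothetical rule).
old_weight = 1.0
plausibility = distmult("pc:Contract", "pc:contractingAuthority",
                        "gr:BusinessEntity")
new_weight = old_weight * (1.0 - 1.0 / (1.0 + np.exp(-plausibility)))
print(new_weight)
```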
The first step is to produce the JARQL representation of the MEWG:
$ node run/jarql.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_graph.json data/pc/ontology/classes.json data/pc/semantic_models/Z4ADEA9DE4_plausible
- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the semantic type file.
- `data/pc/semantic_models/Z4ADEA9DE4_graph.json` is the beautified representation of the weighted graph.
- `data/pc/ontology/classes.json` is the list of all classes in the ontology.
- `data/pc/semantic_models/Z4ADEA9DE4_plausible.query` is the output JARQL of the plausible semantic models.
Then, you can run the refinement process with the following command:
$ node run/refinement.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/model_datasets/scores/pc/6000/score.json data/pc/semantic_models/Z4ADEA9DE4_steiner.json data/pc/semantic_models/Z4ADEA9DE4_graph.json data/pc/semantic_models/Z4ADEA9DE4
- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the semantic type file.
- `data/pc/model_datasets/scores/pc/6000/score.json` is the score file generated during the training at epoch 6000.
- `data/pc/semantic_models/Z4ADEA9DE4_steiner.json` is the beautified version of the initial semantic model generated through the Steiner tree algorithm.
- `data/pc/semantic_models/Z4ADEA9DE4_graph.json` is the beautified version of the weighted graph including all plausible semantic models.
This script generates two different outputs:
- `data/pc/semantic_models/Z4ADEA9DE4_refined.graph` is the refined semantic model file.
- `data/pc/semantic_models/Z4ADEA9DE4_refined_graph.json` is the beautified version of the refined semantic model file.
Below is an image that represents the refined semantic model.
For the generation of the refined semantic model serialized in JARQL, you need to run the following command:
$ node run/jarql.js data/pc/semantic_types/Z4ADEA9DE4_st.json data/pc/semantic_models/Z4ADEA9DE4_refined_graph.json data/pc/ontology/classes.json data/pc/semantic_models/Z4ADEA9DE4_refined
- `data/pc/semantic_types/Z4ADEA9DE4_st.json` is the semantic type file.
- `data/pc/semantic_models/Z4ADEA9DE4_refined_graph.json` is the beautified version of the refined semantic model file.
- `data/pc/ontology/classes.json` is the list of all classes in the ontology.
This script generates the following output file:
- `data/pc/semantic_models/Z4ADEA9DE4_refined.query` is the JARQL serialization of the refined semantic model.
In order to create the KG resulting from the refined semantic model, you have to run the JARQL tool with the following command:
$ ./jarql.sh data/pc/input/Z4ADEA9DE4.json data/pc/semantic_models/Z4ADEA9DE4_refined.query > data/pc/output/Z4ADEA9DE4_refined.ttl
- `data/pc/input/Z4ADEA9DE4.json` is the input file.
- `data/pc/semantic_models/Z4ADEA9DE4_refined.query` is the refined semantic model in the JARQL format.
- `data/pc/output/Z4ADEA9DE4_refined.ttl` is the output RDF file serialized in Turtle.