This repository contains basic settings for PLATOON Pipeline; it has been implemented in WP2 as part of the activities of task T2.4. The outcomes of the assessment of this pipeline is reported in deliverables D2.4, D2.8, and D4.11.
scripts
- contains scripts used for transforming sources to RDF and loading it to triple store (Virtuoso) -virtuoso-script.sh
- used to remotely connect and load data usingisql-v
tool of virtuoso on command line -load_to_virtuos.py
- used to load the transformed RDF data to virtuoso using thevirtuoso-script.sh
script -transform_and_load.py
- performs both transforming raw data to RDF and loading it virtuoso using thevirtuoso-script.sh
scriptconfiguration_files
- contains configuration files for the execution of the pipeline -config_rdfizer.ini
- configuration file for materializing the Knowledge Graph using SDM-RDFizer. -endpoints.json
- configuration file for generating the DeTrusty configuration file.docker-compose.yml
- docker compose setup for transforming data to RDF and load it toVirtuoso
triple store.
Edit config_rdfizer.ini
file as follows:
Set the main directory in the [default]
section
[default]
main_directory: ./
Main directory for this setting is now, current folder.
[datasets]
number_of_datasets: 8
output_folder: ${default:main_directory}/rdf-dump
all_in_one_file: yes
remove_duplicate: yes
name: observation-data
dbtype: mysql
In datasets
section, you can set global parameters such as the number of datasets, output folder, how dump file should be created, name of the dump file (if dump is set to be saved in one file), database type, etc.
number_of_datasets
- how many datasets to transforms (create an rdf dump for). e.g., 8output_folder
- sets where the transformed dump to be saved, e.g.,${default:main_directory}/rdf-dump
all_in_one_file
- takesyes
orno
values, and sets whether to put all datasets in one file or in a separate file names. If setyes
, then thename
parameter will be used as the name of the file, i.e., "$(name)".nt and stored inoutput_folder
remove_duplicate
- takesyes
orno
values, and sets wherther duplicates should be removed while generatting RDF triples from single source with different tables/files or multiple data sources that might have duplicate values while applying the transformation.name
- sets the name of output RDF dump, ifall_in_one_file
parameter is set toyes
.dbtype
- sets the data source type, e.g.,csv
,mysql
,postgres
,json
,xml
, etc.
Once the [default]
and [datasets]
section is configures, then you need to put as many dataset specific setting as the number_of_datasets
specified in [datasets]
section.
For example, the first dataset, wind_farm_properties
, is configured as follows
[dataset1]
name: pilot2a_wind_farm_props
user:root
password:1234
host:192.168.0.2
port:3306
db:platoon_db
mapping: ${default:main_directory}/mappings/Wind-Farm/wind-farm.ttl
In case that the data source is not a relational data base, the configuration file can be written as follows:
[dataset1]
name: pilot2a_wind_farm_props
mapping: ${default:main_directory}/mappings/Wind-Farm/wind-farm.ttl
Note on the dataset number, [dataset1]
, dataset number 1 out of 8 datasets in this configuration. Each dataset will have its own configuration param values. In the snippet above, we set the name of dataset (this name will be used if all_in_one_file
param of the global param is set to no
). Other settings include: user
and password
- user name and password of the user to access the database, host
, port
, and db
- hostname, port and database name of the dataset (dataset1->wind_farm_props), and finally the mapping
param specifies where the RML mapping rules file is located. RML mapping rules need to conform the RML Spec.
There need to be at least 8 unique names of [dataset_n]
sections specifiying the parameters of each datasets in this configuration.
Run rdfizer
tool to create the RDF dump according to the above configuration and mapping files included in this config.
- Install
SDM-RDFizer
python3 -m pip install rdfizer
- Then run
rdfizer
script
cd PLATOONPipeline/
python3 -m rdfizer -c config.ini
This will create the RDF dumps according the configuration file, config.ini
.
- Run the docker compose file included in this repository.
(Prerequisite: Docker-ce, Docker-compose)
docker-compose up -d
- Then run
rdfizer
script and load data to virtuoso
- Transform data
The docker container created above using the docker-compose.yaml file will attach this repository as volume at /data
endpoint. So running rdfizer
script as follows will yield the same result as Option 1
above.
cd PLATOONPipeline/
docker exec -it sdmrdfizer python3 -m rdfizer -c /data/config.ini
This will create the RDF dumps according the configuration file, config.ini
, and store the RDF dump in /data/
volume, which in turn in "PLATOONPipeline/".
You can find the raw RDF file in .nt
serialization inside
- Load the RDF dump to Virtuoso
To load the generated RDF dump in step 2, we will use a script included in /data/scripts/
folder as follows:
docker exec -it sdmrdfizer python3 /data/scripts/load_to_virtuoso.py
OR to transform and load data automatically, run the following:
docker exec -it sdmrdfizer python3 /data/scripts/transform_and_load.py -c /data/config.ini
transform_and_load.py
script performs the transformation step and loading to virtuoso after the transformation is performed.
Before running this, make sure you update the environmental variable in the docker-compose.yml
file as follows:
environment:
- SPARQL_ENDPOINT_IP=pilot2akg
- SPARQL_ENDPOINT_USER=dba
- SPARQL_ENDPOINT_PASSWD=dba
- SPARQL_ENDPOINT_PORT=1116
- SPARQL_ENDPOINT_GRAPH=http://platoon.eu/Pilot2A/KG
- RDF_DUMP_FOLDER_PATH=/data/rdf-dump
- Open http://localhost:8891/sparql on your browser
For example, write the following query to see the available classes (Concepts) in this endpoint:
SELECT DISTINCT ?Concept
WHERE {
GRAPH <http://platoon.eu/Pilot2A/KG> {
?s a ?Concept
}
} LIMIT 1000
- Configuring DeTrusty
Generate the source description for DeTrusty:
docker exec -it DeTrusty bash -c 'create_rdfmts.py -s /path/to/your/config/file -j'
The configuration file for create_rdfmts.py is a JSON mapping the endpoint URL to the mapping files. Example:
{
"http://example.org": { "mappings": ["/mappings/Wind-Farm/wind-farm.ttl"] }
}
Then tell DeTrusty to reload the configuration:
docker exec -it detrusty /DeTrusty/Scripts/restart_workers.sh
- Executing Query with DeTrusty
This API call is used to send a query to the federation and retrieve the result. Queries can be executed from http://localhost:5000/sparql. The result will be returned as a JSON (see example below).
Example call:
curl -X POST -d "query=SELECT ?s WHERE { ?s a <http://dbpedia.org/ontology/Scientist> } LIMIT 10" localhost:5000/sparql
Example output for the above query (shortened to two results):
{
"cardinality": 10,
"execution_time": 0.1437232494354248,
"output_version": "2.0",
"head": { "vars": ["s"] },
"results": {
"bindings": [
{
"__meta__": { "is_verified": True },
"s": {
"type": "uri",
"value": "http://dbpedia.org/resource/A.E._Dick_Howard"
}
},
{
"__meta__": { "is_verified": True },
"s": {
"type": "uri",
"value": "http://dbpedia.org/resource/A.F.P._Hulsewé"
}
},
]
}
}
'cardinality' is the number (integer) of results retrieved, 'execution_time' (float) gives the time in seconds the query engine has spent collecting the results, 'output_version' (string) indicates the version number of the output format, i.e., to differentiate the current output from possibly changed output in the future, 'variables' (list) returns a list of the variables found in the query, 'result' is a list of dictionaries containing the results of the query, using the variables as keys; metadata about the result verification is included in the key '__meta__'. The current version returns all results as verified as can be seen in the key 'is_verified' of the metadata.
When sending a SPARQL 1.1 query with the SERVICE clause add the sparql1_1 parameter as in the following example call:
curl -X POST -d "query=SELECT ?s WHERE { SERVICE <https://dbpedia.org/sparql> { ?s a <http://dbpedia.org/ontology/Scientist> }} LIMIT 10" -d "sparql1_1=True" localhost:5000/sparql