Query Subsetting #10
For countries with a small number of authors, such as Luxembourg (Q32), we can use the wikidata-entity-extractor tool to extract the subset of authors from that country. The tool uses the Wikidata API to concurrently collect all the data associated with each entity returned by the query.
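To illustrate the idea of concurrent collection, here is a minimal Python sketch. It is not the wikidata-entity-extractor implementation: the names `fetch_entity` and `collect_entities`, the worker count, and the User-Agent string are all assumptions made for this example. It relies only on the public `Special:EntityData` URL of Wikidata.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Public Wikidata URL that returns the JSON document of a single entity.
ENTITY_URL = "https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def fetch_entity(qid, timeout=30):
    """Download the JSON document for one entity (performs a network call)."""
    request = urllib.request.Request(
        ENTITY_URL.format(qid=qid),
        # Hypothetical User-Agent; replace it with one identifying your tool.
        headers={"User-Agent": "subset-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def collect_entities(qids, fetch=fetch_entity, max_workers=8):
    """Fetch every entity concurrently and return a {qid: data} mapping."""
    unique = list(dict.fromkeys(qids))  # drop duplicates, keep first-seen order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as its input
        return dict(zip(unique, pool.map(fetch, unique)))
```

The `fetch` parameter is injectable so that the concurrency logic can be tried out with a stub before hitting the real API, for example `collect_entities(["Q32"], fetch=lambda qid: {"id": qid})`.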
After setting up a local Blazegraph instance loaded with the Luxembourg authors subset, we ran the authors subquery against both the Wikidata and the Blazegraph endpoints:
SELECT DISTINCT ?author WHERE {
  ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q32 .
}
We obtained the following results, in milliseconds:
As you can see, the query performs better against the local Blazegraph endpoint.
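A comparison like this one could be reproduced with a small timing harness. The following Python sketch is an assumption about how such a measurement might be done, not the procedure actually used here. The helper names `run_query` and `time_ms` and the User-Agent string are hypothetical. The two endpoint URLs are the ones that appear in this thread. The query declares its prefixes explicitly, on the assumption that a local Blazegraph instance may not predefine them.

```python
import json
import time
import urllib.parse
import urllib.request

WDQS = "https://query.wikidata.org/sparql"
LOCAL = "http://156.35.82.22/bigdata/sparql"

AUTHORS_QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT DISTINCT ?author WHERE {
  ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q32 .
}
"""


def run_query(endpoint, query, timeout=120):
    """Send a SPARQL query via form-encoded POST and return the parsed JSON."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={
            "Accept": "application/sparql-results+json",
            # Hypothetical User-Agent for this example
            "User-Agent": "subset-sketch/0.1 (example)",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def time_ms(run, repeats=5):
    """Call run() `repeats` times and return each wall-clock time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings
```

For example, `time_ms(lambda: run_query(LOCAL, AUTHORS_QUERY))` and `time_ms(lambda: run_query(WDQS, AUTHORS_QUERY))` give two lists of timings that can be compared by their medians. Wall-clock timings include network latency, so the public endpoint and a nearby local instance are not measured under identical conditions.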
The next step was to run the full country authors query for Luxembourg from the WDQS, with the authors subquery federated to our local Blazegraph instance:
WITH {
  SELECT DISTINCT ?author WHERE {
    SERVICE <http://156.35.82.22/bigdata/sparql> {
      ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q32 .
    }
  }
} AS %authors
However, federation from the WDQS is only possible for a limited list of endpoints. So we went the other way round: we ran the full query on the Blazegraph endpoint and federated the rest of the query to Wikidata:
SELECT
  ?number_of_citing_works
  ?author ?authorLabel
  ?organization ?organizationLabel
  ?example_work ?example_workLabel
WITH {
  SELECT DISTINCT ?author WHERE {
    ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:Q32 .
  }
} AS %authors
WITH {
  SELECT
    ?author
    (COUNT(DISTINCT ?citing_work) AS ?number_of_citing_works)
    (SAMPLE(?organization_) AS ?organization)
    (SAMPLE(?work) AS ?example_work)
  WHERE {
    INCLUDE %authors
    SERVICE <https://query.wikidata.org/sparql> {
      ?work wdt:P50 ?author .
      OPTIONAL { ?citing_work wdt:P2860 ?work . }
      OPTIONAL {
        ?author wdt:P1416 | wdt:P108 ?organization_ .
        ?organization_ wdt:P17 wd:Q32
      }
    }
  }
  GROUP BY ?author
} AS %results
WHERE {
  INCLUDE %results
}
ORDER BY DESC(?number_of_citing_works)
We have compared the performance of the original query run on the WDQS with this query run on the local Blazegraph instance. The results (in milliseconds) are as follows:
As you can see, in this case there is no improvement in performance. In fact, the federated query performs much worse than the original query.
Another approach to improving the performance of Scholia queries could be to extract a subset of entities for each query. For example, in the country authors query, the first subquery extracts all the authors for a given country:
Then, for each author, a further query is made:
Our aim is to extract a subset of these authors in order to decrease the overall query time.
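As a rough illustration of the first step, the Python sketch below builds the authors subquery for a given country and pulls the author QIDs out of a standard SPARQL JSON result. The function names `authors_query` and `author_qids` are hypothetical, and the triple pattern is the one used for Luxembourg elsewhere in this thread. How the resulting subset is then loaded into a local store is outside the scope of this sketch.

```python
import re

# Authors subquery with the country QID left as a placeholder.
AUTHORS_TEMPLATE = """SELECT DISTINCT ?author WHERE {{
  ?author wdt:P27 | wdt:P1416/wdt:P17 | wdt:P108/wdt:P17 wd:{qid} .
}}"""

ENTITY_PREFIX = "http://www.wikidata.org/entity/"


def authors_query(country_qid):
    """Return the authors subquery for one country, e.g. 'Q32'."""
    # Validate the QID so that arbitrary text cannot be spliced into the query.
    if not re.fullmatch(r"Q[1-9][0-9]*", country_qid):
        raise ValueError(f"not a Wikidata QID: {country_qid!r}")
    return AUTHORS_TEMPLATE.format(qid=country_qid)


def author_qids(results):
    """Extract author QIDs from a SPARQL JSON results document."""
    qids = []
    for binding in results["results"]["bindings"]:
        uri = binding["author"]["value"]
        if uri.startswith(ENTITY_PREFIX):
            qids.append(uri[len(ENTITY_PREFIX):])
    return qids
```

The list returned by `author_qids` is the subset of entities whose data would then need to be collected for the rest of the query.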