Add schema.org markup on datasets page #12
Comments
NOTE: Schema.org's Dataset vocabulary was originally based on DCAT, which in turn used Dublin Core and FOAF terms. JKAN is based on the DCAT schema. |
Would you know if search engines are able to understand embedded DCAT vocabularies (as they are implemented in JKAN in particular)? It seems there are mappings between DCAT and Schema.org already (or at least subsets of DCAT, see here). Would embedding Schema.org metadata alongside DCAT bring about any further enhancements? |
Hi @ConnectedSystems, if you are talking about web search engines, you might be interested in reading the article below: https://www.blog.google/products/search/making-it-easier-discover-datasets/ |
Hi @pzwsk Yes, thank you. The first link had the information I was after. Here it says: "We can understand structured data in Web pages about datasets, using either schema.org Dataset markup, or equivalent structures represented in W3C's Data Catalog Vocabulary (DCAT) format" Given JKAN already embeds DCAT markup, I'm hesitant to add Schema.org markup on top of it (it will be a lot of time and effort to do so), hence why I ask about the advantages/enhancements adding Schema.org would bring. That said, when I say "embeds DCAT", this is true only for the built-in JKAN fields. For instance, I have not embedded DCAT markup alongside the information in custom tables (e.g., fields under "RDL Hazard Info" and "Additional Info" on this page). I could add DCAT markup or Schema.org markup to these fields, but again, hesitant to do both. |
Thanks, Taku. I'm not sure either that there is a clear need to go further at this stage. The next step is to contact the platforms that might harvest us (WB Data Hub, Google Dataset Search, etc.).
At least we can and should state in the documentation of our JKAN instance that the core metadata are exposed in DCAT format.
Best,
On Mon, Mar 15, 2021 at 11:00 AM Takuya Iwanaga wrote:
> Hi @pzwsk
> Yes, thank you. The first link had the information I was after.
> Here <https://developers.google.com/search/docs/data-types/dataset#approach> it says:
> "We can understand structured data in Web pages about datasets, using either schema.org Dataset markup, or equivalent structures represented in W3C's Data Catalog Vocabulary (DCAT) format"
> Given JKAN already embeds DCAT markup, I'm hesitant to add Schema.org markup on top of it (it will be a lot of time and effort to do so), hence why I ask about the advantages/enhancements adding Schema.org would bring.
> That said, when I say "embeds DCAT", this is true only for the built-in JKAN fields. For instance, I have not embedded DCAT markup alongside the information in custom tables (e.g., fields under "RDL Hazard Info" and "Additional Info" on this page <http://jkan.riskdatalibrary.org/datasets/hzd-afg-dr/>).
> I could add DCAT markup or Schema.org markup to these fields, but again, hesitant to do both.
|
Google Dataset Search does support both DCAT and Schema.org, although they recommend the latter. I had a look at the DCAT embedded in the JKAN pages using this RDFa extractor, and it seems to parse fine, although the OGP properties are mixed in with the Dataset. So it should be possible for Google, at least, to index the pages in JKAN. I can't see the site in Google search, but they might not have harvested it yet. If that covers the core requirement, then perhaps we don't need the Schema.org markup as well? Looking at the World Bank Data Hub, they are embedding both sets of metadata. They're using RDFa to provide the DCAT metadata (as we are with JKAN). They're using a variety of extra schemas too (see example output). To provide the Schema.org metadata they're using an embedded JSON-LD block:
This probably simplifies things, as it avoids having to add both sets of properties across the HTML page. But it still requires a conversion of the metadata to the other format. |
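For anyone weighing up the conversion step mentioned above, here is a minimal sketch of what it might look like. It is not JKAN's or the World Bank Data Hub's implementation: the flat record layout, the `DCAT_TO_SCHEMA` table, and the function names `to_schema_org` and `to_script_tag` are all assumptions made for illustration, and the mapping covers only a handful of properties.

```python
import json

# Hypothetical, partial mapping from DCAT/Dublin Core terms to Schema.org
# Dataset properties. A real mapping would need checking against the
# published DCAT/Schema.org alignments.
DCAT_TO_SCHEMA = {
    "dct:title": "name",
    "dct:description": "description",
    "dct:license": "license",
    "dcat:keyword": "keywords",
}


def to_schema_org(dataset: dict) -> dict:
    """Convert a flat DCAT-style record (assumed layout) to a Schema.org Dataset."""
    doc = {"@context": "https://schema.org/", "@type": "Dataset"}
    for dcat_key, schema_key in DCAT_TO_SCHEMA.items():
        if dcat_key in dataset:
            doc[schema_key] = dataset[dcat_key]

    # Each DCAT distribution becomes a Schema.org DataDownload.
    downloads = []
    for dist in dataset.get("dcat:distribution", []):
        download = {"@type": "DataDownload", "contentUrl": dist["dcat:downloadURL"]}
        if "dct:format" in dist:
            download["encodingFormat"] = dist["dct:format"]
        downloads.append(download)
    if downloads:
        doc["distribution"] = downloads
    return doc


def to_script_tag(dataset: dict) -> str:
    """Wrap the converted record in a JSON-LD script element for the page."""
    body = json.dumps(to_schema_org(dataset), indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # Made-up record; the URL is a placeholder.
    record = {
        "dct:title": "Example dataset",
        "dcat:keyword": ["hazard", "example"],
        "dcat:distribution": [
            {"dcat:downloadURL": "https://example.org/data.csv", "dct:format": "text/csv"}
        ],
    }
    print(to_script_tag(record))
```

The appeal of this approach is that the JSON-LD block is generated once per page from the same metadata that already drives the RDFa attributes, so the HTML templates themselves would not need a second set of properties.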
In reviewing this I've noticed some bugs in the DCAT metadata, as parsed by the RDFa extractor linked above:
|
Thanks for the review, @ldodds. Would this be resolved by having a dedicated endpoint (#14)? Otherwise:
For clarity, this was the default behavior for files exposed via JKAN which I copied when modifying for RDL. But semantics aside, would
Sorry, I am missing something here. If we take this entry as an example, the title of the dataset is "Afghanistan agriculture", and the given resource name (the distribution) matches. Are you suggesting that the distribution should match the resource filename, or otherwise made different from the resource name?
Assuming this is the Open Graph Protocol, I suggest we disable the OGP feature. Modifying this to be more configurable is a much larger body of work. |
Currently we have a general title and description for the whole dataset page, and specific titles and descriptions for each of the resources/distributions (shown on "details"). |
A Distribution is "A specific representation of a dataset. A dataset might be available in multiple serializations that may differ in various ways, including natural language, media-type or format, schematic organization, temporal and spatial resolution, level of detail or profiles". The metadata is a description of the dataset, rather than a distribution of it. So, in my opinion, it doesn't really fit to serve it as a Distribution. A Catalog Record is perhaps closer, but also doesn't look quite right. Portals sometimes have a link to download the dataset metadata (which is displayed and embedded into the page) in different formats, e.g. on WB DH: "The information on this page (the dataset metadata) is also available in these formats...". But that's different to the resources associated with the dataset. re: "the I've included a screenshot. I used the Structured Data Sniffer extension to try and show it. Hope that helps. |
Thanks, if I understand correctly:
|
@matamadio broadly yes. Sometimes you might split a large dataset over multiple files in different ways. I think it's legitimate to have those as separate distributions associated with the same dataset. The most common case is usually a single distribution per dataset. My rule of thumb is that if there are any differences in the provenance or governance of the data (e.g. it's produced by a different process, or by a different organisation, or has different licensing) then it's a different dataset and will have its own distribution(s). |
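The rule of thumb above can be sketched in code as a grouping step: files that share the same organisation, process, and licence stay together as distributions of one dataset, and any difference starts a new dataset. This is only an illustration; the `Resource` fields and the `group_into_datasets` helper are hypothetical and do not correspond to anything in JKAN.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One downloadable file. All field names are assumptions for this sketch."""
    url: str
    publisher: str
    process: str
    license: str


def group_into_datasets(resources):
    """Group files so that each dataset has one provenance and one licence.

    Returns a dict mapping (publisher, process, license) to the list of
    file URLs that would become that dataset's distributions.
    """
    datasets = defaultdict(list)
    for res in resources:
        key = (res.publisher, res.process, res.license)
        datasets[key].append(res.url)
    return dict(datasets)


if __name__ == "__main__":
    # Made-up files: two from the same process, one from a different organisation.
    files = [
        Resource("https://example.org/flood-2019.csv", "Org A", "model-v1", "CC-BY-4.0"),
        Resource("https://example.org/flood-2020.csv", "Org A", "model-v1", "CC-BY-4.0"),
        Resource("https://example.org/flood-survey.csv", "Org B", "survey", "ODbL-1.0"),
    ]
    for key, urls in group_into_datasets(files).items():
        print(key, urls)
```

In practice the judgement about what counts as "the same process" is a human one, so a grouping like this would at most be a starting point for deciding which existing entries need splitting.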
Alright, then I will need to split several sets after the schema update. |
Hi @ldodds
I think the reason for this is because the base JKAN template assigns a The implemented approach may not align with DCAT completely either, as the DCAT v2 documentation appears to suggest that Datasets can represent collections of Distributions (as per your statement above re definition of "Distribution"). I've tentatively adjusted the JKAN template (only on local dev) such that In this way, each
If I've interpreted the DCAT v2 documentation example correctly (and good chance I haven't) then the
https://www.w3.org/TR/vocab-dcat-2/#ex-elaborated-bag I've also updated the |
More info on how-to here https://ai.googleblog.com/2018/09/building-google-dataset-search-and.html