Add schema.org markup on datasets page #12

Open
pzwsk opened this issue Feb 16, 2021 · 14 comments
@pzwsk

pzwsk commented Feb 16, 2021

More info on how-to here https://ai.googleblog.com/2018/09/building-google-dataset-search-and.html

@pzwsk pzwsk self-assigned this Feb 16, 2021
@matamadio

matamadio commented Mar 2, 2021

NOTE: Schema.org's Dataset vocabulary was originally based on DCAT, which in turn used Dublin Core and FOAF terms. JKAN is based on DCAT schema.

@ConnectedSystems
Collaborator

@cgiovando

Would you know if search engines are able to understand embedded DCAT vocabularies (as they are implemented in JKAN in particular)?

It seems there are mappings between DCAT and Schema.org already (or at least subsets of DCAT, see here)

Would embedding Schema.org metadata alongside DCAT bring about any further enhancements?

@pzwsk
Author

pzwsk commented Mar 9, 2021

Hi @ConnectedSystems, if you are talking about web search engines, you might be interested in reading the articles below:

https://www.blog.google/products/search/making-it-easier-discover-datasets/
https://developers.google.com/search/docs/data-types/dataset#approach

@ConnectedSystems
Collaborator

Hi @pzwsk

Yes, thank you. The first link had the information I was after.

Here it says:

"We can understand structured data in Web pages about datasets, using either schema.org Dataset markup, or equivalent structures represented in W3C's Data Catalog Vocabulary (DCAT) format"

Given JKAN already embeds DCAT markup, I'm hesitant to add Schema.org markup on top of it (it will be a lot of time and effort to do so), hence why I ask about the advantages/enhancements adding Schema.org would bring.

That said, when I say "embeds DCAT", this is true only for the built-in JKAN fields. For instance, I have not embedded DCAT markup alongside the information in custom tables (e.g., fields under "RDL Hazard Info" and "Additional Info" on this page).

I could add DCAT markup or Schema.org markup to these fields, but again, hesitant to do both.

@pzwsk
Author

pzwsk commented Mar 15, 2021 via email

@ldodds

ldodds commented Apr 14, 2021

Google Dataset Search does support both DCAT and Schema.org, although they recommend the latter.

I had a look at the DCAT embedded in the JKAN pages using this RDFa extractor, and it seems to parse fine, although the OGP properties are mixed in with the Dataset.

So it should be possible for Google, at least, to index the pages in JKAN. I can't see the site in Google search, but they might not have harvested yet. If that covers the core requirement, then perhaps we don't need the Schema.org markup as well?

Looking at the World Bank Data Hub, they are embedding both sets of metadata. They're using RDFa to provide the DCAT metadata (as we are with JKAN). They're using a variety of extra schemas too (see example output).

To provide the Schema.org metadata they're using an embedded JSON-LD block:

<script type="application/ld+json">...</script>

This probably simplifies things, as it avoids having to add both sets of properties across the HTML page. But it still requires a conversion of the metadata to the other format.
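
As a rough illustration of that conversion step, the Python sketch below maps a handful of DCAT / Dublin Core terms onto Schema.org properties and wraps the result in a JSON-LD script block. The input dictionary shape, the `to_schema_org` and `render_script_tag` helpers, and the example.org URL are all hypothetical. The term mapping is a small assumed subset, not what JKAN or the World Bank Data Hub actually implement.

```python
import json

# Assumed subset of a DCAT -> Schema.org mapping (illustrative, not exhaustive).
DATASET_TERMS = {
    "dcterms:title": "name",
    "dcterms:description": "description",
    "dcterms:license": "license",
    "dcat:keyword": "keywords",
}
DISTRIBUTION_TERMS = {
    "dcterms:title": "name",
    "dcat:downloadURL": "contentUrl",
    "dcat:mediaType": "encodingFormat",
}


def _translate(record, terms):
    """Copy the values whose keys appear in `terms`, renaming the keys."""
    return {terms[key]: value for key, value in record.items() if key in terms}


def to_schema_org(dataset):
    """Convert a DCAT-style dict (hypothetical shape) to a Schema.org Dataset."""
    doc = {"@context": "https://schema.org/", "@type": "Dataset"}
    doc.update(_translate(dataset, DATASET_TERMS))
    distributions = [
        {"@type": "DataDownload", **_translate(dist, DISTRIBUTION_TERMS)}
        for dist in dataset.get("dcat:distribution", [])
    ]
    if distributions:
        doc["distribution"] = distributions
    return doc


def render_script_tag(doc):
    """Wrap the JSON-LD document in a script element for embedding in a page."""
    body = json.dumps(doc, indent=2)
    # Escape "</" so that a metadata value cannot close the script element early.
    body = body.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical record; only the dataset title comes from the thread.
example = {
    "dcterms:title": "Afghanistan agriculture",
    "dcterms:description": "Placeholder description for illustration.",
    "dcat:keyword": ["agriculture", "exposure"],
    "dcat:distribution": [
        {
            "dcterms:title": "Afghanistan agriculture",
            "dcat:downloadURL": "https://example.org/data/afghanistan-agriculture.csv",
            "dcat:mediaType": "text/csv",
        }
    ],
}

print(render_script_tag(to_schema_org(example)))
```

Because the block is generated from the same metadata record that feeds the RDFa attributes, the two vocabularies would stay in sync without touching the rest of the page template.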

@ldodds

ldodds commented Apr 14, 2021

In reviewing this I've noticed some bugs in the DCAT metadata, as parsed by the RDFa extractor linked above:

  • I don't think the metadata download should be marked up as a distribution, as it doesn't contain the data.
  • the title property seems to be picking up the titles of the distributions, not the dataset title. That might be related to the above
  • the OGP description property should have same value as the dataset description?

@ConnectedSystems
Collaborator

Thanks for the review @ldodds

Would this be resolved by having a dedicated endpoint (#14) ?

Otherwise:

> I don't think the metadata download should be marked up as a distribution, as it doesn't contain the data.

For clarity, this was the default behavior for files exposed via JKAN which I copied when modifying for RDL.
But I guess this goes back to how "data" is defined/framed.
From my perspective this metadata is data describing the dataset, and is made available as a distributed resource.

But semantics aside, would dcat:CatalogRecord be more acceptable?
(I suspect not but trying to find a suitable alternative).

> the title property seems to be picking up the titles of the distributions, not the dataset title. That might be related to the above

Sorry, I am missing something here.

If we take this entry as an example, the title of the dataset is "Afghanistan agriculture", and the given resource name (the distribution) matches.

Are you suggesting that the distribution title should match the resource filename, or otherwise be made different from the resource name?

> the OGP description property should have same value as the dataset description?

Assuming this is the Open Graph Protocol, I suggest we disable the OGP feature.
The base JKAN implementation is set up to include OGP at the page level (hence why you're seeing different values), and the only configuration provided is "on or off".

Modifying this to be more configurable is a much larger body of work.

@matamadio

matamadio commented Apr 21, 2021

Currently we have a general title and description for the whole dataset page, and specific titles and descriptions for each of the resources/distributions (shown on "details").
Related to what was said in GFDRR/rdl-standard#7, would it be better to split resources and have more univocal/precise metadata? I.e. one dataset page > one distribution.

@ldodds

ldodds commented Apr 21, 2021

A Distribution is "A specific representation of a dataset. A dataset might be available in multiple serializations that may differ in various ways, including natural language, media-type or format, schematic organization, temporal and spatial resolution, level of detail or profiles".

The metadata is a description of the dataset, rather than a distribution of it. So it doesn't really fit with serving it as a Distribution in my opinion. A Catalog Record is closer perhaps but also doesn't look quite right.

Portals sometimes have a link to download the dataset metadata (which is displayed and embedded into the page) in different formats, e.g. on WB DH: "The information on this page (the dataset metadata) is also available in these formats...". But that's different to the resources associated with the dataset.

re: "the title property". I just meant that there's something wrong with the embedded RDFa markup. When extracting the metadata from the page, the dcterms:title property ends up with two values: "Afghanistan agriculture" and "Metadata". There's also only a single Distribution with the title of "Afghanistan agriculture". So something is not quite right in there, but I've not identified the source of the problem.

I've included a screenshot. I used the Structured Data Sniffer extension to try and show it.

Screenshot from 2021-04-21 13-57-22

Hope that helps.

@matamadio

Thanks, if I understand correctly:

  • different distributions should always refer to the same dataset (different version or language), and then should not be used for different subsets (e.g. one distribution is residential exposure and another one is industrial exposure);
  • those should be distinct datasets instead

@ldodds

ldodds commented Apr 21, 2021

@matamadio broadly yes. Sometimes you might split a large dataset over multiple files in different ways. I think it's legitimate to have those as separate distributions associated with the same dataset. The most common case is usually a single distribution per dataset.

My rule of thumb is that if there are any differences in the provenance or governance of the data (e.g. it's produced by a different process, or by a different organisation, or has different licensing) then it's a different dataset and will have its own distribution(s).
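
That rule of thumb can be sketched in Python as a grouping step. The `Resource` fields and the `group_into_datasets` helper are hypothetical names chosen for illustration, and the sample values are invented. The sketch only encodes the "different provenance or governance means a different dataset" direction. Resources that share a key could still be split for other reasons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One downloadable file plus the (hypothetical) fields used for grouping."""

    name: str
    process: str       # how the data was produced
    organisation: str  # who produced it
    licence: str       # terms it is released under


def group_into_datasets(resources):
    """Group resources that share provenance and governance.

    Each group would become one dataset, and its members would be that
    dataset's distributions.
    """
    groups = {}
    for res in resources:
        key = (res.process, res.organisation, res.licence)
        groups.setdefault(key, []).append(res.name)
    return list(groups.values())


# Invented sample: one large dataset split over two files, plus a third
# file released under different licensing.
resources = [
    Resource("survey-part-1.csv", "field survey", "Org A", "CC-BY-4.0"),
    Resource("survey-part-2.csv", "field survey", "Org A", "CC-BY-4.0"),
    Resource("survey-restricted.csv", "field survey", "Org A", "custom"),
]

for members in group_into_datasets(resources):
    print(members)
```

Here the two survey parts stay together as distributions of one dataset, while the differently licensed file ends up as its own dataset.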

@matamadio

Alright, then I will need to split several sets after the schema update.

@ConnectedSystems
Collaborator

ConnectedSystems commented Apr 27, 2021

Hi @ldodds

> re: "the title property". I just meant that there's something wrong with the embedded RDFa markup. When extracting the metadata from the page, the dcterms:title property ends up with two values: "Afghanistan agriculture" and "Metadata".

I think the reason for this is that the base JKAN template assigns a dcterms:title property to each associated resource entry (and I subsequently used that as a basis for the RDL metadata file). At the same time, JKAN assigns a dcat:Dataset property for each file resource, but does not assign a distribution tag at all, so it seems all properties get lumped together with the parent, page-level specifications (hence why all the dcterms:title tags get lumped together).

The implemented approach may not align with DCAT completely either, as the DCAT v2 documentation appears to suggest that Datasets can represent collections of Distributions (as per your statement above re definition of "Distribution").

I've tentatively adjusted the JKAN template (only on local dev) such that dcat:Dataset property is provided on a once-per-page basis, with all resources listed therein marked as dcat:Distribution.

In this way, each dcterms:title gets associated with a Distribution.
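
To illustrate why the wrapper matters, here is a minimal Python sketch that groups dcterms:title values by the nearest enclosing element carrying a `typeof` attribute. It is a deliberate simplification of RDFa subject resolution (it ignores `about`, `resource`, `rel` and prefix handling), and the `BEFORE` and `AFTER` snippets are hypothetical markup written for this example, not the actual JKAN template.

```python
from html.parser import HTMLParser


class TitleScopes(HTMLParser):
    """Group dcterms:title values by the nearest enclosing `typeof` element.

    Assumes well-formed markup without unclosed void elements, and title
    elements that contain only text.
    """

    def __init__(self):
        super().__init__()
        self.stack = []      # scope index (or None) for each open element
        self.scopes = []     # (type, [titles]) pairs in document order
        self.capture = None  # scope index while inside a title element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        scope = self.stack[-1] if self.stack else None
        if "typeof" in attrs:
            # A typed element opens a new scope for the properties inside it.
            self.scopes.append((attrs["typeof"], []))
            scope = len(self.scopes) - 1
        self.stack.append(scope)
        if attrs.get("property") == "dcterms:title" and scope is not None:
            self.capture = scope

    def handle_endtag(self, tag):
        self.stack.pop()
        self.capture = None

    def handle_data(self, data):
        if self.capture is not None and data.strip():
            self.scopes[self.capture][1].append(data.strip())


def titles_by_scope(html):
    parser = TitleScopes()
    parser.feed(html)
    parser.close()
    return parser.scopes


# Hypothetical markup: resource titles sit directly inside the Dataset scope.
BEFORE = """
<div typeof="dcat:Dataset">
  <h1 property="dcterms:title">Afghanistan agriculture</h1>
  <ul>
    <li><span property="dcterms:title">Afghanistan agriculture</span></li>
    <li><span property="dcterms:title">Metadata</span></li>
  </ul>
</div>
"""

# Hypothetical markup: each resource is wrapped in its own Distribution scope.
AFTER = """
<div typeof="dcat:Dataset">
  <h1 property="dcterms:title">Afghanistan agriculture</h1>
  <ul>
    <li property="dcat:distribution" typeof="dcat:Distribution">
      <span property="dcterms:title">Afghanistan agriculture</span>
    </li>
    <li property="dcat:distribution" typeof="dcat:Distribution">
      <span property="dcterms:title">Metadata</span>
    </li>
  </ul>
</div>
"""

print(titles_by_scope(BEFORE))
print(titles_by_scope(AFTER))
```

In the first snippet every title lands on the single Dataset scope. In the second, the Dataset keeps only its own title and each Distribution carries one. A real RDFa processor applies more rules than this, so an extractor remains the authoritative check.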

image

> The metadata is a description of the dataset, rather than a distribution of it. So it doesn't really fit with serving it as a Distribution in my opinion. A Catalog Record is closer perhaps but also doesn't look quite right.

If I've interpreted the DCAT v2 documentation example correctly (and there's a good chance I haven't), then the Distribution type can also be used for accompanying metadata, as in the example shown and linked below:

dcat:distribution [
      rdf:type dcat:Distribution ;
      dct:title "RDF/XML representation of the ontology used for the data"@en ;
      dcat:downloadURL <http://resource.geosciml.org/ontology/timescale/gts.rdf> ;
      dcat:mediaType <https://www.iana.org/assignments/media-types/application/rdf+xml> ;
]

https://www.w3.org/TR/vocab-dcat-2/#ex-elaborated-bag

I've also updated the dcat:accessURL for Distributions and rdl-metadata files to dcat:downloadURL.
Again these changes are only made locally until we are in agreement.
