Dataset Series support #298

amercader · 2024-09-02T13:50:46Z

DCAT 3 introduced a new class for Dataset Series, essentially defined as a collection of datasets that have a common characteristic (both DCAT-AP and DCAT-US provide implementation guidance and examples).

Note that the definition of Dataset Series is very loose and not necessarily restricted to time series. Some potential examples:

Budget data released monthly or yearly
Data split by country / region
Large data split into smaller chunks
Geospatial data distributed in grids

The concept of "collections" of datasets is not new in CKAN and there have been previous implementations:

The core Groups feature is similar, allowing datasets to be aggregated under a common category, although the link there is generally less direct (e.g. theme, category, etc)
https://github.com/6aika/ckanext-collection (not to be confused with another ckanext-collection) uses a custom group type to achieve similar results
https://catalog.data.gov implements a "collections" mechanism, where some Collection datasets are indexed, and you can search within that collection only. Collection members also show a link to the collection dataset. I might be wrong about this, but I think collection members do not appear in the standard dataset search.

These are all conceptually similar: a higher-level entity that datasets can belong to. In DCAT terms this is expressed using the dcat:DatasetSeries class and the dcat:inSeries property on each member dcat:Dataset.

ex:budget a dcat:DatasetSeries ;
  dcterms:title "Budget data"@en ;
  .
  
ex:budget-2024 a dcat:Dataset ;
  dcterms:title "Budget data for year 2024"@en ;
  dcat:inSeries ex:budget ;
  .

dcat:DatasetSeries shares a subset of dcat:Dataset properties (e.g. in DCAT-AP v3 or DCAT-US v3).

When there is a linear relation between the datasets in the series, links and navigation can be implemented using dcat:first, dcat:prev, and dcat:last (and the inverse dcat:next), e.g.:

ex:budget a dcat:DatasetSeries ;
  dcterms:title "Budget data"@en ;
  dcat:first ex:budget-2018 ;
  dcat:last ex:budget-2020 ;
  .
  
ex:budget-2018 a dcat:Dataset ;
  dcterms:title "Budget data for year 2018"@en ;
  dcat:inSeries ex:budget ;
  dcat:next ex:budget-2019 ;
  .
  
ex:budget-2019 a dcat:Dataset ;
  dcterms:title "Budget data for year 2019"@en ;
  dcat:inSeries ex:budget ;
  dcat:prev ex:budget-2018 ;
  dcat:next ex:budget-2020 ;
  .
  
ex:budget-2020 a dcat:Dataset ;
  dcterms:title "Budget data for year 2020"@en ;
  dcat:inSeries ex:budget ;
  dcat:prev ex:budget-2019 ;
  .

Potential implementations

Setting aside the navigation part of it, I think that implementing series using a custom dataset type (dataset_series) rather than Groups has more benefits. For starters, Dataset Series are subclasses of Datasets and share several properties. Secondly, we can index them for free so they can be returned as results of the standard dataset search, or excluded, depending on the instance's needs.

A custom in_series field can be added to the dataset schema, to store the id of the dataset_series dataset it belongs to. This field will also allow member datasets to not be displayed in the default search if that is a requirement.
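A minimal sketch of how this could look on the data side, assuming datasets are plain dicts similar to what the API returns. The `dataset_series` type and the `in_series` field come from the proposal above; the helper names and sample ids are made up for illustration and are not part of any existing extension.

```python
# Sketch only: dataset dicts shaped loosely like CKAN API output.
# `in_series` and the `dataset_series` type follow the proposal above;
# the helper names and the ids are hypothetical.

def is_series_member(dataset):
    """True if the dataset points at a series via `in_series`."""
    return bool(dataset.get("in_series"))


def default_listing(datasets, hide_members=True):
    """Datasets to show in the default search, optionally hiding members."""
    if not hide_members:
        return list(datasets)
    return [d for d in datasets if not is_series_member(d)]


def series_members(datasets, series_id):
    """Datasets that belong to the given series."""
    return [d for d in datasets if d.get("in_series") == series_id]


catalog = [
    {"id": "budget", "type": "dataset_series", "title": "Budget data"},
    {"id": "budget-2024", "type": "dataset", "in_series": "budget"},
    {"id": "roads", "type": "dataset"},
]

visible = [d["id"] for d in default_listing(catalog)]
members = [d["id"] for d in series_members(catalog, "budget")]
```

In a real instance the filtering would presumably happen in the search query rather than in Python, but the field semantics would be the same.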

Navigation (prev/next) within a series should be optional.

UI changes needed:

Some indicator in a Dataset Series search result
Some indicator in a Dataset Series dataset page
A "Datasets" tab in a Dataset Series dataset page that lists the member datasets and allows adding / removing them (and reordering them, see below)
Indicator that a dataset belongs to a series in search result page
Indicator that a dataset belongs to a series in the dataset page
(Optional) Next/Prev links on the Dataset page

The linking / navigation part is what presents the most challenges. We need an efficient way to:

Present what datasets are previous and next at the dataset level
Present what datasets are first and last at the dataset series level

So essentially storing the order of the datasets within the series. Both previous/next and first/last could be computed at index time.
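To make that concrete, here is a small sketch of deriving all four links from a stored order. It assumes the order is already available as a list of dataset ids; the function name and the shape of the returned dicts are illustrative only.

```python
def navigation_links(ordered_ids):
    """Return (series_links, member_links) for an ordered list of dataset ids.

    series_links holds first/last for the series itself;
    member_links maps each dataset id to its prev/next neighbours.
    """
    if not ordered_ids:
        return {}, {}
    series_links = {"first": ordered_ids[0], "last": ordered_ids[-1]}
    member_links = {}
    for i, dataset_id in enumerate(ordered_ids):
        links = {}
        if i > 0:
            links["prev"] = ordered_ids[i - 1]
        if i < len(ordered_ids) - 1:
            links["next"] = ordered_ids[i + 1]
        member_links[dataset_id] = links
    return series_links, member_links


series_links, member_links = navigation_links(
    ["budget-2018", "budget-2019", "budget-2020"]
)
```

The result mirrors the Turtle example above: the series gets `first` and `last`, the first member has no `prev`, and the last member has no `next`.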

We have similar cases in CKAN core with the resource and resource view ordering. ~~Resource order is just stored as the order in which resources are stored in the database~~ Resources have a position field, but it's not comparable because all resources are updated with a single package_update call. Resource views do have a dedicated order field and the resource_view_reorder action updates all DB records.

We could follow a similar approach to resource views for series. Although it would be nice not to have to rely on a new table (dataset_series with series_id, dataset_id, order columns), I don't think we can update the order efficiently for big series by updating custom fields in the dataset.
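A rough sketch of what such a table could look like, using an in-memory SQLite database purely for illustration (CKAN itself uses PostgreSQL through SQLAlchemy). The column is named `position` here instead of `order` only to avoid the SQL reserved word; the function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset_series (
        series_id  TEXT NOT NULL,
        dataset_id TEXT NOT NULL,
        position   INTEGER NOT NULL,
        PRIMARY KEY (series_id, dataset_id)
    )
    """
)


def reorder_series(conn, series_id, ordered_dataset_ids):
    """Rewrite the order of a whole series in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "DELETE FROM dataset_series WHERE series_id = ?", (series_id,)
        )
        conn.executemany(
            "INSERT INTO dataset_series (series_id, dataset_id, position) "
            "VALUES (?, ?, ?)",
            [(series_id, ds, pos) for pos, ds in enumerate(ordered_dataset_ids)],
        )


def neighbours(conn, series_id, dataset_id):
    """Return (prev_id, next_id) for a member; None at either end."""
    row = conn.execute(
        "SELECT position FROM dataset_series "
        "WHERE series_id = ? AND dataset_id = ?",
        (series_id, dataset_id),
    ).fetchone()
    if row is None:
        return None, None
    pos = row[0]
    found = dict(
        conn.execute(
            "SELECT position, dataset_id FROM dataset_series "
            "WHERE series_id = ? AND position IN (?, ?)",
            (series_id, pos - 1, pos + 1),
        ).fetchall()
    )
    return found.get(pos - 1), found.get(pos + 1)


reorder_series(conn, "budget", ["budget-2018", "budget-2019", "budget-2020"])
```

Reordering touches only rows in this table, so no dataset needs to be updated; whether that is fast enough for very large series is exactly what would need measuring.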

We would need to test performance for very large series.

wardi · 2024-09-02T19:45:13Z

Can a dataset be part of multiple series, e.g. separated by release and separated by region?

For ordering, the metadata should already include a field that can be used to order the results, e.g. a release date field or a region field, so we don't need to track order separately. Adjusting the order within a series becomes updating the metadata for a dataset.

We could allow navigation through dataset series from the display of the metadata field the series is based on, e.g. a date picker for release (only the dates with releases selectable) or a choice list for region.

How about we identify the fields that define a series as part of the schema definition, something like:

data_series_fields:
  # blank for non-data-series datasets, set to the same value for all datasets in the same series
  # required
  identifier_field: my_series_group
  # enable chronological series e.g. a date or year+month field:
  temporal_field: my_release_date
  # enable geographic region series e.g. a choice list of locations:
  region_field: my_region

  # enable series by chunk of data for large datasets e.g. an integer field:
  # partial_field: my_part_number
  # enable series by geographic grid identifier:
  # grid_field: my_grid

Facets in search will give us all the neighboring series datasets, and we don't need new tables or extensive changes to the UI.

amercader · 2024-09-10T11:01:40Z

Thanks for the feedback @wardi

> Can a dataset be part of multiple series, e.g. separated by release and separated by region?

Technically yes, although DCAT-AP discourages it to avoid complexity. In theory there is even support for nested series, but that would add even more complexity.

> For ordering, the metadata should already include a field that can be used to order the results so we don't need to track order separately

I get how this simplifies the implementation, but wouldn't that mean that we need to support each of these cases separately, as they all have slightly different implementations (e.g. sorting by date vs alphabetically by code vs by chunk of a larger file)? Perhaps we can abstract the cases a bit and have one for time-based sort, one for numeric sort and one for text sort, and maybe one that allows defining a callable for more complex sorts.
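As a sketch of that abstraction, assuming member datasets are dicts and the ordering field holds a string value: a small registry maps each generic sort type to a key function, with an optional callable for anything more complex. The registry, the function name and the sample fields are all hypothetical.

```python
from datetime import date

# Hypothetical registry: sort type name -> key function over the raw field value.
SORT_KEYS = {
    "time": date.fromisoformat,  # ISO dates such as "2024-01-31"
    "numeric": float,            # chunk or part numbers
    "text": str.casefold,        # codes or region names, case-insensitive
}


def sort_series(datasets, field, sort_type="text", key=None):
    """Order member datasets by `field`, using a named sort type or a callable."""
    key_func = key if key is not None else SORT_KEYS[sort_type]
    return sorted(datasets, key=lambda d: key_func(d[field]))


parts = [
    {"id": "part-10", "part": "10"},
    {"id": "part-2", "part": "2"},
    {"id": "part-1", "part": "1"},
]
by_number = [d["id"] for d in sort_series(parts, "part", "numeric")]
by_text = [d["id"] for d in sort_series(parts, "part", "text")]

# A custom callable for grid codes like "A10": sort by letter, then by number.
tiles = [{"id": "t3", "grid": "B2"}, {"id": "t2", "grid": "A10"}, {"id": "t1", "grid": "A2"}]
by_grid = [
    d["id"]
    for d in sort_series(tiles, "grid", key=lambda code: (code[0], int(code[1:])))
]
```

The numeric and text results differ for the same data (`part-10` sorts last numerically but second as text), which is why the sort type would need to be declared per series rather than guessed.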

If we are not tracking order separately, would you call these sorting algorithms at index time to calculate the first, last, prev and next fields? I guess this would mean re-indexing the whole series when a member is updated. On the other hand, if we compute them at view time (i.e. on the dataset page or when generating a DCAT representation), that might affect performance.

wardi · 2024-09-10T12:29:00Z

I like it. Having time sort, numeric sort, text sort as generic types would be more flexible.

We could have an index on the order fields in the search back end. That way the search can give us all the datasets in a series in order without needing to re-index them.
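For illustration, this is roughly what the query side could look like: a helper that builds parameters for `package_search` (`fq`, `sort` and `rows` are existing parameters of that action). The helper itself is hypothetical, and it assumes the `in_series` field and the order field are indexed under those exact names, which depends on how the schema fields end up in the search index.

```python
def series_search_params(series_id, order_field, rows=100):
    """Build package_search parameters listing a series' members in order.

    Assumes `in_series` and `order_field` are searchable and sortable
    under these names in the index.
    """
    return {
        "fq": 'in_series:"{}"'.format(series_id),
        "sort": "{} asc".format(order_field),
        "rows": rows,
    }


params = series_search_params("budget", "release_date")
```

The resulting dict could be passed to `package_search` through the action API; prev and next would then be the neighbours of the current dataset in the returned list.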

wardi · 2024-09-10T12:57:58Z

Regarding re-indexing a whole series, were you thinking of the case where we've assigned an integer index and need to insert a dataset in the series and bump all the later numbers?

Maybe we could store a float value instead, insert at the half-way value, and somehow display integer indexes from the indexed search result.
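A quick sketch of the float idea, with hypothetical function names: a new dataset gets the midpoint between its two neighbours, and the integer index shown to users is derived from the sorted positions rather than stored.

```python
def position_between(before=None, after=None):
    """Float position between two neighbours (None means open-ended)."""
    if before is None and after is None:
        return 0.0
    if before is None:
        return after - 1.0
    if after is None:
        return before + 1.0
    return (before + after) / 2.0


def display_indexes(positions):
    """Map dataset id -> 1-based integer index, derived from float positions."""
    ordered = sorted(positions, key=positions.get)
    return {dataset_id: i for i, dataset_id in enumerate(ordered, start=1)}


positions = {"budget-2018": 1.0, "budget-2020": 2.0}
# Insert 2019 between its neighbours without touching their stored values.
positions["budget-2019"] = position_between(
    positions["budget-2018"], positions["budget-2020"]
)
indexes = display_indexes(positions)
```

One caveat: repeated insertions into the same gap eventually run out of float precision, so an occasional renumbering of the series would still be needed.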

amercader added Enhancement Discussion labels Sep 2, 2024
