Address RDA Working Group on Dynamic Data Citation (WGDC) recommendations #266

pitkant · 2023-06-14T12:01:59Z

Executive summary:

Could the pxweb package include functions for calculating hashes of queries and downloaded datasets?
Beyond that, acknowledging the WGDC recommendations could help in setting long-term goals for further package development.

Background information:

The Research Data Alliance Data Citation WG has listed 14 recommendations on reproducibly subsetting datasets and on how to cite, share and re-use these subsets:

2 page summary: https://zenodo.org/record/1406002
more extensive report: https://zenodo.org/record/4048304

Data retrieved from PxWeb APIs may not be as dynamic as other kinds of data, but it still changes occasionally (see the stat.fi news page). The WGDC report contains some nice recommendations that could at least be acknowledged and, if possible, also implemented.

Here is a list of the recommendations:

| Task | Status | Viability |
|---|---|---|
| R1 Data Versioning | | Data versioning is not supported by PxWeb |
| R2 Timestamping | | Timestamping dataset changes so that past datasets could be queried is not supported by PxWeb |
| R3 Query Store Facilities | | Some PxWeb database websites have a "Save your query" menu, but it does not include all the data that WGDC recommends |
| R4 Query Uniqueness | | pxweb_interactive() constructs queries in a normalised form; an MD5 hash of the query could also be calculated |
| R5 Stable Sorting | | Dataset sorting is determined by the sorting of the raw data on the server |
| R6 Result Set Verification | | Fixity key for downloaded datasets; could be done with digest(dataset, algo = "md5") |
| R7 Query Timestamping | Done | Could also refer to the dataset's date of last update |
| R8 Query PID | | Assign a DOI, ARK, or similar PID to a unique query |
| R9 Store the Query | Done | Refers to the R3 "facilities", but the query is printed by pxweb and can be put into article appendices |
| R10 Automated Citation Texts | Done | |
| R11 Landing Page | | The citation currently links to the .px dataset; proper landing pages with documentation might not be available for all databases. Stat.fi has a "Statistics homepage" for most (all?) datasets / topics |
| R12 Machine Actionability | | Link to a metadata landing page or JSON file |
| R13 Technology Migration | | Responsibility of API / db maintainers |
| R14 Migration Verification | | Compare fixity (hash) information of queries and outputs and see if they are identical |

Recommendations are grouped as follows: R1-3 "Preparing the Data and the Query Store", R4-10 "Persistently Identifying Specific Data Sets", R11-12 "Resolving PIDs and Retrieving the Data" and R13-14 "Upon modifications to the Data Infrastructure".

In my opinion, it would be especially interesting to calculate hashes of the query and of the downloaded dataset (R4, R6) and to store them somewhere alongside the other citation data.
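A minimal sketch of what R4 and R6 could look like in R. The helper names `md5_of_lines()`, `query_fixity()` and `dataset_fixity()` are hypothetical and not part of pxweb. The query is represented as a plain named list rather than a real pxweb query object, sorting the selected values is an assumed normalisation rule, and the data frame is toy data. Only base R is used here (`tools::md5sum()` on a text rendering); `digest::digest()` would be an alternative.

```r
# Hypothetical helpers, not part of the pxweb package.

# MD5 of a character vector, written to a temporary file so that only
# base R (the bundled 'tools' package) is needed.
md5_of_lines <- function(lines) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  con <- file(tmp, open = "wb")  # binary mode: "\n" line ends on every platform
  writeLines(enc2utf8(lines), con, sep = "\n", useBytes = TRUE)
  close(con)
  unname(tools::md5sum(tmp))
}

# R4: normalise a query (assumed shape: named list of variable -> values)
# by sorting variables and values in the C locale, then hash it.
query_fixity <- function(query) {
  query <- query[order(names(query), method = "radix")]
  lines <- vapply(
    names(query),
    function(nm) {
      values <- sort(as.character(query[[nm]]), method = "radix")
      paste0(nm, "=", paste(values, collapse = ","))
    },
    character(1)
  )
  md5_of_lines(lines)
}

# R6: hash a tab-separated text rendering of the downloaded data frame.
dataset_fixity <- function(df) {
  header <- paste(names(df), collapse = "\t")
  rows <- do.call(paste, c(unname(as.list(df)), sep = "\t"))
  md5_of_lines(c(header, rows))
}

# Toy inputs; variable names and values are illustrative only.
query <- list(Year = c("2021", "2020"), Area = "SSS")
same_query <- list(Area = "SSS", Year = c("2020", "2021"))
df <- data.frame(Year = c("2020", "2021"), value = c(10, 12))

query_fixity(query) == query_fixity(same_query)  # TRUE: same normalised form
dataset_fixity(df)                               # 32-character hex string
```

Hashing a text rendering rather than the serialised R object keeps the fixity key independent of the R version, at the cost of having to fix a number and text formatting convention.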

Additionally, R12 could be somewhat achieved by changing the URL in the following citation

  @Misc{,
    title = {Foreign languages selected by upper secondary level students by Year, Area, Gender, Level of education and Information},
    author = {{Statistics Finland}},
    organization = {Statistics Finland},
    address = {Helsinki, Finland},
    year = {2023},
    url = {https://statfin.stat.fi/PXWeb/api/v1/en/StatFin/ava/statfin_ava_pxt_12ad.px},
    note = {[Data accessed 2023-06-14 14:20:20.456548 using pxweb R package 0.16.3]},
  }

to simply https://stat.fi/en/statistics/ava, which is the closest equivalent to a landing page. I'm not sure if this URL is accessible from the API, but it is at least listed in a separate CSV file: https://statfin.stat.fi/database/StatFin/StatFin_rap.csv
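As a rough illustration of the URL swap described above, the `url` field of a citation held as a character vector could be rewritten like this. `set_bib_url()` is a hypothetical helper, not a pxweb function, and the entry is a shortened copy of the citation above. How the landing page URL would be looked up (for example from the CSV file) is left open.

```r
# Shortened copy of the citation above, one BibTeX line per element.
bib <- c(
  "@Misc{,",
  "  author = {{Statistics Finland}},",
  "  year = {2023},",
  "  url = {https://statfin.stat.fi/PXWeb/api/v1/en/StatFin/ava/statfin_ava_pxt_12ad.px},",
  "}"
)

# Hypothetical helper: replace the url field of a BibTeX entry.
set_bib_url <- function(bib, url) {
  is_url <- grepl("^\\s*url\\s*=", bib)
  bib[is_url] <- paste0("  url = {", url, "},")
  bib
}

new_bib <- set_bib_url(bib, "https://stat.fi/en/statistics/ava")
writeLines(new_bib)
```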

R4 and R5 are kind of done if you use pxweb_interactive(), as the order in which items are printed is deterministic. If the order of the query printout or of the dataset items changes in any way, the MD5 hashes change as well.
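The order sensitivity can be shown with a small base R sketch. The helpers `df_md5()` and `stable_sort()` are hypothetical, the data frame is toy data, and sorting by the dimension columns before hashing is only one assumed way to approach R5.

```r
# Toy data; the column names are illustrative only.
df <- data.frame(
  Year  = c("2021", "2020", "2021", "2020"),
  Area  = c("B", "A", "A", "B"),
  value = c(4, 1, 3, 2),
  stringsAsFactors = FALSE
)

# MD5 of a tab-separated text rendering of a data frame (base R only).
df_md5 <- function(df) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  lines <- c(
    paste(names(df), collapse = "\t"),
    do.call(paste, c(unname(as.list(df)), sep = "\t"))
  )
  con <- file(tmp, open = "wb")
  writeLines(lines, con, sep = "\n")
  close(con)
  unname(tools::md5sum(tmp))
}

# Sort rows by the given key columns, using C-locale (radix) ordering so
# that the result does not depend on the user's locale.
stable_sort <- function(df, keys) {
  ord <- do.call(order, c(unname(as.list(df[keys])), method = "radix"))
  df[ord, , drop = FALSE]
}

shuffled <- df[c(3, 1, 4, 2), ]
keys <- c("Year", "Area")

df_md5(df) == df_md5(shuffled)                                        # FALSE
df_md5(stable_sort(df, keys)) == df_md5(stable_sort(shuffled, keys))  # TRUE
```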

The recommendations are, I think, most useful for PxWeb database maintainers and PxWeb developers at SCB, but we could do our part by thinking about solutions to them.


MansMeg · 2023-06-15T05:35:09Z

These are really good ideas!

Landing pages are good, but we should request that as a feature from the PxWeb people because it is not part of the API. We want to avoid handling information specific to individual APIs. Long term, I think we should probably remove the API catalogue and just refer to the pxweb list of available APIs.

pitkant · 2023-06-16T09:28:51Z

By "pxweb list of available APIs", do you mean this list: https://www.scb.se/en/services/statistical-programs-for-px-files/px-web/pxweb-examples/ ?

As I mentioned in #254, there are some broken APIs listed there (Taiwan, Örebro kommun), and several APIs are not listed there at all. Therefore SCB's list does not seem to be definitive.

I compared the same .px and .json files downloaded from the stat.fi example page and noticed that the .px files actually include more metadata than the .json files. An example of this is the statistics homepage:

CONTACT[en]("Enterprise openings (No.)")="<A HREF='https://stat.fi/en/statistics/aly' TARGET=_blank>Statistics' homepage</A>";

and a note that may or may not be of interest to the data user:

NOTE[en]="<A HREF='https://stat.fi/en/statistics/documentation/aly' TARGET=_blank>Documentation of statistics</A>##.. not applicable#.. not applicable###Due to a methodological change in the source data, the number of enterprise closures and the size of the stock "
"of enterprises have not been published for the last three quarters of 2017. # #The stock of enterprises cannot be aggregated over time periods.";

which is essentially the same information that is displayed on the PxWeb database web interface.

The .px file format seems relatively simple and probably easy to parse, especially if it is only used to extract certain types of metadata that are not included in the .json files. While this has traditionally been out of scope for this package, I think the ability to download more metadata in the form of .px files would be useful. Additionally, there are some reports of JSON-stat / JSON-stat 2 output being erroneous compared to .px output (statisticssweden/PxWeb#387).
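To give an idea of how little parsing is needed for this kind of metadata, here is a base R sketch that pulls link targets out of a .px line. `px_extract_links()` is a hypothetical helper, not part of pxweb, and it only looks for single-quoted HREF attributes as in the CONTACT excerpt above; it is not a general .px parser.

```r
# One line from a .px file, copied from the CONTACT excerpt above.
px_line <- "CONTACT[en](\"Enterprise openings (No.)\")=\"<A HREF='https://stat.fi/en/statistics/aly' TARGET=_blank>Statistics' homepage</A>\";"

# Hypothetical helper: return all single-quoted HREF targets in .px lines.
px_extract_links <- function(lines) {
  matches <- regmatches(lines, gregexpr("HREF='[^']*'", lines, ignore.case = TRUE))
  links <- unlist(matches)
  # Strip the HREF='...' wrapper, keeping only the URL.
  sub("^HREF='([^']*)'$", "\\1", links, ignore.case = TRUE)
}

px_extract_links(px_line)
# "https://stat.fi/en/statistics/aly"
```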

The JSON-stat format allows for an extension property that can contain anything and, interestingly enough, at least the stat.fi JSON file has several extension properties. It could also be used for storing a link to the statistics documentation (landing page) and any notes related to the dataset.

EDIT: Actually it seems that PxApi 2.0 is coming out (at least to beta testing) in Autumn 2023 so maybe some of these changes will be implemented then: https://www.scb.se/en/services/open-data-api/pxapi-2.0/

pitkant added the enhancement label Jun 14, 2023

pitkant mentioned this issue Aug 1, 2023

PxWeb and Data Citation Principles statisticssweden/PxWeb#480

Open
