Bugfix invalid ci metadata #127

Merged
merged 61 commits into from
May 1, 2024

piconti commented Apr 30, 2024

Overall Description

This pull request bundles a substantial number of patches and small fixes, as well as some additions, which were implemented between October 2023 and March 2024.

In particular, the modifications in this branch aim to solve long-standing problems with the IIIF links of image content-items in Issues, which had been identified in issues #103, #104 and #105.
On closer inspection, the situation turned out to vary significantly from importer to importer, as described in issue #117, which aims to address all of these issues together.
In addition, issue #74 about the wrong ordering of content-items was also targeted during this patching.

As part of the upcoming release (which corrects the old data and adds newly obtained data), all the necessary patches and corrections were aggregated in this Google Sheet document, and some additional issues (#20, #126) ended up being opened or addressed through this patching.
When data was patched rather than re-ingested, the corresponding scripts or notebooks were also added for traceability.

Finally, since a data versioning approach was implemented before the updated data was generated, it was integrated into the text-importer core logic. This allowed us to track statistics on the generated canonical data and to identify potential problems with the data during re-ingestion (#116).
Note that all the data generated or patched for the upcoming release was produced using this branch and follows the updated JSON schemas described in this pull request. As a result, the changes to be merged here reflect the updates made to the data.

Precise changes and patches

Now for a more precise and exhaustive list of changes and patches:

Changes made to the code of the importers

  1. BNF, BNF-EN, RERO: Correction of the placement of iiif_link (inside m) and c (outside m) for content-items of type image in issues
  2. BNF, BNF-EN, RERO, BNL: Correction of the value of iiif_link or c for content-items of type image in issues, one of:
  3. SWA, BNF, BNF-EN, FedGaz: Addition of the newly defined iiif_manifest_uri property to issues when the institution provides an IIIF Presentation API
  4. All importers: Replacement of the iiif property by iiif_img_base_uri in pages, with values adapted to contain only the URI base (excluding info.json or [...]/default.jpg suffixes)
  5. BNF, BNF-EN, BNL (Lux), RERO (2 & 3): Addition of the ro (reading order) property to the metadata (m) of the content-items in issues to improve the Table of Contents display on the interface.
  6. generic_importer.py and core.py: Integration of the manifest instantiation and computation. Now, whenever data is ingested with the text-importer, a corresponding data manifest is generated along with it and uploaded to the corresponding S3 bucket.
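
To illustrate change 4, here is a minimal sketch of how a page's iiif value could be reduced to its URI base. It is not the importers' actual code: the function names `to_img_base_uri` and `migrate_page` are hypothetical, and the sketch assumes that image-request URLs follow the IIIF Image API layout `{region}/{size}/{rotation}/{quality}.{format}`.

```python
def to_img_base_uri(iiif_url: str) -> str:
    """Return the IIIF image base URI for an info.json or image-request URL.

    Hypothetical helper: assumes the standard IIIF Image API URL layout.
    """
    url = iiif_url.rstrip("/")
    if url.endswith("/info.json"):
        # {base}/info.json -> {base}
        return url[: -len("/info.json")]
    if url.endswith("/default.jpg"):
        # {base}/{region}/{size}/{rotation}/default.jpg -> {base}
        # (drop the four trailing path segments)
        return url.rsplit("/", 4)[0]
    # Already a base URI (or an unrecognised form): leave unchanged.
    return url


def migrate_page(page: dict) -> dict:
    """Return a copy of a page dict with `iiif` replaced by `iiif_img_base_uri`."""
    migrated = dict(page)
    if "iiif" in migrated:
        migrated["iiif_img_base_uri"] = to_img_base_uri(migrated.pop("iiif"))
    return migrated
```

URLs that match neither suffix are returned unchanged, so pages that already hold a base URI are not altered by the migration.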

Patches implemented as scripts or added to correct existing problems

  1. RERO1 - Olive: Correction of the coordinates by means of a rescaling, as described here
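
The exact rescaling procedure is documented in the linked description and is not reproduced here. As a generic sketch only, rescaling a box's coordinates between two image resolutions could look like the following, assuming coordinates are stored as `[x, y, width, height]` and that the source and target image sizes are known. The function name and parameters are hypothetical.

```python
def rescale_coords(coords, src_size, dst_size):
    """Rescale an [x, y, w, h] box from src_size to dst_size.

    Hypothetical helper: src_size and dst_size are (width, height) tuples of
    the image the coordinates refer to and of the image they should match.
    """
    x, y, w, h = coords
    scale_x = dst_size[0] / src_size[0]
    scale_y = dst_size[1] / src_size[1]
    # Horizontal values scale with the width ratio, vertical with the height ratio.
    return [round(x * scale_x), round(y * scale_y),
            round(w * scale_x), round(h * scale_y)]
```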

Additionally, note that patch 4 (content-item - article matching for BNL) could not be handled as part of this PR as it would probably require a substantial rethinking of the importer's logic. This will be tackled in a future PR once the BNL data has arrived.

Based on all these changes, the version was updated to 1.1.0.
This PR closes issues: #103, #104, #105, #117, #74, #20, #116

piconti and others added 30 commits October 18, 2023 17:17

piconti commented May 1, 2024

To build the docs correctly with autodoc, the data versioning branch of impresso-commons had to be specified explicitly in requirements.txt. This should be removed and changed back as soon as the branch is merged into master in the impresso-pycommons repository.

piconti merged commit 213e304 into master on May 1, 2024
1 check passed
piconti deleted the bugfix-invalid-ci-metadata branch on August 22, 2024 16:11