-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix invalid ci metadata #127
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…sso/impresso-text-acquisition into bugfix-invalid-ci-metadata
…sso/impresso-text-acquisition into bugfix-invalid-ci-metadata
…sso/impresso-text-acquisition into bugfix-invalid-ci-metadata
This was
linked to
issues
Apr 30, 2024
To build the docs correctly with autodoc, the data versioning branch impresso-commons had to be specifically provided in the requirements.txt . This should be removed and rechanged as soon as the branch is merged into master in the impresso-pycommons repository. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Overall Description
This pull-request represents quite a substantial amount of patches and small fixes, as well as some additions which were implemented during October 2023 and March 2024.
In particular, the modifications brought by this branch aim to solve long-lasting issues with the IIIF links of image content-items in Issues which had been identified, as part of issues #103, #104 and #105 .
Upon closer inspection of the situation concerning these issues, it was found that the situation varied significantly from importer to importer, as described in issue #117 which aims at addressing all these issues together.
In addition, issue #74 about the wrong ordering of content-items was also targetted during this patching.
As part of the new and upcoming release (correcting the old data and adding newly obtained data), all the necessary patches and corrections to do were aggregated in this google sheet document, and some additional issues (#20, #126) ended up being opened or addressed through this patching.
When data was simply patched instead of being re-ingested, the corresponding scripts or notebooks were also added for traceability.
Finally, since a data versioning approach was implemented before the generation of the updated data, it was integrated into the text-importer core logic. This allowed us to track statistics on the generated canonical data and identify potential problems with the data during re-ingestion #116.
Note that all the data generated or patched for the next and upcoming release was done using this branch, and follow the updated JSON schemas as described in this pull-request. As a result the changes to be merged here reflect the updates made to the data.
Precise changes and patches
Now for a more precise and exhaustive list of changes and patches:
Changes made to the code of the importers
iiif_link
(insidem
) andc
(outsidem
) for content-items of type image in issuesiiif_link
orc
for content-items of type image in issues, one of:iiif_manifest_uri
property to issues when the institution provides a IIIF presentation APIiiif
property byiiif_img_base_uri
in pages, and adapting their values to only be the URI base (excludinginfo.json
or[...]/default.jpg
suffixes)ro
(reading order) property to the metadata (m
) of the content-items in issues to improve the Table of Contents display on the interface.generic_importer.py
andcore.py
: Integration of the manifest instantiation and computation. Now whenever data ingested with the text-importer, a corresponding data manifest will be generated along with it an uploaded to the corresponding S3 bucket.Patches implemented as scripts or added to correct existing problems
LES
andEXP
faced other more complex problems interacting with the wrong coordinates, which will be fixed at another time)Additionally, note that patch 4 (content-item - article matching for BNL) could not be handled as part of this PR as it would probably require a substantial rethinking of the importer's logic. This will be tackled in a future PR once the BNL data has arrived.
Based on all these changes, the version was updated to 1.1.0.
This PR closes issues: #103, #104, #105, #117, #74, #20, #116