RERO 1 (Olive) - Incorrect coordinates to rescale #126

Open
piconti opened this issue Mar 20, 2024 · 2 comments


piconti commented Mar 20, 2024

This issue is part of the various patches planned and done as part of the March-April 2024 release.
More info on the patches can be found here, in issue #117, issue #74 and here.

For the RERO 1 (Olive) data, it has been found that a number of issues presented wrongly scaled coordinates, as described here.
Upon a closer look, it was identified that the problem originated during the conversion of the image files to JPEG 2000 (jp2).
Based on the available images, several strategies existed, among them the 'png_highest' strategy, which consisted of selecting the image with the highest resolution among the available options, with the resolution given in the filename (e.g. ['1/Img/Pg001.png', '1/Img/Pg001_157.png', '1/Img/Pg001_180.png', '2/Img/Pg002.png', '2/Img/Pg002_157.png', '2/Img/Pg002_180.png'] for LCS-1830-08-02-a).
Unfortunately, in some cases the selected image was NOT the one with the highest resolution, leading to wrongly scaled coordinates.
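
To illustrate the intent of the 'png_highest' strategy, here is a minimal sketch that picks, for each page, the file with the largest resolution suffix. The function name, the regular expression and the choice to skip files without a suffix are assumptions made for illustration; this is not the code used in the actual conversion.

```python
import re

# Assumed filename pattern: ".../Pg<page number>[_<resolution>].<extension>"
_NAME = re.compile(r"Pg(?P<page>\d+)(?:_(?P<res>\d+))?\.\w+$")


def highest_resolution_per_page(paths):
    """Return {page number: (resolution, path)} for the largest suffix seen.

    Files without a resolution suffix are skipped (an assumption: their
    resolution cannot be read from the filename).
    """
    best = {}
    for path in paths:
        match = _NAME.search(path)
        if match is None or match.group("res") is None:
            continue
        page = int(match.group("page"))
        res = int(match.group("res"))  # compare as integers, not as strings
        if page not in best or res > best[page][0]:
            best[page] = (res, path)
    return best


paths = ['1/Img/Pg001.png', '1/Img/Pg001_157.png', '1/Img/Pg001_180.png',
         '2/Img/Pg002.png', '2/Img/Pg002_157.png', '2/Img/Pg002_180.png']
best = highest_resolution_per_page(paths)
print(best[1])  # (180, '1/Img/Pg001_180.png')
```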

The chosen fixing approach was thus to:

  • Identify all the issues for which the source file used (as described in the [issue-id]-image-info.json file created during scaling) was NOT the highest-resolution one available (in the corresponding issue's Document.zip archive, which contains all the images used as source).
  • Rescale them accordingly by a factor (dest_res/curr_res), where dest_res was the smaller resolution (used to create the jp2 files) and curr_res was the largest one available.
    • Note: the naming here refers to the current scaling that has been applied in the code; since the jp2 images will NOT be recreated, the current coordinates are scaled according to the largest available resolution (curr_res), which is not the one used to generate the jp2 image, and according to which they need to be rescaled (dest_res).
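
The rescaling step can be sketched as follows. The [x, y, width, height] box layout, the rounding to integers and the function name are assumptions made for illustration; only the factor dest_res/curr_res comes from the description above.

```python
def rescale_box(box, curr_res, dest_res):
    """Scale a box by dest_res / curr_res.

    `box` is assumed to be [x, y, width, height]; values are rounded to
    integers, which is also an assumption about the target format.
    """
    if curr_res <= 0 or dest_res <= 0:
        raise ValueError("resolutions must be positive")
    # Multiply before dividing to limit floating-point error.
    return [round(value * dest_res / curr_res) for value in box]


# Coordinates currently scaled to the largest resolution (curr_res) are
# brought back to the resolution used for the jp2 image (dest_res).
rescaled = rescale_box([100, 200, 50, 20], curr_res=200, dest_res=100)
print(rescaled)  # [50, 100, 25, 10]
```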

However, upon implementation of this approach it has been found that:

  • Not all issues have a (non-empty) [issue-id]-image-info.json, detailing which image resolution was actually used.
  • Issues missing this file can have either correct or incorrect coordinates, and it is not immediately clear how to identify which issues actually need to have their coordinates rescaled.

Currently, the titles for which we know that coordinates need rescaling are:

  • DLE, EXP, LBP, LES, LTF, LCG.

The titles which have issues with missing information that might or might not need rescaling are:

  • DLE, EXP, LBP, LES, LTF, LCG, LNF, LSE, LCR, LCS, JDF, LVE.

Note that for the titles for which we do not know for sure (those not in the first list above), no example of an issue requiring rescaling has been found yet. However, examples of issues that do not need coordinate rescaling have been found among the titles in the first list.

Investigation into which rescaling could be applied in the uncertain cases is ongoing.

@piconti piconti self-assigned this Mar 20, 2024

piconti commented Mar 20, 2024

After further investigation, the following conclusions have been reached:
Each of the titles with known coordinate issues will be patched, whereas the others will not.

For each title, the issues were separated into two groups: issues to be rescaled, and issues to investigate further.
All the issues in the "issues to be rescaled" group will be patched, because we have enough information to fix them correctly.
For the issues to investigate further, a solution was identified on a title-by-title basis.

Based on each title's problems the following issues (originally part of issues to investigate further) will be patched:

  • DLE, LBP and LTF: All coordinates in all issues and pages will be rescaled.
  • LES: Issues for which .jpg files were present in the Document.zip archive will be rescaled.
  • LCG: Issues from 1892 onwards will be rescaled.
  • EXP: Issues between 1902 and 1910 (both included) will be rescaled.

In addition, another problem has been identified that concerns several titles, but especially EXP. As a result, rescaling the coordinates for more recent EXP issues would not help fix the coordinate problems present.
Further investigation will be required to fix this issue, and it will be described in more detail here in the future. It seems that the display problems vary between the facsimile and article views, so the origin of the problem needs to be identified first.


piconti commented Mar 22, 2024

After further discussion, and identifying also significant issues in the LES data, it has been decided that EXP and LES would not be patched along with the other titles.
This is due to multiple factors:

  • The years that could be patched represent a small fraction of the title's data
  • The data for these titles contain other, quite significant errors that may be intertwined with this scaling problem, and that also (among other things) have an impact on coordinates. Rescaling the coordinates as part of this patch would certainly not fix all problems and could create others.
    • The problem is particularly hard to locate because the article view sometimes shows correct image croppings.
  • We do not currently have the time or capacity to identify and fix these issues completely before the next release.
  • It is probable that a reingestion would be required for these titles.
  • Once the code for patching is written, nothing prevents us from re-running it for these issues/titles if need be.

However, for LES, it has also been identified that the full text of some articles is missing all its spaces.
Upon brief inspection, it appears that this is due to the canonical field "gn" (glue next [without a space]) being True for all tokens for a small number of years (seemingly starting in 2006). This was most probably the case because the OCR for some articles created tokens at the character level instead of the word level.
Both situations (character and word tokens) can be found within a single issue.
Since having some correctly spaced articles is better than having none, the rebuilt data for these years (2006 and onwards) will probably be recomputed.
This would replace the current situation (all articles having no spaces at all) with one where some articles are normal, and others have spaces between each character.
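
The effect of the "gn" flag on text rebuilding can be sketched as follows. The token layout (a dict with a "tx" text key next to "gn") and the `use_gn` switch are assumptions made for illustration, and ignoring the flag is only one possible way the recomputation could work.

```python
def rebuild_text(tokens, use_gn=True):
    """Join token texts, adding a space after a token unless it is glued.

    Each token is assumed to be a dict with a "tx" (text) key and an
    optional "gn" (glue next) key. With use_gn=False the flag is ignored
    and every token is followed by a space.
    """
    parts = []
    for token in tokens:
        parts.append(token["tx"])
        glued = use_gn and token.get("gn", False)
        if not glued:
            parts.append(" ")
    return "".join(parts).strip()


# Word-level and character-level tokens, all wrongly marked gn=True.
word_tokens = [{"tx": "Le", "gn": True}, {"tx": "journal", "gn": True}]
char_tokens = [{"tx": char, "gn": True} for char in "Lejournal"]

print(rebuild_text(word_tokens))                # Lejournal
print(rebuild_text(word_tokens, use_gn=False))  # Le journal
print(rebuild_text(char_tokens, use_gn=False))  # L e j o u r n a l
```

With the flag ignored, word-level articles regain their spaces, while character-level articles end up with a space between each character, which matches the trade-off described above.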

@piconti piconti linked a pull request Apr 30, 2024 that will close this issue
@piconti piconti removed a link to a pull request Apr 30, 2024
@piconti piconti pinned this issue May 1, 2024