├── README.md
├── logs
├── reports => report files for text / images
├── sanity_check => scripts for text / images
├── setup.py
├── test-data
└── test_sanity_check => unit tests for text / images
External factors:
- the journal ceased publication for some years/months/days
- publication periodicity
Preservation factors:
- paper copies were lost and never digitized
- digitization happened but digital copies were lost or damaged.
Transparency about cases 1 and 2 (the external factors) requires external knowledge, which should be encoded as metadata. The same holds for case 3 (lost paper copies), although this information is not always reliably recorded. Case 4 (lost or damaged digital copies) can be detected via quality control of the digitization process, or simply through usage of the data.
Text/Images
- for each text page, there should be a corresponding image
- for each article, there should be text
Olive
- check the presence/absence of issue pdfs
- check the presence/absence of page pdfs
- check presence of .zip per issue
- if there are no tifs, check presence of png and/or jpg
In the DB, store information so as to be able to (among others):
- detect missing/absent years of a journal
- detect missing/absent months of a year
- detect missing/absent days of a month
- detect missing image for a page
- detect missing article? => Is this possible?
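As an illustration of the gap detection listed above, here is a minimal sketch that works on a plain list of issue dates. The function names and the input format are assumptions made for this example, not the project's actual DB schema. Note that a missing day is not necessarily a problem: it has to be interpreted against the publication periodicity of the journal.

```python
import calendar
from datetime import date


def missing_years(issue_dates):
    """Years between the first and last issue for which no issue exists."""
    years = {d.year for d in issue_dates}
    if not years:
        return []
    return [y for y in range(min(years), max(years) + 1) if y not in years]


def missing_months(issue_dates, year):
    """Months (1-12) of `year` for which no issue exists."""
    months = {d.month for d in issue_dates if d.year == year}
    return [m for m in range(1, 13) if m not in months]


def missing_days(issue_dates, year, month):
    """Days of `year`-`month` for which no issue exists."""
    days = {d.day for d in issue_dates if d.year == year and d.month == month}
    n_days = calendar.monthrange(year, month)[1]  # number of days in the month
    return [d for d in range(1, n_days + 1) if d not in days]


# Hypothetical issue dates, for illustration only.
dates = [date(1900, 1, 1), date(1900, 3, 5), date(1902, 6, 1)]
print(missing_years(dates))
print(missing_months(dates, 1900))
```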
Data to cross-check:
- canonical json (S3)
- images (NAS)
Sanity check strategy:
- isolated checks for 1) text and 2) images, w.r.t. original data. To be run separately.
- cross-check on generated canonical data, assuming that 1) was fine.
Isolated check on images
- detect corrupted archives
- overview of image format per journal, counts based on issues.
- per journal, total sum of jp2 file size
- per journal, total sum of all images (?)
Goes in the DB:
- in the journal table, how many images come from tif/png/jpg, plus the corresponding percentages
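The per-format counts and percentages could be computed along these lines. This is a sketch only: it counts individual files rather than issues, and the function name, the extension normalisation and the input (a flat list of file names for one journal) are assumptions of this example.

```python
from collections import Counter
from pathlib import PurePath

# Treat alternative spellings of an extension as the same format (assumption).
ALIASES = {"tiff": "tif", "jpeg": "jpg"}


def format_overview(filenames):
    """Map each image format to (count, percentage of all files)."""
    counts = Counter()
    for name in filenames:
        ext = PurePath(name).suffix.lower().lstrip(".")
        counts[ALIASES.get(ext, ext)] += 1
    total = sum(counts.values())
    return {fmt: (n, round(100 * n / total, 1)) for fmt, n in counts.items()}


# Hypothetical file names, for illustration only.
print(format_overview(["p1.tif", "p2.TIFF", "p3.png", "p4.jpg"]))
```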
Isolated check on text
- check pages without OCR (empty regions)
- corrupted archives
- per journal, total size of json files
- checking the xml for the articles
Goes in the DB:
- number of corrupted archives (issues) per journal
- number of pages without OCR per journal
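Detecting a corrupted archive can be sketched with the standard library's `zipfile` module, assuming that each issue is delivered as a `.zip` file. The helper name `is_corrupted` is hypothetical, and the actual scripts in `sanity_check` may proceed differently.

```python
import zipfile


def is_corrupted(path):
    """True if `path` cannot be opened as a zip or a member fails its CRC check.

    Note: an unreadable or missing file is also reported as corrupted.
    """
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the name of the first bad member, or None.
            return zf.testzip() is not None
    except (zipfile.BadZipFile, OSError):
        return True
```

Counting the archives for which this returns `True`, grouped by journal, gives the number to store in the DB.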
Cross-check on file system:
- Granularity level: issue (canonical id)
- What needs to be checked:
    - number of JSON pages = number of JP2 images
- Output: CSV with everything
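A minimal sketch of this cross-check, assuming the page and image counts have already been collected into two dictionaries keyed by canonical issue id. The function name, the CSV columns and the example ids are illustrative assumptions, not the format produced by the actual scripts.

```python
import csv
import io


def cross_check(json_pages, jp2_images, out):
    """Write one CSV row per issue to `out`; return the number of mismatches."""
    writer = csv.writer(out)
    writer.writerow(["issue_id", "n_json_pages", "n_jp2_images", "match"])
    mismatches = 0
    # Issues present on only one side get a count of 0 on the other.
    for issue_id in sorted(set(json_pages) | set(jp2_images)):
        n_json = json_pages.get(issue_id, 0)
        n_jp2 = jp2_images.get(issue_id, 0)
        match = n_json == n_jp2
        if not match:
            mismatches += 1
        writer.writerow([issue_id, n_json, n_jp2, match])
    return mismatches


# Hypothetical counts, for illustration only.
buffer = io.StringIO()
n_bad = cross_check(
    {"EXP-1900-01-02-a": 4, "EXP-1900-01-03-a": 6},
    {"EXP-1900-01-02-a": 4, "EXP-1900-01-03-a": 5},
    buffer,
)
print(n_bad)
print(buffer.getvalue())
```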
Sanity check on the DB
canonical_check
check_images.py --original-dir=/mnt/project_impresso/original/RERO --report-dir=../../reports/img_original --newspapers="BDC CDV EDA EXP JDV LCE LES LSR" --command="check_original" --log-file=../../logs/check-original-rero-batch1.log
The 'impresso - Media Monitoring of the Past' project is funded by the Swiss National Science Foundation (SNSF) under grant number CRSII5_173719 (Sinergia program). The project aims at developing tools to process and explore large-scale collections of historical newspapers, and at studying the impact of this new tooling on historical research practices. More information at https://impresso-project.ch.
Copyright (C) 2020 The impresso team (contributors to this program: Maud Ehrmann, Matteo Romanello).
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU Affero General Public License for more details.