How to create ALTO for DFG Viewer
This example workflow from UB Mannheim uses OCR-D processors to modify an existing METS file in order to make new OCR results available for the DFG Viewer.
The ALTO files were created from PAGE XML as part of ocrd process:

    ocrd process [...] 'fileformat-transform -I OCR-D-OCR-TESS -O FULLTEXT2 -P from-to "page alto"'
Note: Depending on the OCR-D workflow (e.g. whether or not cropping is used, what hierarchy level the OCR text is annotated on) and preferences (e.g. reading order representation, ALTO version), additional settings for the ALTO converter might be necessary. These can be passed in the additional parameter script-args. For example:

    ocrd-fileformat-transform -I OCR-D-OCR-TESS -O FULLTEXT2 -P from-to "page alto" -P script-args "--no-check-words --dummy-word --no-check-border --reading-order reading-order --textline-order index --trailing-dash-to-hyp --skip-empty-lines"
OCR-D also creates a METS file, but here the original METS file is processed further:
    # Change to directory with new OCR results.
    cd /var/www/html/fileadmin/digi/477429599/ocrd-20200923

    # Copy original METS file.
    cp ../477429599.xml .

    # Remove existing fulltext.
    ocrd workspace -m 477429599.xml remove-group --recursive --force FULLTEXT

    # Add new ALTO files from OCR results in directory FULLTEXT2.
    ocrd workspace -m 477429599.xml bulk-add --regex '^.*/FILE_(?P<pageid>.*)_.*\.xml$' --page-id 'PHYS_{{ pageid }}' --file-grp FULLTEXT --mimetype text/xml --url 'alto/477429599_{{ pageid }}.xml' FULLTEXT2/*.xml

    # Replace local file references for ALTO files by URLs.
    perl -pi -e 's,LOCTYPE.*alto,LOCTYPE="URL" xlink:href="https://digi.bib.uni-mannheim.de/fileadmin/digi/477429599/ocrd-20200923/alto,' 477429599.xml
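The page ID matching and the URL rewrite in the commands above can be tried on sample strings before touching a real METS file. The following is a minimal sketch under stated assumptions: the file name `FULLTEXT2/FILE_0001_FULLTEXT2.xml` and the sample `mets:FLocat` line are hypothetical illustrations, the helper functions `extract_pageid` and `rewrite_flocat` are not part of OCR-D, and `sed` is swapped in for `perl` and for the named-group regex only to keep the dry run dependency-free.

```shell
#!/bin/sh
# Dry run of the two pattern-based steps, using sed on sample strings.
# All sample inputs are assumptions; real names depend on the workflow.

base='https://digi.bib.uni-mannheim.de/fileadmin/digi/477429599/ocrd-20200923'

# Mimic the bulk-add regex '^.*/FILE_(?P<pageid>.*)_.*\.xml$':
# print the part of the file name that would become "pageid".
extract_pageid() {
    printf '%s\n' "$1" | sed -n 's,^.*/FILE_\(.*\)_.*\.xml$,\1,p'
}

# Apply the same substitution as the perl one-liner to a single line.
rewrite_flocat() {
    printf '%s\n' "$1" | sed "s,LOCTYPE.*alto,LOCTYPE=\"URL\" xlink:href=\"$base/alto,"
}

# Hypothetical file name as it might appear in the FULLTEXT2 directory.
path='FULLTEXT2/FILE_0001_FULLTEXT2.xml'
pageid=$(extract_pageid "$path")
printf 'page-id: PHYS_%s\n' "$pageid"
printf 'url:     alto/477429599_%s.xml\n' "$pageid"

# Hypothetical file location entry with a local reference.
sample='<mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="alto/477429599_0001.xml"/>'
rewrite_flocat "$sample"
```

Because `.*` is greedy, the substitution replaces everything from the first `LOCTYPE` up to the last `alto` on a line, so it is worth checking a few real lines of the METS file this way before running the in-place edit.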
The new METS file is available from https://digi.bib.uni-mannheim.de/fileadmin/digi/477429599/ocrd-20200923/477429599.xml and can be shown with the DFG Viewer at https://dfg-viewer.de/show?tx_dlf[id]=https%3A%2F%2Fdigi.bib.uni-mannheim.de%2Ffileadmin%2Fdigi%2F477429599%2Focrd-20200923%2F477429599.xml.
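The DFG Viewer link above is the METS URL, percent-encoded and appended to the `tx_dlf[id]` query parameter. A minimal sketch of building such a link follows; the helper `urlencode` is a hypothetical function that only handles the handful of reserved characters listed in it, which is enough for this METS URL but not a general-purpose encoder.

```shell
#!/bin/sh
# Build a DFG Viewer link for a given METS URL.
# urlencode is a hypothetical helper; it encodes only the characters
# listed below ('%' first, so already inserted escapes are not re-encoded).
urlencode() {
    printf '%s' "$1" | sed \
        -e 's,%,%25,g' \
        -e 's,:,%3A,g' \
        -e 's,/,%2F,g' \
        -e 's,?,%3F,g' \
        -e 's,&,%26,g' \
        -e 's,=,%3D,g' \
        -e 's,#,%23,g' \
        -e 's, ,%20,g'
}

mets_url='https://digi.bib.uni-mannheim.de/fileadmin/digi/477429599/ocrd-20200923/477429599.xml'
viewer_url="https://dfg-viewer.de/show?tx_dlf[id]=$(urlencode "$mets_url")"
printf '%s\n' "$viewer_url"
```

For the METS URL used here, this prints the same viewer link as given above.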
This makes it possible to compare, for example, the ABBYY FineReader OCR for page 55 with the OCR-D Tesseract OCR.