
How to create ALTO for DFG Viewer

Robert Sachunsky edited this page Feb 7, 2022 · 2 revisions

This example workflow from UB Mannheim uses OCR-D processors to modify an existing METS file in order to make new OCR results available for the DFG Viewer.

The ALTO files were created from PAGE XML as part of an ocrd process call:

ocrd process [...] 'fileformat-transform -I OCR-D-OCR-TESS -O FULLTEXT2 -P from-to "page alto"'

Note: Depending on the OCR-D workflow (e.g. whether cropping is used, or which hierarchy level the OCR text is annotated on) and on preferences (e.g. reading order representation, ALTO version), the ALTO converter might need additional settings. These can be passed through the additional parameter script-args. For example:

ocrd-fileformat-transform -I OCR-D-OCR-TESS -O FULLTEXT2 -P from-to "page alto" -P script-args "--no-check-words --dummy-word --no-check-border --reading-order reading-order --textline-order index --trailing-dash-to-hyp --skip-empty-lines"

OCR-D also creates its own METS file, but this workflow updates the original METS file instead:

# Change to directory with new OCR results.
cd /var/www/html/fileadmin/digi/477429599/ocrd-20200923

# Copy original METS file.
cp ../477429599.xml .

# Remove existing fulltext.
ocrd workspace -m 477429599.xml remove-group --recursive --force FULLTEXT

# Add new ALTO files from OCR results in directory FULLTEXT2.
ocrd workspace -m 477429599.xml bulk-add --regex '^.*/FILE_(?P<pageid>.*)_.*\.xml$' --page-id 'PHYS_{{ pageid }}' --file-grp FULLTEXT --mimetype text/xml --url 'alto/477429599_{{ pageid }}.xml' FULLTEXT2/*.xml

# Replace local file references for ALTO files by URLs.
perl -pi -e 's,LOCTYPE.*alto,LOCTYPE="URL" xlink:href="https://digi.bib.uni-mannheim.de/fileadmin/digi/477429599/ocrd-20200923/alto,' 477429599.xml

The new METS file is available from https://digi.bib.uni-mannheim.de/fileadmin/digi/477429599/ocrd-20200923/477429599.xml and can be shown with the DFG Viewer at https://dfg-viewer.de/show?tx_dlf[id]=https%3A%2F%2Fdigi.bib.uni-mannheim.de%2Ffileadmin%2Fdigi%2F477429599%2Focrd-20200923%2F477429599.xml.
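The DFG Viewer link above carries the METS URL as a percent-encoded value of the tx_dlf[id] query parameter. As a rough sketch of how such a link could be built in a script, the helper below (urlencode is a hypothetical name, not part of OCR-D or the DFG Viewer) encodes every character except the RFC 3986 unreserved ones, using only POSIX shell features:

```shell
# Percent-encode everything except RFC 3986 unreserved characters.
# The function body runs in a subshell, so LC_ALL and the loop
# variables do not leak into the calling shell.
urlencode() (
  LC_ALL=C
  rest=$1
  out=''
  while [ -n "$rest" ]; do
    remainder=${rest#?}          # everything after the first character
    c=${rest%"$remainder"}       # the first character itself
    rest=$remainder
    case $c in
      [A-Za-z0-9._~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
  done
  printf '%s\n' "$out"
)

mets_url='https://digi.bib.uni-mannheim.de/fileadmin/digi/477429599/ocrd-20200923/477429599.xml'
viewer_url="https://dfg-viewer.de/show?tx_dlf[id]=$(urlencode "$mets_url")"
printf '%s\n' "$viewer_url"
```

For the METS URL of this example, the sketch reproduces the viewer link shown above.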

The DFG Viewer thus makes it possible to compare, for example, the ABBYY FineReader OCR for page 55 with the OCR-D Tesseract OCR.
