How to import ABBYY-generated ALTO
Elisabeth Engl edited this page Dec 4, 2020 · 2 revisions
To reprocess or post-process (or just evaluate) results from ABBYY in OCR-D, you first need to convert its ALTO output to PAGE. (You can also get AbbyyXml if your license permits it, but this is not covered here.)
You can use ocrd-fileformat-transform for this. It wraps ocr-transform.sh, which in turn includes the prima-page-converter.
However, there are two problems with the ALTO output of ABBYY:
- It does not set /alto/Description/sourceImageInformation/fileName, so the PAGE-XML won't contain any /PcGts/Page/@imageFilename (which makes it impossible to process with OCR-D).
- It has a bug which sets the wrong coordinates in the blocks' and lines' Shape elements. When ABBYY detects a skew, it internally rotates the image, which enlarges the image's pixel dimensions. When it exports the detected segments, it needs to back-transform the coordinates, which it does w.r.t. the angle, but not the extra offset. This affects the coordinates described by Shape elements, but not the bounding box attributes (@HPOS, @VPOS, @WIDTH, @HEIGHT).
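The missing offset in the second problem can be illustrated with a little geometry. The following is only a sketch under an assumed model (rotation about the image center onto an enlarged canvas); it is not taken from ABBYY, and all function names are hypothetical. It shows that undoing the angle alone leaves every point shifted by a constant amount.

```python
import math


def rotated_canvas_size(width, height, angle_deg):
    """Size of the enlarged canvas after rotating a width x height image."""
    a = math.radians(angle_deg)
    c, s = abs(math.cos(a)), abs(math.sin(a))
    return width * c + height * s, width * s + height * c


def rotate(x, y, angle_deg):
    """Rotate a vector around the origin by angle_deg."""
    a = math.radians(angle_deg)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))


def to_deskewed(x, y, width, height, angle_deg):
    """Map a point of the original image onto the rotated, enlarged canvas."""
    new_w, new_h = rotated_canvas_size(width, height, angle_deg)
    rx, ry = rotate(x - width / 2, y - height / 2, angle_deg)
    return rx + new_w / 2, ry + new_h / 2


def to_original(x, y, width, height, angle_deg, compensate_offset=True):
    """Map a point back to the original image.

    With compensate_offset=False only the angle is undone and the shift of
    the canvas center is ignored (assumed model of the faulty export).
    """
    new_w, new_h = rotated_canvas_size(width, height, angle_deg)
    if compensate_offset:
        cx, cy = new_w / 2, new_h / 2
    else:
        cx, cy = width / 2, height / 2
    rx, ry = rotate(x - cx, y - cy, -angle_deg)
    return rx + width / 2, ry + height / 2


if __name__ == "__main__":
    w, h, angle = 2000, 3000, 2.0  # made-up page size and skew angle
    p = to_deskewed(100, 200, w, h, angle)
    print("correct back-transform:", to_original(*p, w, h, angle))
    print("angle only, no offset: ", to_original(*p, w, h, angle, False))
```

In this model the error is the same for every point on the page, and it grows with both the skew angle and the page size.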
One can compensate for these by the following post-processing steps:
- Align both fileGrps, original images and imported PAGE annotations, by their physical page IDs. (The PAGE files will have empty @imageFilename.) For each pair, create an (empty) PAGE for the image and add the segments from the existing PAGE. This can be done via ocrd-segment-replace-page.
- Remove all the Shape elements. They are redundant anyway (as ABBYY does not yield polygons, only bounding boxes).
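If xmlstarlet is not at hand, the removal of the Shape elements could also be sketched in Python with the standard library. The function name strip_shapes_from_alto is hypothetical, and the sketch matches elements by local name so that it does not depend on a particular ALTO namespace version.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return the tag name without its '{namespace}' prefix."""
    return tag.rsplit('}', 1)[-1]


def strip_shapes_from_alto(xml_text):
    """Remove all Shape elements from an ALTO document.

    Returns the modified XML as a string and the number of removed elements.
    Bounding box attributes (HPOS, VPOS, WIDTH, HEIGHT) are left untouched.
    """
    root = ET.fromstring(xml_text)
    # Keep the document's namespace as the default one when serializing.
    if root.tag.startswith('{'):
        ET.register_namespace('', root.tag[1:].split('}', 1)[0])
    removed = 0
    # Snapshot the tree first so that removal does not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) == 'Shape':
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding='unicode'), removed
```

To apply it to downloaded files, read each file of the fileGrp as text, pass it through the function, and write the result back.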
For example, when using the makefileization, the workflow could look like this:
# import from DFGViewer
FULLTEXT:
	ocrd workspace find -G FULLTEXT --download
	xmlstarlet ed --inplace -d //_:Shape FULLTEXT/* # fix 2 (remove Shape elements)
	ocrd workspace find -G ORIGINAL --download
	ocrd workspace prune-files # delete all other files (not downloaded)

# created by side effect above
ORIGINAL: ;

# convert
PAGE: FULLTEXT
PAGE: TOOL = ocrd-fileformat-transform
PAGE: PARAMS = "from-to": "alto page"

# fix 1 (re-attach the images, so that @imageFilename is set)
PAGE2: ORIGINAL PAGE
PAGE2: TOOL = ocrd-segment-replace-page
PAGE2: OPTIONS = -P transform_coordinates false
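
To check whether a converted file is usable, one could inspect /PcGts/Page/@imageFilename in the resulting PAGE-XML. The following is a minimal sketch with a hypothetical helper name. It matches the Page element by local name, so it works across PAGE namespace versions.

```python
import xml.etree.ElementTree as ET


def page_image_filename(xml_text):
    """Return the value of /PcGts/Page/@imageFilename, or None if unset or empty."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Compare by local name, ignoring the '{namespace}' prefix.
        if elem.tag.rsplit('}', 1)[-1] == 'Page':
            return elem.get('imageFilename') or None
    return None
```

A file from the PAGE fileGrp would be expected to yield None here (the problem described above), whereas a file from PAGE2 should yield the path of the original image.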