How to import ABBYY-generated ALTO

Converting ALTO to PAGE

To reprocess, post-process, or simply evaluate ABBYY results in OCR-D, you first need to convert ABBYY's ALTO output to PAGE. (You can also export AbbyyXml if your license permits it, but that is not covered here.)

You can use ocrd-fileformat-transform for this. It wraps ocr-transform.sh, which in turn includes the prima-page-converter.
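
As a minimal sketch, the conversion could be driven from Python as shown below. It assumes OCR-D and ocrd-fileformat-transform are installed, that the working directory is an OCR-D workspace (containing mets.xml), and that the fileGrp names FULLTEXT and PAGE match the example workflow on this page. The helper names `alto_to_page_command` and `convert` are hypothetical.

```python
import subprocess


def alto_to_page_command(input_grp, output_grp):
    """Build the ocrd-fileformat-transform call for ALTO -> PAGE.

    -I/-O are the standard OCR-D input/output fileGrp options;
    the "from-to" parameter value "alto page" is taken from this page.
    """
    return [
        "ocrd-fileformat-transform",
        "-I", input_grp,
        "-O", output_grp,
        "-P", "from-to", "alto page",
    ]


def convert(input_grp="FULLTEXT", output_grp="PAGE", workspace_dir="."):
    """Run the conversion inside an existing workspace (assumed layout)."""
    subprocess.run(
        alto_to_page_command(input_grp, output_grp),
        cwd=workspace_dir,
        check=True,  # raise if the processor exits with an error
    )
```

Calling `convert()` from a workspace directory would then be equivalent to running the processor by hand with the same options.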

Problem

However, the ALTO output of ABBYY has two problems:

1. It does not set /alto/Description/sourceImageInformation/fileName, so the converted PAGE-XML has no /PcGts/Page/@imageFilename, which makes it impossible to process with OCR-D.
2. It writes wrong coordinates into the Shape elements of blocks and lines. When ABBYY detects skew, it internally rotates the image, which enlarges the image in pixels. When it exports the detected segments, it has to transform the coordinates back, which it does for the angle but not for the extra offset. This affects only the coordinates in the Shape elements, not the bounding box attributes (@HPOS, @VPOS, @WIDTH, @HEIGHT).
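
To get a feeling for the size of the missing offset, the following sketch computes how much an image grows when it is rotated about its center onto a canvas large enough to hold it. That ABBYY rotates in exactly this way is an assumption made for illustration; the function name `expanded_canvas` is hypothetical.

```python
import math


def expanded_canvas(width, height, angle_deg):
    """Return (new_width, new_height, dx, dy) for a rotation about the center.

    dx and dy are the shifts of the original image origin inside the
    enlarged canvas, i.e. the kind of offset a back-transformation
    would have to subtract in addition to undoing the angle.
    """
    angle = math.radians(angle_deg)
    c, s = abs(math.cos(angle)), abs(math.sin(angle))
    new_width = width * c + height * s
    new_height = width * s + height * c
    return new_width, new_height, (new_width - width) / 2, (new_height - height) / 2


if __name__ == "__main__":
    # Example: a 2000x3000 pixel page with 2 degrees of skew.
    print(expanded_canvas(2000, 3000, 2.0))
```

Even for small skew angles the shift amounts to several dozen pixels on a typical page scan, which is enough to misplace lines visibly.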

Solution

Both problems can be compensated for with the following post-processing steps:

1. Align both fileGrps (the original images and the imported PAGE annotations) by their physical page IDs. (The PAGE files will have an empty @imageFilename.) For each pair, create an empty PAGE file for the image and add the segments from the existing PAGE file. This can be done with ocrd-segment-replace-page.
2. Remove all Shape elements. They are redundant anyway, because ABBYY does not produce polygons, only bounding boxes.
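
If xmlstarlet is not available, the removal of the Shape elements can be sketched with Python's standard library instead. This is only an illustration: the function name `strip_shapes` is hypothetical, elements are matched by local name so that any ALTO namespace version is accepted, and comments or the XML declaration of the input are not preserved.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def strip_shapes(alto_xml):
    """Remove every Shape element; return (new_xml, number_removed)."""
    root = ET.fromstring(alto_xml)
    if root.tag.startswith("{"):
        # Keep the ALTO namespace as the default one when serializing.
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])
    removed = 0
    # Snapshot the element list first, because the tree is modified below.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) == "Shape":
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed
```

The bounding box attributes (@HPOS, @VPOS, @WIDTH, @HEIGHT) stay untouched, so the converter can still derive the region and line coordinates from them.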

For example, when using the makefileization, the workflow could look like this:

# import from DFGViewer
FULLTEXT:
        ocrd workspace find -G FULLTEXT --download
        xmlstarlet ed --inplace -d //_:Shape FULLTEXT/* # fix 2
        ocrd workspace find -G ORIGINAL --download
        ocrd workspace prune-files # delete all other files (not downloaded)

# created by side effect above
ORIGINAL: ;

# convert
PAGE: FULLTEXT
PAGE: TOOL = ocrd-fileformat-transform
PAGE: PARAMS = "from-to": "alto page"

# fix 1
PAGE2: ORIGINAL PAGE
PAGE2: TOOL = ocrd-segment-replace-page
PAGE2: OPTIONS = -P transform_coordinates false
