introduce 2nd region level for paragraphs, table cells, footnote content #150

bertsky · 2020-04-28T22:16:29Z

The current OCR-D spec has a completely flat hierarchy of PAGE-XML segments.

However, there is a large demand for at least mildly recursive regions for:

1. paragraphs inside text regions
2. text regions comprising a drop-capital and a follow-up (connected) paragraph, concatenated without a paragraph or line break
3. cells inside tables (there is no other way to represent their text content)
4. text regions of any kind in footnotes

PAGE-XML of course defines all region types fully recursively, and designates @type="paragraph" etc.
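To illustrate what such nesting looks like, here is a minimal sketch that parses a hand-written PAGE-XML fragment and lists every region sitting directly inside another region. The sample document, its ids and coordinates, and the helper name `nested_regions` are made up for illustration; only the namespace (2019-07-15 schema version) and the `@type` values come from PAGE-XML itself.

```python
import xml.etree.ElementTree as ET

# PAGE namespace of the 2019-07-15 schema version
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample: a text region holding a drop-capital and a paragraph,
# and a table region holding one cell as a nested text region.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,600 100,600"/>
      <TextRegion id="r1_p1" type="drop-capital">
        <Coords points="100,100 200,100 200,200 100,200"/>
      </TextRegion>
      <TextRegion id="r1_p2" type="paragraph">
        <Coords points="200,100 900,100 900,600 200,600"/>
      </TextRegion>
    </TextRegion>
    <TableRegion id="t1">
      <Coords points="100,700 900,700 900,1200 100,1200"/>
      <TextRegion id="t1_c1">
        <Coords points="100,700 500,700 500,900 100,900"/>
      </TextRegion>
    </TableRegion>
  </Page>
</PcGts>"""


def nested_regions(root):
    """Return (parent id, child id, child type) for each second-level region."""
    pairs = []
    page = root.find("pc:Page", NS)
    for parent in page:
        if not parent.tag.endswith("Region"):
            continue  # skip non-region children of Page
        for child in parent:
            if child.tag.endswith("Region"):
                pairs.append((parent.get("id"), child.get("id"), child.get("type")))
    return pairs


ROOT = ET.fromstring(SAMPLE)
for parent_id, child_id, child_type in nested_regions(ROOT):
    print(parent_id, child_id, child_type)
```

The sketch only looks one level below the top-level regions, which matches the two-level scope discussed here rather than the full recursion PAGE-XML permits.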

Also, at least with ocrd-tesserocr-segment-table, we already have an implementation for item 3 (table cells). But this area needs much coordinated work. A more evolved specification would surely help steer the way for further implementations.

I don't think we are entirely incompatible with a paragraph level. (Or shall we call it subtype level?) It would probably be just routine work on a few formulations here and YAML enums there.

Our GT mostly already uses 2 levels for that – and rightly so, because this is most versatile. (It can still be reduced to a flat regime, but can also be used for ANN segmentation training, for which a non-flat representation is the only way to cleanly separate visual from textual cues).
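As a rough sketch of the reduction to a flat regime, the following replaces every region that has sub-regions by those sub-regions, in order. This is only one possible reading of "flat" (the other would be to keep the parents and drop the second level), and the dict layout with an optional `"regions"` key is an assumption for illustration, not the OCR-D data model.

```python
def flatten(regions):
    """Reduce nested regions to a flat list by keeping only the leaves.

    Assumption: each region is a dict with an "id" and an optional
    "regions" list holding its sub-regions.
    """
    flat = []
    for region in regions:
        children = region.get("regions", [])
        if children:
            # Replace the parent by its sub-regions, preserving their order.
            flat.extend(flatten(children))
        else:
            flat.append(region)
    return flat


two_level = [
    {"id": "r1", "regions": [{"id": "r1_p1"}, {"id": "r1_p2"}]},
    {"id": "r2"},
]
flat_ids = [region["id"] for region in flatten(two_level)]
```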

So I propose allowing (as an opt-in) for a mildly recursive region representation of 2 levels, with both a region level and an explicit paragraph / cell / drop-capital / subtype level in the functional model. This would raise to standard the current behaviour of ocrd-tesserocr-segment, which operates on 3 distinct output levels:

1. block segmentation from page to regions (of any type),
2. paragraph segmentation from text regions to paragraphs and from table regions to table cells (as a prerequisite for further representation),
3. line segmentation from paragraphs to text lines.
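The three levels above could be modelled roughly as follows. This is a sketch under assumptions: the class and field names (`Region`, `kind`, `subregions`, `collect_lines`) are hypothetical and not taken from ocrd-tesserocr-segment or the OCR-D spec. The second level is opt-in, so lines may hang directly under a region or under one of its sub-regions, and anything nested deeper than two region levels is rejected.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextLine:
    id: str


@dataclass
class Region:
    id: str
    kind: str = "text"  # e.g. "text", "table", "paragraph", "cell"
    subregions: List["Region"] = field(default_factory=list)
    lines: List[TextLine] = field(default_factory=list)


def collect_lines(regions, level=1, max_levels=2):
    """Gather text lines from regions, allowing at most `max_levels` region levels."""
    lines = []
    for region in regions:
        if region.subregions:
            if level >= max_levels:
                raise ValueError(
                    f"region {region.id} nests deeper than {max_levels} levels"
                )
            lines.extend(collect_lines(region.subregions, level + 1, max_levels))
        lines.extend(region.lines)
    return lines


page = [
    Region("r1", subregions=[
        Region("r1_p1", kind="paragraph", lines=[TextLine("l1"), TextLine("l2")]),
    ]),
    Region("r2", lines=[TextLine("l3")]),  # flat region without a second level
]
line_ids = [line.id for line in collect_lines(page)]
```

A processor that only understands the flat regime would simply never populate `subregions`, which is what makes the second level an opt-in.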

Originally posted by @bertsky in #135 (comment)

bertsky added the enhancement label Apr 28, 2020

EEngl52 assigned cneud May 4, 2020

bertsky mentioned this issue May 20, 2020

support recursive regions OCR4all/LAREX#181

Open

bertsky mentioned this issue Sep 11, 2020

Line segmentation in tables OCR-D/ocrd_all#190

Closed
