The current OCR-D spec has a completely flat hierarchy of PAGE-XML segments.
However, there is a large demand for at least mildly recursive regions for:
1. paragraphs inside text regions
2. text regions comprising a drop-capital and a follow-up (connected) paragraph, concatenated without paragraph/line break
3. cells inside tables (there is no other way to represent their text content)
4. text regions of any kind in footnotes
PAGE-XML of course defines all region types fully recursively, and designates `@type="paragraph"` etc.

Also, at least with `ocrd-tesserocr-segment-table`, we already have an implementation for 3. But this area needs much (coordinated) work. A more evolved specification would surely help steer the way for further implementations.

I don't think we are entirely incompatible with a `paragraph` level. (Or shall we call it `subtype` level?) It would probably be just routine work on a few formulations here and yaml enums there.
Our GT mostly already uses 2 levels for that – and rightly so, because this is most versatile. (It can still be reduced to a flat regime, but can also be used for ANN segmentation training, for which a non-flat representation is the only way to cleanly separate visual from textual cues).
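To illustrate the point that a 2-level representation can still be reduced to a flat regime, here is a minimal sketch in Python using only the standard library. The sample document, the region ids and the helper name `flatten_regions` are made up for illustration, and the snippet is simplified (no `Coords`, not validated against the schema). It assumes the 2019-07-15 PAGE namespace and at most one level of nesting below the top-level text regions.

```python
import xml.etree.ElementTree as ET

# Namespace of the PAGE 2019-07-15 schema (assumed here; adjust for other versions).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical, simplified sample: one region with a drop-capital and two
# paragraphs as sub-regions, plus one flat footnote region.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="1000" imageHeight="1000">
    <TextRegion id="r1">
      <TextRegion id="r1_dc" type="drop-capital"/>
      <TextRegion id="r1_p1" type="paragraph"/>
      <TextRegion id="r1_p2" type="paragraph"/>
    </TextRegion>
    <TextRegion id="r2" type="footnote"/>
  </Page>
</PcGts>"""


def flatten_regions(page):
    """Return (id, type) pairs of the innermost text regions in document order."""
    flat = []
    for region in page.findall("pc:TextRegion", NS):
        children = region.findall("pc:TextRegion", NS)
        # A region with sub-regions is replaced by them; a flat region is kept.
        for leaf in (children or [region]):
            flat.append((leaf.get("id"), leaf.get("type")))
    return flat


page = ET.fromstring(SAMPLE).find("pc:Page", NS)
flat = flatten_regions(page)
print(flat)
```

Going the other way (recovering the grouping from a flat list) is not possible without extra information, which is one way to read the argument for keeping both levels in the data.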
So I propose allowing (as an opt-in) a mildly recursive region representation of 2 levels, with both a `region` level and an explicit `paragraph` / `cell` / `drop-capital` / `subtype` level in the functional model. This would raise to standard the current behaviour of `ocrd-tesserocr-segment`, which operates on 3 distinct output levels:
1. block segmentation from page to regions (of any type),
2. paragraph segmentation from text regions to paragraphs and from table regions to table cells (as a prerequisite for further representation),
Originally posted by @bertsky in #135 (comment)