introduce 2nd region level for paragraphs, table cells, footnote content #150

Open
bertsky opened this issue Apr 28, 2020 · 0 comments
bertsky commented Apr 28, 2020

The current OCR-D spec has a completely flat hierarchy of PAGE-XML segments.

However, there is a large demand for at least mildly recursive regions for:

  1. paragraphs inside text regions
  2. text regions comprising a drop-capital and a follow-up (connected) paragraph – concatenated without paragraph/line break
  3. cells inside tables – no other way to represent their text content
  4. text regions of any kind in footnotes

PAGE-XML of course defines all region types fully recursively, and designates @type="paragraph" etc.
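To make the nesting concrete, here is a minimal sketch that reads a hand-written, simplified PAGE-XML fragment (not taken from any GT file and not validated against the schema) and lists the second-level text regions with their `@type`. The sample ids, coordinates and the helper name `sub_regions` are assumptions for illustration only; the namespace is that of the 2019-07-15 PAGE schema and may need adjusting for other files.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2019-07-15 PAGE schema; adjust to the file at hand.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical sample: one text region holding a drop-capital and a paragraph.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,700 100,700"/>
      <TextRegion id="r1_dc" type="drop-capital">
        <Coords points="100,100 200,100 200,250 100,250"/>
      </TextRegion>
      <TextRegion id="r1_p1" type="paragraph">
        <Coords points="200,100 900,100 900,700 200,700"/>
      </TextRegion>
    </TextRegion>
  </Page>
</PcGts>"""


def sub_regions(page_xml):
    """Return (parent id, child id, child type) for each nested text region."""
    root = ET.fromstring(page_xml)
    result = []
    # First level: text regions directly below the page.
    for region in root.findall("pc:Page/pc:TextRegion", NS):
        # Second level: text regions directly below a first-level region.
        for child in region.findall("pc:TextRegion", NS):
            result.append((region.get("id"), child.get("id"), child.get("type")))
    return result


for parent_id, child_id, child_type in sub_regions(SAMPLE):
    print(parent_id, child_id, child_type)
```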

Also, at least with ocrd-tesserocr-segment-table, we already have an implementation for 3. But this area needs much (coordinated) work. A more evolved specification would surely help steer the way for further implementations.

I don't think we are entirely incompatible with a paragraph level. (Or shall we call it subtype level?) It would probably be just routine work: rewording a few formulations here, extending a few YAML enums there.

Our GT mostly already uses 2 levels for that – and rightly so, because this is the most versatile representation. It can still be reduced to a flat regime, but it can also be used for ANN segmentation training, for which a non-flat representation is the only way to cleanly separate visual from textual cues.
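As a rough illustration of the "can still be reduced to a flat regime" point, the sketch below collapses a 2-level region into a single flat one by moving the text lines of its sub-regions up into the parent, in order. The `Region` class and the `flatten` helper are hypothetical stand-ins, not OCR-D or PAGE API; this is just one possible reduction (another would be to promote the sub-regions to top level instead).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """Hypothetical stand-in for a PAGE region with optional sub-regions."""
    id: str
    type: str = ""
    lines: List[str] = field(default_factory=list)
    children: List["Region"] = field(default_factory=list)


def flatten(region):
    """Return a copy without sub-regions; their lines are moved up in order."""
    lines = list(region.lines)
    for child in region.children:
        lines.extend(flatten(child).lines)
    return Region(id=region.id, type=region.type, lines=lines)


# Two paragraphs inside one text region (made-up ids and line names).
nested = Region(
    id="r1",
    children=[
        Region(id="r1_p1", type="paragraph", lines=["l1", "l2"]),
        Region(id="r1_p2", type="paragraph", lines=["l3"]),
    ],
)
flat = flatten(nested)
print(flat.lines, flat.children)
```

Note that the paragraph boundaries are lost in the flat result, which is why the nested form is the more versatile starting point.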

So I propose allowing, as an opt-in, a mildly recursive region representation of 2 levels, with both a region level and an explicit paragraph / cell / drop-capital / subtype level in the functional model. This would raise to standard the current behaviour of ocrd-tesserocr-segment, which operates on 3 distinct output levels:

  1. block segmentation from page to regions (of any type),
  2. paragraph segmentation from text regions to paragraphs and from table regions to table cells (as a prerequisite for further representation),
  3. line segmentation from paragraphs to text lines.
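If the 2-level limit were adopted, a consumer could check it with something like the following sketch, which measures how deeply `*Region` elements are nested below `Page`. The function names and the sample documents are assumptions made up for this illustration; nothing here is part of the OCR-D spec or of any existing validator.

```python
import xml.etree.ElementTree as ET

NS_URI = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def region_depth(element):
    """Length of the deepest chain of *Region elements below `element`."""
    depths = [
        region_depth(child) + 1
        for child in element
        if child.tag.endswith("Region")  # TextRegion, TableRegion, ...
    ]
    return max(depths, default=0)


def within_two_levels(page_xml):
    """True if regions below Page are nested at most 2 levels deep."""
    page = ET.fromstring(page_xml).find("{%s}Page" % NS_URI)
    return region_depth(page) <= 2


def wrap(body):
    """Wrap a region fragment into a minimal (hypothetical) PAGE document."""
    return '<PcGts xmlns="%s"><Page>%s</Page></PcGts>' % (NS_URI, body)


flat_doc = wrap('<TextRegion id="r1"/>')
table_doc = wrap('<TableRegion id="t1"><TextRegion id="t1_c1"/></TableRegion>')
deep_doc = wrap(
    '<TextRegion id="a"><TextRegion id="b"><TextRegion id="c"/></TextRegion></TextRegion>'
)

for doc in (flat_doc, table_doc, deep_doc):
    print(within_two_levels(doc))
```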

Originally posted by @bertsky in #135 (comment)
